ncborcherding / scRepertoire

A toolkit for single-cell immune profiling
https://www.borch.dev/uploads/screpertoire/
MIT License
311 stars 54 forks source link

Issue Merging Seurat Data with the Combined TCR Object #358

Closed ghoeltzel38 closed 7 months ago

ghoeltzel38 commented 7 months ago

Hey thanks so much for designing this package for use. Greatly useful and I'm very appreciative.

I have a seurat object that ive annotated as specific CD4 subsets. I have the VDJ data for all of the cells that were originally present in the sequencing run. My first attempt was to filter the VDJ based on what barcodes I had in my seurat object with some code similar to following but for many different patients (12 in total with 2 sample runs per patient):

Patient 1

P1_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC1_biopsy_CD45p1.csv")) P1_T1_filt = P1_T1 %>% filter(barcode %in% cd4s_pimmune_3_TCR$TCR_X.cell) dim(P1_T1_filt) P1_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC1_biopsy_CD45p2.csv")) P1_T2_filt = P1_T2 %>% filter(barcode %in% cd4s_pimmune_3_TCR$TCR_X.cell)

cd4s_pimmune_TCR_combined = combineExpression(combined_orig_ident_filt, cd4s_pimmune_3_TCR, 
                                          cloneCall = "aa", 
                                          group.by = "sample",
                                          proportion = FALSE,
                                          cloneTypes = c(Single = 1, Small = 5, Medium = 20, Large = 100, Hyperexpanded = 500)
                                          )

when i call the combineExpression function to try to join the two, i get an error that not all cells are in the seurat object. but surely there are situations where individuals have QCed some Tcells out of their analysis and so the dimensions do not perfectly match? I am unable to get beyond this step, have been spending many hours trying to perfectly align the barcodes between the two objects. Also - i am confused as to how this function knows how to map the TCR data to the seurat meta data since group.by in this example is specified at the sample rather than at the cell level?

any insight would be greatly appreciated.

ncborcherding commented 7 months ago

Hey thanks for reaching out.

I have a seurat object that ive annotated as specific CD4 subsets. I have the VDJ data for all of the cells that were originally present in the sequencing run. My first attempt was to filter the VDJ based on what barcodes I had in my seurat object with some code similar to following but for many different patients (12 in total with 2 sample runs per patient)

You should be able just to load all the contig information, use combineTCR(), then combineExpression() on your filtered object - cd4s_pimmune_3_TCR.

P1_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC1_biopsy_CD45p1.csv"))
P1_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC1_biopsy_CD45p2.csv"))
contig.list <- list(P1_T1, P1_T2, ...)

combined.TCR <- combineTCR(contig.list, samples = XXX)

cd4s_pimmune_3_TCR <- combineExpression(combined_orig_ident_filt, cd4s_pimmune_3_TCR, 
                                        cloneCall = "aa", 
                                        group.by = "sample",
                                        proportion = FALSE,
                                        cloneTypes = c(Single = 1, Small = 5, Medium = 20, Large = 100, Hyperexpanded = 500)
                                                )

when i call the combineExpression function to try to join the two, i get an error that not all cells are in the seurat object.

What is the exact error you are getting?

Also what is the output of sessionInfo()?

combineExpression() should filter the combined contigs using the barcode in the single-cell object before adding the clonal information to the meta data.

Also - i am confused as to how this function knows how to map the TCR data to the seurat meta data since group.by in this example is specified at the sample rather than at the cell level?

group.by is just modulating the calculation of clonalFrequency, clonalProportion, and clonalSize. The attachment of the data are the exact same whether group.by = X or NULL.

Thanks, Nick

ghoeltzel38 commented 7 months ago

Hey Nick thanks so much for your response!

This was my general approach (the rownames from the seurat object did not match that of the VDJ data 'barcode' column so i just tried to wrangle/clean the data to accomplish that assuming it was necessary to properly merge the data. To do that I pasted some other columns together to recreate the barcode present in the VDJ filtered contig file.

cd4s_pimmune_3 = readRDS("cd4s_pimmune_3.rds")

include only CD4s from the object that have matching barcodes with the TCR data

cd4s_pimmune_3$barcode_TCR = paste0(cd4s_pimmune3$orig.ident, "", cd4s_pimmune_3$TCR_X.cell) #try to match barcodes to that created from the TCR object to_exclude_NA = rownames(cd4s_pimmune_3@meta.data[grep("NA", cd4s_pimmune_3$barcode_TCR),]) to_exclude_hyphen = rownames(cd4s_pimmune_3@meta.data[grep("-$", cd4s_pimmune_3$barcode_TCR), ]) total_exclude = c(to_exclude_NA, to_exclude_hyphen) length(total_exclude) to_include = setdiff(rownames(cd4s_pimmune_3@meta.data), total_exclude) cd4s_pimmune_3_TCR = subset(cd4s_pimmune_3, cells = to_include)

load the TCR data

common_path = "/gpfs3/well/immune-rep/users/pri925/Gerard/Phase_2_Work/Data/PancrImmune_Only_Raw/PancrImmune_VDJ_Data/Raw_TCR_Files/"

Patient 1

P1_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC1_biopsy_CD45p1.csv"))

P1_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC1_biopsy_CD45p2.csv"))

Patient 2

P2_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC2_biopsy_CD45p1.csv"))

P2_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_02_PDAC2_biopsy_CD45p2.csv"))

Patient 3

P3_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC3_biopsy_CD45p1.csv"))

P3_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_02_PDAC3_biopsy_CD45p2.csv"))

Patient 4

P4_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC4_biopsy_CD45p1.csv"))

Patient 5

P5_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC5_biopsy_CD45p1.csv"))

Patient 6

P6_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC6_biopsy_CD45p1.csv"))

P6_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_02_PDAC6_biopsy_CD45p2.csv"))

Patient 7

P7_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_04_PDAC7_biopsy_CD45p1.csv"))

P7_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_05_PDAC7_biopsy_CD45p2.csv"))

Patient 8

P8_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC8_biopsy_CD45p1.csv"))

P8_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_02_PDAC8_biopsy_CD45p2.csv"))

Patient 9

P9_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC9_biopsy_CD45p1.csv"))

P9_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_02_PDAC9_biopsy_CD45p2.csv"))

Patient 10

P10_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC10_biopsy_CD45p1.csv"))

P10_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_02_PDAC10_biopsy_CD45p2.csv"))

Patient 11

P11_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC11_biopsy_CD45p1.csv"))

P11_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_02_PDAC11_biopsy_CD45p2.csv"))

Patient 12

P12_T1 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_01_PDAC12_biopsy_CD45p1.csv"))

P12_T2 = read_csv(paste0(common_path, "filtered_contig_annotations_TCR_02_PDAC12_biopsy_CD45p2.csv"))

contig_list1 = list(P1_T1, P1_T2, P2_T1, P2_T2, P3_T1, P3_T2, P4_T1, P5_T1, P6_T1, P6_T2, P7_T1, P7_T2, P8_T1, P8_T2, P9_T1, P9_T2, P10_T1, P10_T2, P11_T1, P11_T2, P12_T1, P12_T2)

combined_contig = combineTCR(contig_list1, samples = c("01_PDAC1_biopsy_CD45p1", "01_PDAC1_biopsy_CD45p2", "01_PDAC2_biopsy_CD45p1", "02_PDAC2_biopsy_CD45p2", "01_PDAC3_biopsy_CD45p1", "02_PDAC3_biopsy_CD45p2", "01_PDAC4_biopsy_CD45p1", "01_PDAC5_biopsy_CD45p1", "01_PDAC6_biopsy_CD45p1", "02_PDAC6_biopsy_CD45p2", "04_PDAC7_biopsy_CD45p1", "05_PDAC7_biopsy_CD45p2", "01_PDAC8_biopsy_CD45p1", "02_PDAC8_biopsy_CD45p2", "01_PDAC9_biopsy_CD45p1", "02_PDAC9_biopsy_CD45p2", "01_PDAC10_biopsy_CD45p1", "02_PDAC10_biopsy_CD45p2", "01_PDAC11_biopsy_CD45p1", "02_PDAC11_biopsy_CD45p2", "01_PDAC12_biopsy_CD45p1", "02_PDAC12_biopsy_CD45p2"))

cd4s_pimmune_3_TCR <- combineExpression(combined_contig, cd4s_pimmune_3_TCR, cloneCall = "aa", group.by = "sample", proportion = FALSE, cloneTypes = c(Single = 1, Small = 5, Medium = 20, Large = 100, Hyperexpanded = 500))

Error in $<-.data.frame(*tmp*, "cloneType", value = NA) : replacement has 1 row, data has 0

above is the current error i was getting. i then tried the filtering process and was getting the error that not all of the cells matched the cells in seurat object.

output of sessionInfo():

sessionInfo() R version 4.2.1 (2022-06-23) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /gpfs3/apps/eb/2020b/skylake/software/FlexiBLAS/3.2.0-GCC-11.3.0/lib64/libflexiblas.so.3.2

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] scRepertoire_1.8.0 EnhancedVolcano_1.16.0 Hmisc_5.1-2 ggrepel_0.9.5 DESeq2_1.38.3
[6] SummarizedExperiment_1.28.0 Biobase_2.58.0 MatrixGenerics_1.10.0 GenomicRanges_1.50.2 GenomeInfoDb_1.34.9
[11] IRanges_2.32.0 S4Vectors_0.36.2 BiocGenerics_0.44.0 lubridate_1.9.3 forcats_1.0.0
[16] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
[21] tibble_3.2.1 ggplot2_3.5.0 tidyverse_2.0.0 Seurat_5.0.3 SeuratObject_5.0.1
[26] sp_2.1-3 matrixStats_1.3.0 rlang_1.1.3

loaded via a namespace (and not attached): [1] utf8_1.2.4 spatstat.explore_3.2-7 reticulate_1.36.0 tidyselect_1.2.1 RSQLite_2.3.6
[6] AnnotationDbi_1.60.2 htmlwidgets_1.6.4 grid_4.2.1 BiocParallel_1.32.6 Rtsne_0.17
[11] munsell_0.5.1 codetools_0.2-18 ica_1.0-3 future_1.33.2 miniUI_0.1.1.1
[16] withr_3.0.0 spatstat.random_3.2-3 colorspace_2.1-0 progressr_0.14.0 ggalluvial_0.12.5
[21] knitr_1.46 rstudioapi_0.16.0 SingleCellExperiment_1.20.1 ROCR_1.0-11 tensor_1.5
[26] listenv_0.9.1 GenomeInfoDbData_1.2.9 polyclip_1.10-6 farver_2.1.1 bit64_4.0.5
[31] parallelly_1.37.1 vctrs_0.6.5 generics_0.1.3 xfun_0.43 timechange_0.3.0
[36] R6_2.5.1 doParallel_1.0.17 graphlayouts_1.1.1 VGAM_1.1-6 locfit_1.5-9.5
[41] bitops_1.0-7 spatstat.utils_3.0-4 cachem_1.0.8 DelayedArray_0.24.0 vroom_1.6.5
[46] promises_1.3.0 scales_1.3.0 ggraph_2.2.1 nnet_7.3-17 gtable_0.3.4
[51] globals_0.16.3 goftest_1.2-3 spam_2.10-0 tidygraph_1.2.1 splines_4.2.1
[56] lazyeval_0.2.2 spatstat.geom_3.2-9 checkmate_2.1.0 reshape2_1.4.4 abind_1.4-5
[61] backports_1.4.1 httpuv_1.6.15 tools_4.2.1 cubature_2.0.4.4 RColorBrewer_1.1-3
[66] ggridges_0.5.6 Rcpp_1.0.12 plyr_1.8.9 base64enc_0.1-3 zlibbioc_1.44.0
[71] RCurl_1.98-1.14 rpart_4.1.16 deldir_2.0-4 viridis_0.6.2 pbapply_1.7-2
[76] cowplot_1.1.3 zoo_1.8-12 cluster_2.1.3 magrittr_2.0.3 data.table_1.15.4
[81] RSpectra_0.16-1 scattermore_1.2 SparseM_1.81 lmtest_0.9-40 RANN_2.6.1
[86] truncdist_1.0-2 fitdistrplus_1.1-11 gsl_2.1-7.1 hms_1.1.3 patchwork_1.2.0
[91] mime_0.12 evaluate_0.23 xtable_1.8-4 XML_3.99-0.16.1 fastDummies_1.7.3
[96] gridExtra_2.3 compiler_4.2.1 KernSmooth_2.23-20 crayon_1.5.2 htmltools_0.5.8.1
[101] mgcv_1.8-40 later_1.3.2 tzdb_0.4.0 Formula_1.2-4 powerTCR_1.18.0
[106] geneplotter_1.76.0 DBI_1.2.2 tweenr_1.0.2 MASS_7.3-57 Matrix_1.6-5
[111] permute_0.9-7 cli_3.6.2 evd_2.3-6 parallel_4.2.1 dotCall64_1.1-1
[116] igraph_2.0.3 pkgconfig_2.0.3 foreign_0.8-82 plotly_4.10.4 spatstat.sparse_3.0-3
[121] foreach_1.5.2 annotate_1.76.0 stringdist_0.9.8 XVector_0.38.0 digest_0.6.35
[126] sctransform_0.4.1 RcppAnnoy_0.0.22 vegan_2.6-2 spatstat.data_3.0-4 Biostrings_2.66.0
[131] rmarkdown_2.26 leiden_0.4.3.1 htmlTable_2.4.0 uwot_0.2.1 evmix_2.12
[136] shiny_1.8.1.1 lifecycle_1.0.4 nlme_3.1-158 jsonlite_1.8.8 viridisLite_0.4.2
[141] fansi_1.0.6 pillar_1.9.0 lattice_0.20-45 KEGGREST_1.38.0 fastmap_1.1.1
[146] httr_1.4.7 survival_3.3-1 glue_1.7.0 iterators_1.0.14 png_0.1-8
[151] bit_4.0.5 ggforce_0.3.3 stringi_1.8.3 blob_1.2.4 RcppHNSW_0.6.0
[156] memoise_2.0.1 irlba_2.3.5.1 future.apply_1.11.2

ghoeltzel38 commented 7 months ago

after getting the first error, i assumed part of the issue was that the rownames of my meta data were not the same as the barcode column of the TCR dataframe, so i recreated a matching barcode using the paste0 as shown above, then did the following which is where the 'all cells' errors shown below came into play:

rownames(cd4s_pimmune_3_TCR@meta.data) = cd4s_pimmune_3_TCR$barcode_TCR cd4s_pimmune_3_TCR <- combineExpression(combined_contig, cd4s_pimmune_3_TCR, cloneCall = "aa", group.by = "sample", proportion = FALSE, cloneTypes = c(Single = 1, Small = 5, Medium = 20, Large = 100, Hyperexpanded = 500))

Error in validObject(object = x) : invalid class “Seurat” object: 1: all cells in assays must be present in the Seurat object invalid class “Seurat” object: 2: all cells in assays must be present in the Seurat object invalid class “Seurat” object: 3: All cells in reductions must be present in the Seurat object invalid class “Seurat” object: 4: All cells in reductions must be present in the Seurat object invalid class “Seurat” object: 5: All cells in reductions must be present in the Seurat object invalid class “Seurat” object: 6: all cells in graphs must be present in the Seurat object invalid class “Seurat” object: 7: all cells in graphs must be present in the Seurat object invalid class “Seurat” object: 8: 'active.idents' must be named with cell names

ncborcherding commented 7 months ago

I think what is likely happening here is there is a version issue between scRepertoire v1.8 (built before v5 of Seurat) and the newest Seurat you have installed. They are likely not compatible. I would suggest updating to scRepertoire to v2 using:

remotes::install_github("ncborcherding/scRepertoire")

Nick

ghoeltzel38 commented 7 months ago

Hey Nick,

Thanks so much. I just updated but still getting the same error regarding 'all cells in assays must be present in the seurat object' . Sorry for the bother - trying to sort something for my PI by tomorrow AM and clearly struggling lol

cd4s_pimmune_3_TCR <- combineExpression(combined_contig, cd4s_pimmune_3_TCR, cloneCall = "aa",
group.by = "sample", proportion = FALSE, cloneSize = c(Single = 1, Small = 5, Medium = 20, Large = 100, Hyperexpanded = 500))

Error in validObject(object = x) : invalid class “Seurat” object: 1: all cells in assays must be present in the Seurat object invalid class “Seurat” object: 2: all cells in assays must be present in the Seurat object invalid class “Seurat” object: 3: All cells in reductions must be present in the Seurat object invalid class “Seurat” object: 4: All cells in reductions must be present in the Seurat object invalid class “Seurat” object: 5: All cells in reductions must be present in the Seurat object invalid class “Seurat” object: 6: all cells in graphs must be present in the Seurat object invalid class “Seurat” object: 7: all cells in graphs must be present in the Seurat object invalid class “Seurat” object: 8: 'active.idents' must be named with cell names

ncborcherding commented 7 months ago

Hey no problem - totally understand. This is a completely new error for me, so bear with me.

I see this issue here with the same error message: https://github.com/satijalab/seurat/issues/8297

Is your data a list or not a fully integrated/merged object? I do not think combineExpression() will work in that setting.

ghoeltzel38 commented 7 months ago

this is the sturcture of the object. I did merge and integrate it prior to trying to use screpertoire. perhaps need to revert back to an earlier version of seurat? Thanks for your timely responses - means a lot.

I do have an ADT assay in my seurat object as well, going to try to follow a similar approach to what was suggested in your link and see if i have any success.

image image
ncborcherding commented 7 months ago

That looks integrated -

Very strange because scRepertoire detects the input object (Seurat vs Single-cell experiment) and then appends the clonal information to the meta data. There is no interaction with assays, reductions, or graphs slots in a Seurat object.

Are you able to do your downstream analysis with Seurat with cd4s_pimmune_3_TCR? Like differential expression and other plotting functions? It might indicate that there is an issue with the Seurat object?

Happy to take a look a closer look if you are under time pressure (totally understand if you want to keep the data private) - but if you share with me the combined data and the seurat object, debugging would go fast I think. You can email me?

Nick

ncborcherding commented 7 months ago

Hey Gerard,

I will close this issue because I think we found the underlying problem - barcode mismatch. But I wanted to leave an explanation if other users find this.

When using Seurat for multiple sequencing runs, barcodes are automatically appended with a "_X" to prevent issues with duplicate barcodes across sequencing runs. This creates a barcode mismatch for the contig files and subsequent outputs of combineTCR() and combineBCR().

original: ACGTACGTACGTACGT-1
seurat-modified: ACGTACGTACGTACGT-1_1

Probably the easiest way to approach this mismatch is to modify the contig file or the outputs of combineTCR() and combineBCR(). There is some code for this on the scRepertoire FAQ. But also here is a loop that would do something similar across multiple runs.

for(i in seq_along(contig_list)) {
  contig_list[[i]]$barcode <- paste0(contig_list[[i]]$barcode, "_", i)
}

Hope that helps, Nick