ncborcherding / scRepertoire

A toolkit for single-cell immune profiling
https://www.borch.dev/uploads/screpertoire/
MIT License

combineTCR v/d/j/c_gene #355

Closed jakob-arnold closed 7 months ago

jakob-arnold commented 7 months ago

Hi,

I'm using scRepertoire v2.0.0 and combineTCR(). With this package version, the output has the gene columns "v_gene", "d_gene" and "j_gene" and the "chain" column. However, I think there should be two columns for each of those, right? "v_gene_1", "v_gene_2", etc. In my specific case, I have sorted gd T cells, and after running combineTCR() all "v_gene" columns have "TRDV..." values. For downstream analyses it may also be interesting to have the "TRGV..." information for each cell in a separate column.

ncborcherding commented 7 months ago

Hey Jakob,

Thanks for reaching out - would you mind providing a little more information? What step are you at?

combineTCR() is taking the different TCR sequences and associating them with a single barcode/cell. The output of combineTCR() will put together the TRG and TRD chains into a clone (separated by an "_") and should not have these columns: "v_gene", "d_gene" and "j_gene", "chain"

Thanks, Nick

jakob-arnold commented 7 months ago

Hi Nick,

For context: I have a single clones.tsv file, which I obtained from the MiXCR pipeline. I ran the following two commands:

contig <- loadContigs(input = "./", format = "MiXCR")
combined <- combineTCR(contig, filterMulti = TRUE, removeNA = TRUE)

And then colnames(combined$S1) gives me:

 [1] "barcode"  "chain"    "reads"    "v_gene"   "d_gene"   "j_gene"   "c_gene"   "cdr3_nt"  "cdr3"
[10] "TCR1"     "cdr3_aa1" "cdr3_nt1" "TCR2"     "cdr3_aa2" "cdr3_nt2" "CTgene"   "CTnt"     "CTaa"
[19] "CTstrict"

I think it makes sense to have "v_gene" etc. in separate columns after combining TCR sequences, as that may be needed for some downstream applications.

ncborcherding commented 7 months ago

Hey Jakob,

You are completely correct - this is unintentional and should have been dropped.

Here is the code I used to make a reproducible example:

MIXCR <- read.csv("https://www.borch.dev/uploads/contigs/MIXCR_contigs.csv")
contigs <- loadContigs(MIXCR, format = "MiXCR")
combined <- combineTCR(contigs)

Downstream of combineTCR(), scRepertoire only uses "barcode", "CTgene", "CTnt", "CTaa" and "CTstrict", so the appearance of "chain" "reads" "v_gene" "d_gene" "j_gene" "c_gene" "cdr3_nt" "cdr3" will not affect the analysis. I will work on pushing an update as soon as I can.
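
Until a fix is released, the leftover columns could be dropped by hand. The following is only a minimal base R sketch, not the package's own fix: the `combined` list here is a small mock stand-in for the combineTCR() output with invented values, and `leftover` and `combined_clean` are hypothetical names.

```r
# Mock stand-in for the list returned by combineTCR(); all values are invented
# for illustration and only a subset of the real columns is shown.
combined <- list(
  S1 = data.frame(
    barcode = c("S1_AAAC", "S1_AAAG"),
    chain   = c("TRD", "TRD"),
    reads   = c(10, 12),
    v_gene  = c("TRDV2", "TRDV1"),
    d_gene  = c("TRDD3", "TRDD2"),
    j_gene  = c("TRDJ1", "TRDJ1"),
    c_gene  = c("TRDC", "TRDC"),
    cdr3_nt = c("TGTGCC", "TGTGCT"),
    cdr3    = c("CA", "CA"),
    TCR1    = c("TRGV9.TRGJP.TRGC1", "TRGV4.TRGJ1.TRGC1"),
    TCR2    = c("TRDV2.TRDD3.TRDJ1.TRDC", "TRDV1.TRDD2.TRDJ1.TRDC"),
    stringsAsFactors = FALSE
  )
)

# Columns named in this thread as unintended carry-overs from the input file.
leftover <- c("chain", "reads", "v_gene", "d_gene",
              "j_gene", "c_gene", "cdr3_nt", "cdr3")

# Keep every column that is not in `leftover`, for each sample in the list.
combined_clean <- lapply(combined, function(df) {
  df[, setdiff(colnames(df), leftover), drop = FALSE]
})

colnames(combined_clean$S1)
```

Using setdiff() rather than a fixed list of columns to keep means that any column not named in `leftover` (for example "TCR1" and "TCR2") passes through untouched.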

Nick

jakob-arnold commented 7 months ago

Hi Nick,

Thank you so much for the quick response.

Even though this information isn't necessary in the "normal" scRepertoire workflow, I feel like some users may still need it for custom applications. Just as an example: if I'm combining the TCR data with the corresponding SeuratObject, I may want to highlight in a DimPlot how certain chains are distributed across clusters. In my specific case, the distribution of delta and gamma chains (TRDV1, TRGV1, ...) is of considerable biological significance. But I'm just suggesting things here; maybe there is a better way to achieve that :)

Thanks, Jakob

ncborcherding commented 7 months ago

Hey Jakob,

Apologies for the confusion - the information is stored in "TCR1" "cdr3_aa1" "cdr3_nt1" "TCR2" "cdr3_aa2" "cdr3_nt2", whereas "chain" "reads" "v_gene" "d_gene" "j_gene" "c_gene" "cdr3_nt" "cdr3" are accidental remnants from the original file.
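
As a rough sketch of how per-chain V genes might be pulled out of those columns for custom plots: this assumes "TCR1" and "TCR2" hold the genes of one chain joined by ".", with the V gene first, which should be checked against your own output. The data frame below is mock data with invented values, and `first_gene`, `v_gene_1` and `v_gene_2` are hypothetical names, not scRepertoire functions or columns.

```r
# Mock rows standing in for one sample of combineTCR() output.
# Assumption: each TCR string is "."-separated with the V gene first.
combined_s1 <- data.frame(
  barcode = c("S1_AAAC", "S1_AAAG", "S1_AAAT"),
  TCR1    = c("TRGV9.TRGJP.TRGC1", "TRGV4.TRGJ1.TRGC1", NA),
  TCR2    = c("TRDV2.TRDD3.TRDJ1.TRDC", "TRDV1.TRDD2.TRDJ1.TRDC",
              "TRDV1.TRDD3.TRDJ1.TRDC"),
  stringsAsFactors = FALSE
)

# Keep everything before the first "."; NA inputs stay NA.
first_gene <- function(x) sub("\\..*$", "", x)

combined_s1$v_gene_1 <- first_gene(combined_s1$TCR1)
combined_s1$v_gene_2 <- first_gene(combined_s1$TCR2)

combined_s1[, c("barcode", "v_gene_1", "v_gene_2")]
```

Cells with more than one chain per slot may need extra handling before splitting, depending on how they were filtered.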

Nick

ncborcherding commented 7 months ago

Got a tentative fix pushed to the dev branch - will still need to test it before it goes live. Thanks again for finding this issue!