liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

Clonotype definition #9

Closed toddey closed 4 years ago

toddey commented 4 years ago

Hi! I have a question about clonotype identification. I found this definition of clonotype in a paper: "A TCR clonotype is a unique nucleotide sequence that arises during the gene rearrangement process for that receptor." So I can directly use the CDR3 DNA sequences called by TRUST4 to define clonotypes, right? I also found that TCR clone sequences have been presented in publications in many different forms. Which forms do you usually use? Thank you in advance for your reply.

mourisl commented 4 years ago

We mostly use the CDR3 DNA sequence as the clonotype, which is what you did. This is purely from a computational perspective. Biologically, it may make more sense to use the amino acid sequence as the clonotype. Furthermore, if the data has good coverage of T cells, you may use the V gene ID, J gene ID and CDR3 together as the clonotype.
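The three clonotype definitions mentioned in this comment can be sketched in Python as follows. This is only an illustration, not TRUST4 code: the `Cdr3Record` type, its field names, and the `level` labels are hypothetical, the example records are invented, and using the nucleotide CDR3 in the V/J/CDR3 key is an assumption (the comment does not say which form of the CDR3 is meant).

```python
from collections import Counter
from typing import NamedTuple


class Cdr3Record(NamedTuple):
    """One assembled CDR3 (hypothetical structure, not TRUST4's own format)."""
    v_gene: str
    j_gene: str
    cdr3_nt: str
    cdr3_aa: str
    count: int


def clonotype_key(rec, level="nt"):
    """Return the key that identifies a record's clonotype at the given level."""
    if level == "nt":      # computational view: unique CDR3 nucleotide sequence
        return rec.cdr3_nt
    if level == "aa":      # biological view: CDR3 amino acid sequence
        return rec.cdr3_aa
    if level == "vj_nt":   # stricter: V gene, J gene and CDR3 (nucleotide assumed)
        return (rec.v_gene, rec.j_gene, rec.cdr3_nt)
    raise ValueError(f"unknown level: {level}")


def clonotype_counts(records, level="nt"):
    """Sum the read counts of all records that share a clonotype key."""
    totals = Counter()
    for rec in records:
        totals[clonotype_key(rec, level)] += rec.count
    return totals


# Toy records, invented for illustration only.
records = [
    Cdr3Record("TRBV9", "TRBJ2-1", "TGTGCCAGCAGCGTAGAGCAGTTC", "CASSVEQF", 5),
    Cdr3Record("TRBV9", "TRBJ2-1", "TGTGCCAGCAGCGTTGAGCAGTTC", "CASSVEQF", 3),
    Cdr3Record("TRBV7-2", "TRBJ2-1", "TGTGCCAGCAGCGTAGAGCAGTTC", "CASSVEQF", 2),
]

for level in ("nt", "aa", "vj_nt"):
    print(level, len(clonotype_counts(records, level)))
```

On these toy records the same data yields a different number of clonotypes at each level, which is why the choice of definition should follow the application.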

mourisl commented 4 years ago

By default, TRUST4 also reports partial CDR3s. You may want to filter those out of the report file (just grep -v partial) before computing clonotypes.
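For anyone post-processing the report in Python instead of the shell, a minimal equivalent of `grep -v partial` might look like the sketch below. The function names are hypothetical, and the toy lines are invented and do not reproduce the real report layout. Like `grep -v`, it drops any line containing the substring `partial`, wherever it appears in the line.

```python
def drop_partial(lines):
    """Yield only the lines that do not contain the substring 'partial'."""
    for line in lines:
        if "partial" not in line:
            yield line


def filter_report(in_path, out_path):
    """Copy a report file, leaving out every line that mentions 'partial'."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(drop_partial(src))


# Toy lines, invented for illustration only (not the real report layout).
toy_lines = [
    "10\tTGTGCCAGCAGCGTAGAGCAGTTC\tCASSVEQF\n",
    "4\tTGTGCCAGCAGC\tpartial\n",
    "2\tTGTGCCAGCAGCGTTGAGCAGTTC\tCASSVEQF\n",
]
kept = list(drop_partial(toy_lines))
print(len(kept))
```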

toddey commented 4 years ago

The out_of_frame CDR3s should also be removed, right?

mourisl commented 4 years ago

That's a bit different. The out_of_frame CDR3s and the CDR3s with a stop codon (symbol "_") are still nucleotide sequences that arise during the gene rearrangement process for the receptor. I usually still include them as clonotypes, but this depends on the application. If you are more interested in the protein level, you should definitely remove them, since these CDR3s cannot be translated into real receptors.
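A protein-level filter along these lines could be sketched in Python as below. It assumes that out-of-frame CDR3s carry the label `out_of_frame` in place of an amino acid sequence and that a stop codon appears as `_` inside the amino acid string. Both markers should be checked against your own report. The function name and the example strings are hypothetical.

```python
def is_productive(cdr3_aa, stop_symbol="_"):
    """Return True if a CDR3 amino acid string looks translatable.

    Assumptions (verify against your own report): out-of-frame CDR3s are
    labeled 'out_of_frame', and stop codons are written as `stop_symbol`.
    """
    if cdr3_aa == "out_of_frame":
        return False
    if stop_symbol in cdr3_aa:
        return False
    return True


# Toy amino acid strings, invented for illustration only.
examples = ["CASSVEQF", "out_of_frame", "CASS_EQF"]
productive = [aa for aa in examples if is_productive(aa)]
print(productive)
```

For nucleotide-level clonotypes the filter would simply be skipped, as described in the comment above.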

toddey commented 4 years ago

I am a little puzzled about partial CDR3s. Are they also naturally occurring, or just the result of incomplete sequencing? I also saw the 2019 Nature Genetics paper that applied TRUST3.0, "Landscape of B cell immunity and related immune evasion in human cancers". In that paper, your pipeline for identifying B cell clusters is:

  1. Extract all the unique complete CDR3 sequences.
  2. For each sequence, extract an octamer starting from the 1st position in the CDR3 as a motif.
  3. For each unique motif, collect all the CDR3 aa sequences containing the motif. These sequences constitute a B cell cluster.

In this workflow, do you use the CDR3 octamer as the clonotype?
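The three steps as summarized here can be sketched in Python as follows. This follows only the summary in this comment, not the paper's actual code. The function name is hypothetical, the sequences are invented, "1st position" is read as the first residue of the CDR3, and skipping sequences shorter than eight residues is an assumption.

```python
def octamer_clusters(cdr3_aa_seqs, k=8):
    """Group CDR3 amino acid sequences by shared leading k-mer motifs.

    Step 1: keep the unique sequences.
    Step 2: take the first k residues of each sequence as a motif
            (sequences shorter than k are skipped, an assumption).
    Step 3: for each unique motif, collect every sequence containing it.
    """
    unique = sorted(set(cdr3_aa_seqs))
    motifs = sorted({seq[:k] for seq in unique if len(seq) >= k})
    return {motif: [seq for seq in unique if motif in seq] for motif in motifs}


# Toy sequences, invented for illustration only.
toy = [
    "CARDYYGSGSYFDY",
    "CARDYYGSAAAFDY",
    "CAKCARDYYGSW",
    "CARDYYGSGSYFDY",  # duplicate, removed in step 1
    "CAR",             # shorter than an octamer, contributes no motif
]
clusters = octamer_clusters(toy)
for motif, members in clusters.items():
    print(motif, members)
```

Because step 3 matches the motif anywhere in a sequence, one sequence can belong to more than one cluster.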

mourisl commented 4 years ago

Partial CDR3 means incomplete sequencing. But in TRUST4's results, many of the partial CDR3s also come from V or J genes before recombination, so they should be used with caution. I'm considering removing partial CDR3 results from the report by default in a future formal release.

I'm not an author of the TRUST3 paper. In my experience, many CDR3s that are partial in TRUST3 become complete in TRUST4, so there is no need to look only for substrings (octamers).

toddey commented 4 years ago

Got it! Thanks for your patience!