liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

Missing J sequences #223

Open Januaryyiyue opened 1 year ago

Januaryyiyue commented 1 year ago

Hello, I ran TRUST4 on my whole-genome sequencing (WGS) data and got the report.tsv file. What I don't understand is why some clones have only V gene usage information, with no D or J gene information. There are also cases where the J gene is reported but the V gene is not.

Please find two examples below:

#count	frequency	CDR3nt	CDR3aa	V	D	J	C	cid	cid_full_length
40	1.632653e-01	TGTATGATCGAGCACAGCAGAGCTTCTCATGCTGACACACACAGGTGG	CMIEHSRASHADTHRW	IGLV5-45*01	.	.	.	assemble123	0
3	3.333333e-01	TGTGACAATAACAATGACATGCGCTTT	CDNNNDMRF	.	.	TRAJ43*01	TRAC	assemble2106	0

Could someone explain why this is the case? Thank you so much!

mourisl commented 1 year ago

It depends on the underlying contig, which may not overlap the V or J gene by enough length to get the annotation. Since your data is WGS, it is more likely that these came from non-recombined V and J genes whose genomic sequence happens to contain the CDR3 motifs.

I'm working on a new feature to filter out these false-positive CDR3s from genomic regions, which can occur often in WGS data.
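
In the meantime, one rough way to set such rows aside on the user side is to split the report by whether both V and J were annotated. The sketch below is not part of TRUST4 and is not the filter mentioned above. It assumes the report is tab-delimited with the header shown in the question and that a missing annotation is written as `.`, as in the two example rows. The function name `split_by_vj` is hypothetical.

```python
def split_by_vj(lines):
    """Split report rows into (complete, partial).

    A row is 'partial' when its V or J column is '.', the placeholder
    the example report uses for a missing annotation (assumption).
    """
    header = None
    complete, partial = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if header is None:
            # The first non-empty line is the header; drop the leading '#'.
            header = [name.lstrip("#") for name in fields]
            continue
        row = dict(zip(header, fields))
        if row["V"] == "." or row["J"] == ".":
            partial.append(row)
        else:
            complete.append(row)
    return complete, partial


# The two rows from the question, rebuilt as tab-delimited lines.
example = [
    "\t".join(["#count", "frequency", "CDR3nt", "CDR3aa", "V", "D", "J",
               "C", "cid", "cid_full_length"]),
    "\t".join(["40", "1.632653e-01",
               "TGTATGATCGAGCACAGCAGAGCTTCTCATGCTGACACACACAGGTGG",
               "CMIEHSRASHADTHRW", "IGLV5-45*01", ".", ".", ".",
               "assemble123", "0"]),
    "\t".join(["3", "3.333333e-01", "TGTGACAATAACAATGACATGCGCTTT",
               "CDNNNDMRF", ".", ".", "TRAJ43*01", "TRAC",
               "assemble2106", "0"]),
]

complete, partial = split_by_vj(example)
for row in partial:
    print(row["cid"], row["V"], row["J"])
```

For a real file, pass an open file handle instead of the list, for example `split_by_vj(open("report.tsv"))`. Rows that land in `partial` are only candidates for the genomic false positives described above; a missing V or J call can also come from a short contig, so this is a triage step rather than a definitive filter.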