liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

reference genome #111

Open christoforos-dimitropoulos opened 2 years ago

christoforos-dimitropoulos commented 2 years ago

Hey! I am planning to use TRUST4 to extract TCR sequences from bulk whole-mRNA-seq data (single end). I am puzzled about which reference sequence to use for this. Since I would like to evaluate the clonotype distribution across samples, I would ideally like to extract VDJ sequences (from both the alpha and beta chains?). I did not manage to find such a file in the IMGT database. I would really appreciate any input on this since I am new to analyzing bulk seq. Thanks a lot :)

mourisl commented 2 years ago

The human and mouse IMGT files are provided in the GitHub repository, and you can use them directly for the --ref option.

christoforos-dimitropoulos commented 2 years ago

Great, thanks a lot, this worked :) I was just curious about the fact that the most abundant clonotypes in my samples seem to be from B cells, even though the sequenced sample was sorted CD8 T cells. Are these false positives? Isn't it weird that they show up at the top of the list? Is this common?

mourisl commented 2 years ago

The most abundant clonotype should be real. Sorted samples usually still contain a small proportion of other cell types. B cells usually express BCR genes at much higher levels than T cells express TCR genes, so even if the B cell proportion is low, a BCR clonotype can still be the most highly expressed.
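One way to see how much of a sample's signal comes from BCR versus TCR chains is to sum clonotype counts per chain, using the gene-name prefix (IGH, IGK, IGL, TRA, TRB, TRG, TRD). The sketch below is only an illustration: the `(count, V, J, C)` tuple layout and the example rows are hypothetical, so the parsing of an actual report file would need to be adapted to its real columns.

```python
from collections import Counter

# Chain families recognised by their gene-name prefix.
CHAIN_PREFIXES = ("IGH", "IGK", "IGL", "TRA", "TRB", "TRG", "TRD")


def chain_of(genes):
    """Return the chain prefix of the first recognisable gene name."""
    for gene in genes:
        for prefix in CHAIN_PREFIXES:
            if gene.startswith(prefix):
                return prefix
    return "unknown"


def tally_by_chain(rows):
    """Sum clonotype counts per chain.

    `rows` is an iterable of (count, v_gene, j_gene, c_gene) tuples.
    This layout is an assumption made for the example, not a file format.
    """
    totals = Counter()
    for count, v_gene, j_gene, c_gene in rows:
        totals[chain_of((v_gene, j_gene, c_gene))] += count
    return totals


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    example_rows = [
        (120, "IGKV1-5*01", "IGKJ1*01", "IGKC"),
        (30, "TRBV7-9*01", "TRBJ2-1*01", "TRBC2"),
        (5, "*", "TRAJ12*01", "TRAC"),  # unassigned V gene shown as a placeholder
    ]
    for chain, total in tally_by_chain(example_rows).most_common():
        print(chain, total)
```

A large IG total next to a small TR total in a sorted T cell sample would be consistent with the explanation above: a few contaminating B cells with very high BCR expression.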

kespaG commented 1 year ago

Hi, Sorry for reviving this old thread. @mourisl First off, thank you for this great tool!

I have two questions about references:

1) Would you recommend updating the human reference before starting a new project?

2) When I used the provided Perl script (BuildImgtAnnot.pl), I noticed that the constant regions are represented differently in the new reference than in the provided reference (human_IMGT+C.fa) (see examples below). Is this due to changes in the IMGT database, or am I missing a step? Thanks in advance for your time!

Provided reference

IGKC chr2 88857161 88857683 - GAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAGCAGTTGAAATCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTATCCCAGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGTAA[...]

New reference

IGKC*01 ............CGAACTGTGGCTGCACCATCTGTCTTCATCTTCCCGCCATCTGATGAG CAGTTGAAA.........TCTGGAACTGCCTCTGTTGTGTGCCTGCTGAATAACTTCTAT CCC......AGAGAGGCCAAAGTACAGTGGAAGGTGGATAACGCCCTCCAATCGGGT... ...AACTCCCAGGAGAGTGTCACAGAGCAGGACAGCAAGGAC...............AGC ACCTACAGCCTCAGCAGCACCCTGACGCTGAGCAAAGCAGACTAC......GAGAAACAC AAAGTCTACGCCTGCGAAGTCACCCATCAGGGC......CTGAGCTCGCCCGTCACAAAG AGCTTCAACAGGGGAGAGTGT

mourisl commented 1 year ago

Hi @kespaG

1) I think the IMGT reference database is fairly stable, so there shouldn't be many differences using the updated database. The one provided in the package is for user convenience, and you can download a new one.

2) The constant gene sequences in the human_IMGT+C.fa provided in the package were extracted based on the GENCODE annotation of the hg38 genome. These sequences are slightly longer than the IMGT constant gene sequences and may provide better anchors for read extraction. In practice, though, the choice should not make much difference.
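To compare the two representations quoted earlier in the thread, it can help to remove the IMGT alignment gaps (written as dots) and then check whether one sequence contains the other. The snippet below is a minimal sketch: the helper names are hypothetical, and the two fragments are short pieces copied from the IGKC excerpts above, not full sequences.

```python
def strip_imgt_gaps(gapped: str) -> str:
    """Remove IMGT gap dots and any whitespace from a gapped sequence."""
    return "".join(ch for ch in gapped if ch not in ". \t\r\n").upper()


def overlap_offset(shorter: str, longer: str) -> int:
    """Return the position of `shorter` inside `longer`, or -1 if absent."""
    return longer.find(shorter)


if __name__ == "__main__":
    # Short fragments taken from the IGKC examples in this thread.
    imgt_fragment = "............CGAACTGTGGCTGCACC"
    provided_fragment = "GAACTGTGGCTGCACC"

    ungapped = strip_imgt_gaps(imgt_fragment)
    print("ungapped IMGT fragment:", ungapped)
    print("offset of provided fragment:", overlap_offset(provided_fragment, ungapped))
```

An offset of -1 for the full sequences would not necessarily mean the references disagree, since the two files were derived differently and may cover slightly different ranges.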