liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

Frequencies in tsv file #37

wxiang-us closed this issue 3 years ago

wxiang-us commented 3 years ago

Hello,

Thanks for the great software. I have a question about the report in tsv format. The values in the "frequency" column sum to 2, not 1. Could you explain why? Below is an example report. Is it because both TCR and IG genes were identified?

#count frequency CDR3nt CDR3aa V D J C
248 3.425694e-01 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGG CARGGSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
193 2.675387e-01 TGTCAGCAGTATGGTAGCTCACCCCTCACTTTC CQQYGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
110 1.519400e-01 TGTCAGCAGTATGGTAGCTCACCCTCACTTTC out_of_frame IGKV3-20*01 * IGKJ4*01 IGKC
33 4.558200e-02 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGG CARGGSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHD
8 1.105018e-02 GTATCAACGCAGAGTACGGGGGATACAGCTATGGTTGACTACTGG partial * * IGHJ4*02 IGHM
7 9.668909e-03 TTTGTTCAGCAAGACAATGGAGAGCTCTCACTGTGGTGGACGTTC partial * * IGKJ1*01 IGKC
6 9.213089e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGGTGCTTTTGATATCTGG CARGGSGWYGGAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
5 7.818004e-03 TGTCAGCAGTATGGTTGCTCACCCCTCACTTTC CQQYGCSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
5 6.906364e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGGGGCCAAGGGACAATGGTCNNNNNNNCGGAGAGTCAGGTTTTTGTGCACCCCTTAATGGGGCCTCCCACAATGTGACTACTTTGACTACTGG partial IGHV4-34*01 * IGHJ4*02 *
5 6.906364e-03 GGTATCAACGCAGAGTACGGGACTTTC partial IGKV3-20*01 * IGKJ4*01 IGKC
5 6.906364e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGG CARGGSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 *
5 6.906364e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGATGCTTTTGATATCTGG out_of_frame IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
4 6.022349e-03 TATCAGCAGTATGGTAGCTCACCCCTCACTTTC YQQYGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
4 5.525091e-03 TGTCAGCAGTATCGTAGCTCACCCCTCACT partial IGKV3-20*01 * * *
4 5.525091e-03 TGTCAGGTGTGGGATCTTAATAGTGATCTTTGGGTGTTC CQVWDLNSDLWVF IGLV3-21*02 * IGLJ3*02 IGLC
3 5.152147e-03 TGTGCGAGAGGCGGGAGTGGCTGGTATGGGGATGCTTTTGATATCTGG CARGGSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
3 4.986394e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGAATGCTTTTGATATCTGG CARGGSGWYGNAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
3 4.254320e-03 TGTCAGCAGTATGGTAGCTCACACCTCACTTTC CQQYGSSHLTF IGKV3-20*01 * IGKJ4*01 IGKC
3 4.143818e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTTGATATCTGG out_of_frame IGHV4-34*01 IGHD6-19*01 IGHJ3*02 *
3 4.143818e-03 ATCAACGCAGAGTACGGGGATGCTTTTGATATCTGG partial * * IGHJ3*02 IGHD
3 4.143818e-03 AGTACGGGGGATACAGCTATGGTTGACTACTGG partial * * IGHJ4*02 IGHD
3 4.143818e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCCTTTGATATCTGG CARGGSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 *
3 4.143818e-03 ATCAACGCAGAGTACGGGTGGTACGGGGATGCTTTTGATATCTGG partial * * IGHJ3*02 *
2 4.005691e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGTTTTTGATATCTGG CARGGSGWYGDVFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
2 3.991878e-03 TGAGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGG _ARGGSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
2 3.770874e-03 TGTGCGAGAGGCGGGAGTCGCTGGTACGGGGATGCTTTTGATATCTGG CARGGSRWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
2 3.660373e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGTGGATGCTTTTGATATCTGG CARGGSGWYVDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
2 3.397931e-03 TGTCAGCAGTATGGGAGCTCACCCCTCACTTTC CQQYGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
2 3.328867e-03 GGTCAGCAGTATGGTAGCTCACCCCTCACTTTC GQQYGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
2 2.803984e-03 TGTCAGCAGTATGGTAGCTCACCCCTCAGTTTC CQQYGSSPLSF IGKV3-20*01 * IGKJ4*01 IGKC
2 2.762545e-03 TGTCAGTAGTATGGTAGCTCACCCCTCACTTTC CQ_YGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
2 2.762545e-03 TGTCAGGTGTGGGATCTTAAT partial IGLV3-21*02 * * *
2 2.762545e-03 TGTGCGAGAGGCGGAGTGGCTGGTACGGGGATGCTTTTGATATCT partial IGHV4-34*01 * IGHJ3*02 *
1 2.016658e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGT CARGGSGWYGDAFDIC IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.961407e-03 TGTGCGAGATGCGGGAGTGGCTGGTACGGGAATGCTTTTGATATCTGG CARCGSGWYGNAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.906156e-03 TGTGCGAGAGGCAGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGG CARGRSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.837093e-03 TGTGCGAGAGGCGGGAGTGACTGGTACGGGGATGCTTTTGATATCTGG CARGGSDWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.823280e-03 TGTGCGAGAGGCGGGAGTGGCTGATACGGGGATGCTTTTGATATCTGG CARGGSG_YGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.781842e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACCGGGATGCTTTTGATATCTGG CARGGSGWYRDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.781842e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACTGGGATGCTTTTGATATCTGG CARGGSGWYWDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.698965e-03 TGTGCGAGAGGCGGGAGTGGCTGGTTCGGGGATGCTTTTGATATCTGG CARGGSGWFGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.505587e-03 TGTCAGCAGTATGGTAGCTCCCCCCTCACTTTC CQQYGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
1 1.505587e-03 TGTCAGCAGTATGGTAGATAACCCCTCACTTTC CQQYGR_PLTF IGKV3-20*01 * IGKJ4*01 IGKC
1 1.491775e-03 TGTGCGAGGGGCGGGAGTGGCTGGTACGGGGATGTTTTTGATATCTGG CARGGSGWYGDVFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.491775e-03 TGTCAGCAGTATGGTAGCTCACCCCTCACCTTC CQQYGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
1 1.395085e-03 TGTCAGCAGTATGGTAGCTCACCCCTCACTTTA CQQYGSSPLTL IGKV3-20*01 * IGKJ4*01 IGKC
1 1.395085e-03 TGTCAGCAGTATGGTAGCTCACCCCTCCCTTTC CQQYGSSPLPF IGKV3-20*01 * IGKJ4*01 IGKC
1 1.395085e-03 TGTCAGCAGTATGGTAGCTCACCTCTCACTTTC CQQYGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
1 1.381273e-03 AGGAACCATTGTGTGTACACTTTT partial * * IGKJ2*01 IGKC
1 5.000000e-01 CGCTGAGGTTTTTGGAACGTCCTCAAGTGCGGTGACACCGATAAACTCATCTTA partial * * TRDJ1*01 TRDC
1 1.381273e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGTGGATGCTTAGATATCTGG out_of_frame IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.381273e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATCCCTTT partial IGHV4-34*01 * * *
1 1.381273e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGAGATCTGG CARGGSGWYGDAFEIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 1.381273e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTGATATCTGG out_of_frame IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM
1 5.000000e-01 CGCTGAGGTTTTTGGAACGTCCTCAAGTGCTGTGACACCGATAAACTCATCTTA partial * * TRDJ1*01 TRDC
1 1.381273e-03 GGTATCAACGCAGAGTACGGGACAGCTATGGTTGACTACTGG partial * * IGHJ4*02 IGHD
1 1.381273e-03 TGTCAGCAGTGTGGTAGCTCACCCCTCACTTTC CQQCGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC
1 1.381273e-03 TGTGCGAGAGGCGGGAGTGGC partial IGHV4-34*01 * * IGHD
1 1.381273e-03 TGTCAGCAGTATGGTAGCTCACCCCCTCACTTTC out_of_frame IGKV3-20*01 * IGKJ4*01 IGKC
1 1.381273e-03 TGTCAGCAGTATGGTAGCTCACCCACTCACTTTC out_of_frame IGKV3-20*01 * IGKJ4*01 IGKC
1 1.381273e-03 CCATTGTGTGTACACTTT partial * * IGKJ2*01 IGKC
1 1.381273e-03 TGTGACAACTGGTTCGACCCCTGG partial * * IGHJ5*02 *

Thanks, Wendy

mourisl commented 3 years ago

Yes, the frequencies are computed separately for BCR and TCR, so each of the two groups sums to 1.
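To check this on a report, one option is to sum the frequency column per chain family. The following is only a minimal sketch, not part of TRUST4: the helper names `chain_family` and `sum_frequency_by_family` are made up for illustration, and it assumes the family can be read from the first two letters ("IG" or "TR") of the first annotated V, J, or C gene, with the eight columns in the order shown in the report header.

```python
from collections import defaultdict


def chain_family(v, j, c):
    """Return 'IG' or 'TR' from the first annotated gene ('*' means no call)."""
    for gene in (v, j, c):
        if gene != "*":
            return gene[:2]
    return "unknown"


def sum_frequency_by_family(lines):
    """Sum the frequency column separately for IG and TR rows."""
    totals = defaultdict(float)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        # Assumption: the first eight whitespace-separated fields are
        # count, frequency, CDR3nt, CDR3aa, V, D, J, C.
        count, freq, cdr3nt, cdr3aa, v, d, j, c = line.split()[:8]
        totals[chain_family(v, j, c)] += float(freq)
    return dict(totals)


# Four rows copied from the report in this issue (two IG, two TR).
sample = [
    "#count frequency CDR3nt CDR3aa V D J C",
    "248 3.425694e-01 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGG CARGGSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM",
    "193 2.675387e-01 TGTCAGCAGTATGGTAGCTCACCCCTCACTTTC CQQYGSSPLTF IGKV3-20*01 * IGKJ4*01 IGKC",
    "1 5.000000e-01 CGCTGAGGTTTTTGGAACGTCCTCAAGTGCGGTGACACCGATAAACTCATCTTA partial * * TRDJ1*01 TRDC",
    "1 5.000000e-01 CGCTGAGGTTTTTGGAACGTCCTCAAGTGCTGTGACACCGATAAACTCATCTTA partial * * TRDJ1*01 TRDC",
]
print(sum_frequency_by_family(sample))
```

In the example report the two TRD rows each carry a frequency of 0.5, so the TR group alone already sums to 1; the IG rows make up the other 1.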

wxiang-us commented 3 years ago

@mourisl thank you, this makes sense. Do you have a manual that includes more details? For example, how to interpret the results, parameter suggestions for different library designs, etc. Thanks again!

mourisl commented 3 years ago

The README has a comprehensive description of the results. For example, the last line of the "Input/Output" section says "For frequency, the BCR(IG) and TCR(TR) chains are normalized respectively.".

There is not much parameter tuning for TRUST4. For postprocessing, one can filter out the CDR3s with abundance less than 2 (singletons) to get a more confident set of BCR/TCRs. There are, though, different ways/parameters to run TRUST4 for RNA-seq, smart scRNA-seq, 10X scRNA-seq, and TCR-seq/BCR-seq, which are described in the README.md file. Does this help? Thank you.

wxiang-us commented 3 years ago

@mourisl Thank you! I am trying to find more details on the other output files. For example, besides the tsv, there are raw.out, cdr3.out, and final.out, all in a format similar to fasta but with additional lines containing a list of numeric values. What do those numeric values indicate?

>assemble0 IGHV1-3+IGHG4
TGTGGCGTCAGTGTACGGGGGCTGTGTATTACTGTGCGAGGGGCCTCCTCCGGGGGGGCTGGAACGACGTGGACTACTACTATGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTCCTCAGCCTCCACCAAGGGCCCATCGGTCTTCCCCCTGGCACCCTCCTCCAAGAGCACCTCTGGGGGCACAGCGGCCCTGGGCTGCCTGGTCAAGGACTACTTCCCCGAACCGGTGACGGTGTCGTGGAACTCAGGCGCCCTGACCAGCGGCGTGCACACCTTCCCGGCTGTCCTACAGTCCTCAGGACTCTACTCCCTCAGCAGCGTGGTGACCGTGCCCTCCAGCAGAGATCGGAAGAGCGTCG
0 0 0 0 0 0 0 0 0 729 0 0 0 1 729 0 1 0 0 0 0 2 0 0 1 0 2 729 3 2 729 1 3 1 0 0 1 0 729 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 1 1 1 2 1 729 729 0 0 726 0 1 2 0 1 729 0 2 729 1 1 728 0 1 728 2 0 0 2 728 3 0 0 729 0 0 3 1 2 0 1 0 0 0 2 729 729 0 0 0 726 0 0 727 0 1 1 1 1 721 0 3 1 7 0 1 0 0 0 0 714 0 0 0 2 0 0 709 0 0 711 711 0 0 0 0 0 0 698 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 67 0 0 0 0 0 0 0 0 0 68 68 0 64 0 0 63 0 0 0 0 0 0 0 0 0 0 0 55 0 55 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 672 672 0 1 672 0 0 674 0 3 0 0 0 0 0 0 668 667 0 0 0 1 0 0 663 0 1 0 1 0 0 0 1 0 2 0 657 656 1 0 0 656 3 0 2 0 1 0 0 1 0 652 0 0 651 0 0 0 0 0 0 0 0 0 648 0 643 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 0 647 0 647 0 0 1 0 0 0 648 0 0 647 0 0 0 0 643 0 1 0 0 0 0 0 646 0 0 644 0 0 0 0 0 0 0 0 643 0 0 0 0 0 0 1 0 0 0 0 642 0 0 638 0 4 0 4 0 0 0 0 3 3 0 2 0 0 0 0 0 0 0 0 0 0 0 729 0 0 729 0 0 0 0 0 0 725 0 0 0 0 0 726 0 0 0 0 0 0 0 0 0 726 0 2 1 1 728 0 0 0 1 0 2 729 728 0 728 728 0 726 725 0 1 1 0 1 2 0 728 1 0 0 0 0 729 0 3 729 0 0 0 0 0 727 0 0 728 0 0 728 0 0 0 1 0 0 0 0 0 0 0 729 0 0 728 1 2 0 0 1 729 727 0 0 1 0 0 2 727 728 0 727 0 1 0 722 1 721 717 0 0 720 0 720 718 0 719 2 2 716 715 2 713 713 2 712 711 0 0 0 2 2 708 707 706 0 0 695 0 0 0 78 0 0 74 74 74 74 74 0 0 0 67 0 67 67 67 0 68 68 0 68 68 0 0 0 0 0 62 0 61 61 0 61 0 0 0 0 0 0 55 0 55 0 0 43 0 0 43 44 44 0 0 0 0 673 0 0 674 672 0 0 0 0 672 0 0 0 0 0 673 0 0 670 1 0 668 667 667 667 1 0 0 668 666 0 0 0 0 0 662 0 0 0 0 0 658 0 1 0 0 0 0 657 0 658 0 0 0 652 0 651 655 654 0 0 0 653 648 0 0 649 0 0 646 0 0 0 649 0 649 0 648 648 1 0 646 647 648 0 0 646 0 0 0 647 648 0 0 648 0 0 0 647 648 0 647 0 0 0 0 646 0 647 0 0 646 0 646 646 646 1 646 0 0 645 0 0 644 0 0 0 0 2 0 0 643 643 0 0 0 643 642 643 0 643 643 0 0 641 0 0 0 0 0 0 3 0 0 0 0 0 0 0 2 0 0 2 0

mourisl commented 3 years ago

raw.out contains the raw assembly results; the final.out file extends the raw assemblies with mate-pair information. A TRUST4 assembly is actually a consensus of reads, which can accommodate the somatic hypermutations in BCRs. So each assembly is followed by 4 lines, for "A", "C", "G", and "T"; each number is the number of reads with that nucleotide at that position. The _annot.fa file is the annotation of those consensus assemblies against the "--ref" file.

_cdr3.out reports the CDR3 information encoded in those consensuses. As you can see, one consensus (assembly) can comprise multiple CDR3s. The tsv file just reformats the cdr3.out file to follow the VDJTools format.
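As an illustration of how the four count lines relate to the consensus, here is a small sketch that picks the most supported base at each position and lists the minor bases that still have read support (candidate hypermutations or errors). This is not TRUST4's code: the function name `consensus_from_counts` is hypothetical, the majority-vote rule is an assumption, and the counts below are toy numbers rather than values from a real raw.out file.

```python
def consensus_from_counts(count_lines):
    """Take four count lines (A, C, G, T) and return (consensus, minor_bases).

    minor_bases is a list of (position, base, read_count) for every
    non-consensus base with at least one supporting read (0-based positions).
    """
    bases = "ACGT"
    rows = [[int(x) for x in line.split()] for line in count_lines]
    if len(rows) != 4 or len({len(r) for r in rows}) != 1:
        raise ValueError("expected four count lines of equal length")

    consensus = []
    minor_bases = []
    for pos, column in enumerate(zip(*rows)):
        # Index of the base with the most reads (ties go to the first base).
        best = max(range(4), key=lambda i: column[i])
        consensus.append(bases[best] if column[best] > 0 else "N")
        for i, n in enumerate(column):
            if i != best and n > 0:
                minor_bases.append((pos, bases[i], n))
    return "".join(consensus), minor_bases


# Toy counts for a 5-base assembly (made-up numbers).
toy = [
    "0 0 0 5 1",  # A
    "0 9 0 0 0",  # C
    "0 1 8 0 0",  # G
    "7 0 0 0 6",  # T
]
seq, minors = consensus_from_counts(toy)
print(seq, minors)
```

Positions listed in `minors` are where some reads disagree with the consensus, which is the kind of information the count lines keep and a plain fasta sequence would lose.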

wxiang-us commented 3 years ago

@mourisl Thanks very much, this info is quite helpful! Have you compared TRUST4 with MiXCR regarding the accuracy of somatic hypermutation detection? MiXCR uses a cluster-based consensus generation algorithm. In my opinion, a consensus-based method is vulnerable to abundant transcripts, and generating a consensus works against variant identification, which seems to work against detecting somatic hypermutations. How does TRUST4 balance the accuracy of consensus sequences with the detection of somatic hypermutations? Thanks in advance!

Best, Wendy

mourisl commented 3 years ago

Yes. In our experiments, TRUST4 consistently found more CDR3s than MiXCR. When we tuned MiXCR's parameters to reach a precision similar to TRUST4's, TRUST4 still found more CDR3s. In TRUST4, we realign the reads to the consensuses, and the CDR3s from those reads can be extracted accordingly, which includes the low-abundance CDR3s.

wxiang-us commented 3 years ago

Thanks again for your prompt reply! Do you suggest using a read-count or frequency-based threshold to exclude low-abundance CDR3s detected by TRUST4? How would you handle and interpret "partial" and "out of frame" sequences?

mourisl commented 3 years ago

In my own experience, many of the singleton CDR3s (read count < 2) are still correct, but their precision should be lower than that of CDR3s with more read support. So if you want higher precision, you can remove them. For my analysis I usually filter out the partial CDR3s, because their precision seems relatively low too. For out-of-frame CDR3s, if your analysis is focused on amino acids, you can ignore them. Note that some of the CDR3 amino acid sequences contain the symbol "_" for a stop codon, and "?" if there is an N in the nucleotide sequence. You can ignore those CDR3s too, depending on your analysis.
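The filters described above could be applied to the tsv report along these lines. This is a rough sketch under stated assumptions, not an official TRUST4 utility: `parse_row` and `keep_row` are hypothetical helper names, the first eight columns are assumed to be in the order of the report header, and the thresholds are just the ones mentioned in this thread.

```python
def parse_row(line):
    """Split one report line into a dict (first eight columns only)."""
    count, freq, cdr3nt, cdr3aa, v, d, j, c = line.split()[:8]
    return {
        "count": float(count), "frequency": float(freq),
        "CDR3nt": cdr3nt, "CDR3aa": cdr3aa,
        "V": v, "D": d, "J": j, "C": c,
    }


def keep_row(row, min_count=2, amino_acid_only=True):
    """Return True if the row passes the filters suggested in this thread."""
    if row["count"] < min_count:       # drop singletons
        return False
    if row["CDR3aa"] == "partial":     # drop partial CDR3s
        return False
    if amino_acid_only:
        if row["CDR3aa"] == "out_of_frame":
            return False
        # "_" marks a stop codon, "?" an N in the nucleotide sequence.
        if "_" in row["CDR3aa"] or "?" in row["CDR3aa"]:
            return False
    return True


# Rows copied from the report in this issue.
report = [
    "248 3.425694e-01 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGG CARGGSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM",
    "110 1.519400e-01 TGTCAGCAGTATGGTAGCTCACCCTCACTTTC out_of_frame IGKV3-20*01 * IGKJ4*01 IGKC",
    "8 1.105018e-02 GTATCAACGCAGAGTACGGGGGATACAGCTATGGTTGACTACTGG partial * * IGHJ4*02 IGHM",
    "2 3.991878e-03 TGAGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGG _ARGGSGWYGDAFDIW IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM",
    "1 2.016658e-03 TGTGCGAGAGGCGGGAGTGGCTGGTACGGGGATGCTTTTGATATCTGT CARGGSGWYGDAFDIC IGHV4-34*01 IGHD6-19*01 IGHJ3*02 IGHM",
]
kept = [row for row in map(parse_row, report) if keep_row(row)]
print([row["CDR3aa"] for row in kept])
```

Note that the frequency column is not renormalized here; if frequencies are needed after filtering, they would have to be recomputed from the remaining counts within each of the IG and TR groups.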

wxiang-us commented 3 years ago

@mourisl Thanks very much, appreciate all your help!

mourisl commented 3 years ago

No problem. Please feel free to give feedback so we can improve the usability of TRUST4.