liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
272 stars 47 forks source link

Issues with the number of count of clones #112

Open HStephane opened 2 years ago

HStephane commented 2 years ago

Hi, I am quite new in the BCR/TCRsequencing field and I wanted to try to compare TRUST4 and MIXCR on a simple dataset (amplicons of the CDR3 regions).

I have a paired end fastq with about 20'000 reads each. When I run MIXCR, the reports stated that I have about 20'000 total reads with 93.62% successfully aligned and 87.09% used for clonotypes. However when I run TRUST4, there are about 40'000 reads found and assembled, and the sum of #counts of clones in the report is also about 40'000.

This is the code I used : ./run-trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 input/1_S1_L001_R1_001.fastq -2 input/1_S1_L001_R2_001.fastq -o analyzed_ --od output/ -t 4

I don't know exactly if I am doing something wrong here. It seems that it does not take into account the paired end file.

One thing that I noticed is that in the report, there are several "assembles" with the same CDR3.

Thank you for your time.

mourisl commented 2 years ago

Could you please show me the first few lines of 1_S1_L001_R1_001.fastq and 1_S1_L001_R2_001.fastq?

The assemblies with the same CDR3 could be due to there have differences in V and J genes.

HStephane commented 2 years ago

For the Read 1 :

@FS10000339:45:BPG61614-1218:1:1101:1170:1000 1:N:0:1 AGACCTCTCTGTACTTCTGTGCCAGCAGTGGGGCCGGACAGACCAATGAGCAGTTCTTCGGGCCAGGGACACGGCTCACCGTGCTAGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGAGATCTC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @FS10000339:45:BPG61614-1218:1:1101:2100:1000 1:N:0:1 ACTCAGCCGCGTATCTCCGTGCCAGCAGCCGAACGATGTTAAACTCCTACAATGAGCAGTTCTTCGGGCCAGGGACACGGCTCACCGTGCTAGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATTAGAAGCAGA + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @FS10000339:45:BPG61614-1218:1:1101:4810:1000 1:N:0:1 ACTTGGCTGTGTATCTCTGTGCCAGCAGCTTGGATTTGGGGCGGGGGGCCAACGTCCTGACTTTCGGGGCCGGCAGCAGGCTGACCGTGCTGGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGA + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF @FS10000339:45:BPG61614-1218:1:1101:4830:1000 1:N:0:1 ACTCAGCTGTGTATTTTTGTGCCACCTATGGAGGGAGCATTGAGACCATATATTTTGGAGAGGGAAGTTGGCTCACTGTTGTAGAGGACCTGAACAAGGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGAGATCTCCCA + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, @FS10000339:45:BPG61614-1218:1:1101:4920:1000 1:N:0:1 AGACATCTGTGTACTTCTGTGCCACCTATGGAGGGAGCATTGGGACCATATATTTTGGAGAGGGAAGTTGGCTCACTGTTGTAAAGGACCTGAACAAGGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGAGATCTCCCA + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,

For the Read 2 :

@FS10000339:45:BPG61614-1218:1:1101:1170:1000 2:N:0:1 CTTTTGGGTGTGGGAGATCTCTGCTTCTGATGGCTCAAACACAGCGACCTCGGGTGGGAACACGTTTTTCAGGTCCTCTAGCACGGTGAGCCGTGTCCCTGGCCCGAAGAACTGCTCATTGGTCTGTCCGGCCCCACTGCTGGCACAGAAG + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF, @FS10000339:45:BPG61614-1218:1:1101:2100:1000 2:N:0:1 CTTTTGGGTGTGGGAGATCTCTGCTTCTAATGGCTCAAACACAGCGACCTCGGGTGGGAACACGTTTTTCAGGTCCTCTAGCACGGTGAGCCGTGTCCCTGGCCCGAAGAACTGCTCATTGTAGGAGTTTAACATCGTTCGGCTGCTGGCT + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, @FS10000339:45:BPG61614-1218:1:1101:4810:1000 2:N:0:1 CTTTTGGGTGTGGGAGATCTCTGCTTCTGATGGCTCAAACACAGCGACCTCGGGTGGGAACACGTTTTTCAGGTCCTCCAGCACGGTCAGCCTGCTGCCGGCCCCGAAAGTCAGGACGTTGGCCCCCCGCCCCTAATCCTAGCTGCTGGCT + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FF:FFFF,,,F:F,,:,:,,::F,, @FS10000339:45:BPG61614-1218:1:1101:4830:1000 2:N:0:1 CTTTTGGGTGTGGGAGATCTCTGCTTCTGATGGCTCAAACACAGCGACCTCGGGTGGGAACACCTTGTTCAGGTCCTCTACAACAGTGAGCCAACTTCCCTCTCCAAAATATATGGTCTCAATGCTCCCTCCATAGGTGGCACAAAAATAC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFFF, @FS10000339:45:BPG61614-1218:1:1101:4920:1000 2:N:0:1 CTTTTGGGTGTGGGAGATCTCTGCTTCTGATGGCTCAAACACAGCGACCTCGGGTGGGAACACCTTGTTCAGGTCCTTTACAACAGTGAGCCAACTTCCCTCTCCAAAATATATGGTCCCAATGCTCCCTCCATAGGTGGCACAGAAGTAC + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Thank you

mourisl commented 2 years ago

The output on the screen does not use mate-pair information, so the 40000 number is about right. But the abundances for clonotypes should already take the mate-pair into account. Could you show me a few lines of "analyzed_assembled_reads.fa" and how did you compute "sum of #counts of clones in the report"? Thank you.

HStephane commented 2 years ago

So the first few lines of analyzed_assembled_reads.da is :

FS10000339:45:BPG61614-1218:1:1105:4600:2060 -1 2345 22078 CTTTTGGGTGTGGGAGATCTCTGCTTCTGATGGCTCAAACACAGCGACCTCGGGTGGGAACACGTTTTTCAGGTCCTCCAGTACGGTCAGCCTAGAGCCTTCTCCAAAAAACAGCT FS10000339:45:BPG61614-1218:1:1101:5610:1640 1 2124 10966 CTCAGCTGTGTATCTCTGTGCCAGCAGCCGAACGATGTTAAACTCCTACAATGAGCAGTTCTTCGGGCCAGGGACACGGCTCACCGTGCTAGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGAGATCTCCCACACCCAAAAG FS10000339:45:BPG61614-1218:1:1101:5610:1640.1 1 2124 10966 CTCAGCTGTGTATCTCTGTGCCAGCAGCCGAACGATGTTAAACTCCTACAATGAGCAGTTCTTCGGGCCAGGGACACGGCTCACCGTGCTAGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGAGATCTCCCACACCCAAAAG FS10000339:45:BPG61614-1218:1:1111:7240:1900 1 2124 10966 CTCAGCTGTGTATCTCTGTGCCAGCAGCCGAACGATGTTAAACTCCTACAATGAGCAGTTCTTCGGGCCAGGGACACGGCTCACCGTGCTAGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGAGATCTCCCACACCCAAAAG FS10000339:45:BPG61614-1218:1:1111:7240:1900.1 1 2124 10966 CTCAGCTGTGTATCTCTGTGCCAGCAGCCGAACGATGTTAAACTCCTACAATGAGCAGTTCTTCGGGCCAGGGACACGGCTCACCGTGCTAGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTTGAGCCATCAGAAGCAGAGATCTCCCACACCCAAAAG

And I just sum the #counts column in the report with the following code :

cat analyzed__report.tsv | awk '{ sum+=$1} END {print sum}'

Thank you very much

mourisl commented 2 years ago

Thanks for noticing this issue! This is indeed a bug in TRUST4. I just pushed the update to the Github repo. Could you please give it a try? Thank you.

HStephane commented 2 years ago

Thank you very much, works perfectly now!!