liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
272 stars 47 forks source link

Assemble counts and read assign to assembles #181

Closed ljouneau closed 1 year ago

ljouneau commented 1 year ago

Dear TRUST4 contributors,

After having run TRUST4 on a library I tested following quality control :

For each assemble, I compared :

The first count is usually greater than the second one (see file attached to this issue).

Does it mean that a read can contribute to several {assemble ; CDR3} but only one of these association is reported in "*assign.out" file ?

Thank you in advance for your answer. Best regards Luc counts_summed_by_assemble.txt

mourisl commented 1 year ago

The report file coalesce the entries sharing the same VDJ, CDR3 information and sum up the count. The assemble id in the report file is the representative contig (the one with highest abundance). Nevertheless, the number of rows in the _assign.out should be greater than the sum of counts in _report.out file. The assign.out file also contains the reads fully contained in V gene, and the *_report.out file should count the number of reads overlapped with the CDR3 region.

There are many entires near the end of the txt file you shared, do those contigs only show up in the report but not in the assign.out file?

ljouneau commented 1 year ago

Nevertheless, the number of rows in the _assign.out should be greater than the sum of counts in _report.out file.

Indeed, this is what I wanted to point out

There are many entires near the end of the txt file you shared, do those contigs only show up in the report but not in the assign.out file?

Yes

$ cut -f 2 *assign* | sort -u > assemble_in_assign.txt $ cut -f 9 *report.out | sort -u > assemble_in_report.txt $ diff assemble_in_assign.txt assemble_in_report.txt | grep ">" | head > assemble10169 > assemble10187 > assemble1029 > assemble10340 > assemble10354 > assemble10535 > assemble10670 > assemble10703

$ grep "assemble10169" assign.out | wc -l 0 $ grep "assemble10187" assign.out | wc -l 0 $ grep "assemble1029$" *assign.out | wc -l 0

$ diff assemble_in_assign.txt assemble_in_report.txt | grep ">" | wc -l 6616

mourisl commented 1 year ago

I tried several data set, and the lines in assign.out is always way larger than the sum of the report.tsv file. Which version of TRUST4 did you use? Could you please share your running command? Thank you.

ljouneau commented 1 year ago

> Which version of TRUST4 did you use ? I do not know how to print the version I have but I have updated my branch master the 27th of January

$ ls -l *cpp -rw-rw---- 1 ljouneau BDR-ER5 32074 Jan 27 16:08 Annotator.cpp -rw-rw---- 1 ljouneau BDR-ER5 25703 Jan 27 16:08 BamExtractor.cpp -rw-rw---- 1 ljouneau BDR-ER5 17622 Jan 27 16:08 FastqExtractor.cpp -rw-rw---- 1 ljouneau BDR-ER5 49389 Jan 27 16:08 main.cpp

As for the command line I used

TRUST4_PATH/run-trust4 -t 4 -f trout_reference/${strain}_VDJ.fasta \ --ref trout_reference/${strain}_VDJ.fasta \ -u assemble_pass/${sample}_assemble-pass.fastq.gz \ -o $output_dir/$sample \ --outputReadAssignment --noExtraction

Best regards

mourisl commented 1 year ago

You can run "run-trust4" without any command to check the version which is on the first line. It's also a bit strange in your file name, where TRUST4 puts the report information in the *_report.tsv file, not the .out file.

Does the output_dir folder contain _report.tsv or the _report.out file is for other samples?

ljouneau commented 1 year ago

Version of TRUST4 TRUST4 v1.0.8-r430

Does the output_dir folder contain _report.tsv or the _report.out file is for other samples?

No I have one output directory per sample.

BTW, I did the same control for the other samples and, for some assemble, I also have more counts in the report than in the assign.out file (see scripts and head of the results attached to this mail). Best regards assemble_compare_report_and_assign.zip

Best regards

mourisl commented 1 year ago

Thank you for providing the script. The version is up-to-date. I'm just curious that TRUST4 does not have the _report.out file in the output. The results should be stored in outputs _cdr3.out or _report.tsv (the suffix is tsv not out), so I'm wondering how did you get the _report.out file?

ljouneau commented 1 year ago

I ran trust-simplerep.pl after run-trust4. This is how I obtained the report.out : perl TRUST4_PATHtrust-simplerep.pl $output_dir/${sample}_cdr3.out > $output_dir/${sample}_report.out

This morning I ran again my script on report.tsv files and I got the same result (see joined). Best regards asssemble_compare_report_and_assign.csv

mourisl commented 1 year ago

I see. It seems that the input reads are already somehow assembled contigs, so it probably can not be regarded as short reads. In this case, the unmatched count could happen.

  1. As mentioned earlier, the count in report.tsv is the summation of different contigs with the same VDJ and CDR3 information, and the contig id in the report file is the representative one. So the number in report.tsv could be greater than the one shown in the assign.out file.

  2. In the annotator stage, TRUST4 will realign the reads to the contigs to uncover hidden CDR3s in the consensus motifs. The alignment results could be slightly different from the alignment in the assembly stage, so some contigs may have no read assignment in your case. (In short reads scenario, even if the read can be aligned to a contig, if they could not fully cover the CDR3 region, the count will be ignored). In this case, TRUST4 will add a pseudo count 1 to the contig, and still report it in the output. (This helps in a short read setting when no read spans the whole CDR3 region. https://github.com/liulab-dfci/TRUST4/blob/master/Annotator.cpp#L894). There could be multiple contigs with the same complete CDR3 information and the pseudo count of 1. They will be merged again in the report file. Therefore, you can see some contigs without any read assignment.

If you comment out https://github.com/liulab-dfci/TRUST4/blob/master/Annotator.cpp#L894 line, the number in report file should be smaller or equal to the number in assign.out file. Hope this helps clarify the issue.

ljouneau commented 1 year ago

I am not sure to completely understand but I will think about it more deeply later and, meanwhile, I ... TRUST you :-)

I confirm that the sequences I use as input for TRUST4 are the result of an assembled sequence of a paired end short reads (R1-R2)

I would like to try what you propose but I do not see any comment at line 894 of Annotator.cpp source code.

mourisl commented 1 year ago

Sorry for the confusion. In essential, TRUST4 will force some contigs to have the abundance 1 if there is no read to span its CDR3 region. This part is done on this line https://github.com/liulab-dfci/TRUST4/blob/master/Annotator.cpp#L894, so you can remove it to disable this "feature" after recompilation.

ljouneau commented 1 year ago

I tried what you propose but I still have more counts in report.tsv than reads in assign.out

Annotator.cpp@line 894 : line commented

$ head -n 894 TRUST4_HOME/Annotator.cpp | tail -1 //info.push_back( nc ) ;

make has taken it into account

$ ls -l TRUST4_HOME/Annotator.* -rw-rw---- 1 ljouneau BDR-ER5 32076 Feb 3 16:16 TRUST4_HOME/Annotator.cpp -rw-rw---- 1 ljouneau BDR-ER5 2938544 Feb 3 16:16 TRUST4_HOME/Annotator.o

New results have been produced

$ ls -lrt F067_DNA_Sp_d78_NucNvac2_1 total 14717648 -rw-rw---- 1 ljouneau BDR-ER5 254146017 Feb 3 19:57 F067_DNA_Sp_d78_NucNvac2_1_raw.out -rw-rw---- 1 ljouneau BDR-ER5 498201221 Feb 3 19:57 F067_DNA_Sp_d78_NucNvac2_1_assembled_reads.fa -rw-rw---- 1 ljouneau BDR-ER5 254146017 Feb 3 19:57 F067_DNA_Sp_d78_NucNvac2_1_final.out -rw-rw---- 1 ljouneau BDR-ER5 53852269 Feb 3 20:00 F067_DNA_Sp_d78_NucNvac2_1_airr_align.tsv -rw-rw---- 1 ljouneau BDR-ER5 99493115 Feb 4 00:28 F067_DNA_Sp_d78_NucNvac2_1_assign.out -rw-rw---- 1 ljouneau BDR-ER5 42130777 Feb 4 00:28 F067_DNA_Sp_d78_NucNvac2_1_cdr3.out -rw-rw---- 1 ljouneau BDR-ER5 53138721 Feb 4 00:28 F067_DNA_Sp_d78_NucNvac2_1_annot.fa -rw-rw---- 1 ljouneau BDR-ER5 26544036 Feb 4 00:29 F067_DNA_Sp_d78_NucNvac2_1_report.tsv -rw-rw---- 1 ljouneau BDR-ER5 264156211 Feb 4 00:29 F067_DNA_Sp_d78_NucNvac2_1_airr.tsv -rw-rw---- 1 ljouneau BDR-ER5 26544036 Feb 4 00:29 F067_DNA_Sp_d78_NucNvac2_1_report.out -rw-rw---- 1 ljouneau BDR-ER5 30985043 Feb 4 00:29 F067_DNA_Sp_d78_NucNvac2_1_report_with_ambiguity.tsv -rw-rw---- 1 ljouneau BDR-ER5 149638 Feb 4 00:29 get_ambiguity.log -rw-rw---- 1 ljouneau BDR-ER5 106934963 Feb 4 00:36 assemble_CDR3_readid.txt -rw-rw---- 1 ljouneau BDR-ER5 19498151 Feb 4 00:36 get_sequence_id.log -rw-rw---- 1 ljouneau BDR-ER5 13338400174 Feb 4 00:36 best_similarity.txt -rw-rw---- 1 ljouneau BDR-ER5 673595 Feb 6 08:42 assemble_in_report.txt -rw-rw---- 1 ljouneau BDR-ER5 1033876 Feb 6 08:42 assemble_in_assign.txt -rw-rw---- 1 ljouneau BDR-ER5 767826 Feb 6 08:42 assemble_compare_report_and_assign.csv

There are still more counts into report.tsv than in assign.out

$ head F067_DNA_Sp_d78_NucNvac2_1/assemble_compare_report_and_assign.csv Assemble;counts in report;counts in assign assemble4;7899;7023 assemble39;6106;5090 assemble153;5985;5813 assemble154;5453;4588 assemble6;5255;5206 assemble20;4436;3478 assemble1;4131;3641 assemble1201;4095;3925 assemble3;3956;3761

mourisl commented 1 year ago

You will still observe that the counts in report are great than the counts in assign.out due to the report coalescing. But probably you wouldn't find find any contigs in report file that has no read in the assign.out file.

ljouneau commented 1 year ago

But probably you wouldn't find find any contigs in report file that has no read in the assign.out file.

Yes you are right.