Understanding of the output files

yuyuleung commented 8 months ago

Dear Dr. Li,

Firstly, I would like to express my gratitude for providing such an efficient immune assembly tool. I have a few questions about the output files generated by TRUST4, so I have opened a new issue:

Based on the order of the output files, I understand that "to_assemble.fq" is generated first, followed by "assemble_reads.fq". However, I am unsure about the difference between these two files. Which file contains the reads ready for assembly, that is, the reads aligned to reference genes plus the unmapped reads with CDR3 motifs (as I understand the logic of TRUST4)?
While examining the read counts in the statistics, I noticed that the "read fragment count" in "cdr3.out" is reported in decimal form. I am curious about the reason for this, especially considering that the similar statistic in ".report" is reported as a whole number. Additionally, I would like to confirm my understanding of this statistic. Does this number represent the count of reads used to assemble the full sequence of the corresponding consensus (e.g., "assemble0"), or does it only refer to the coverage of the CDR3 region?
Regarding the statistics in "anno.fa," I would like to understand how the third number, "average_coverage," is calculated. Is it the count of reads used to assemble the corresponding consensus divided by the consensus length?

Thank you in advance for your clarification and assistance.

Sincerely, Yuyu

yuyuleung commented 8 months ago

By the way, these output files I have mentioned were generated in the bulk mode. Thanks again.

mourisl commented 8 months ago

The to_assemble file is a very rough prediction of which reads could be candidates from the VDJ region, so it contains many non-VDJ reads. The assembled reads are the reads actually used in the assembly. They include both CDR3 reads and reads that may be fully contained in the V gene region or C gene region.
Some reads can be ambiguously assigned to multiple CDR3s, so TRUST4 applies the EM algorithm to better estimate the CDR3 abundances, which is why the count in cdr3.out can be fractional. It is the number of reads supporting this CDR3, not necessarily the number of reads used to assemble the full sequence.
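
To illustrate why an EM-style estimate produces decimal read counts, here is a minimal sketch. It is not TRUST4's actual code: the function name `em_abundance`, the input layout (one set of compatible CDR3 ids per read), and the iteration count are all assumptions made for this example. Each read contributes a total weight of 1, split across its compatible CDR3s in proportion to the current abundance estimates.

```python
def em_abundance(read_to_cdr3s, n_iter=50):
    """Toy EM for fractional read assignment (hypothetical helper, not TRUST4 code).

    read_to_cdr3s: list of sets, each holding the CDR3 ids one read is compatible with.
    Returns a dict mapping CDR3 id -> estimated (possibly fractional) read count.
    """
    reads = [compat for compat in read_to_cdr3s if compat]  # ignore reads with no candidate
    cdr3s = sorted({c for compat in reads for c in compat})
    abundance = {c: 1.0 for c in cdr3s}  # uniform starting estimate

    for _ in range(n_iter):
        counts = {c: 0.0 for c in cdr3s}
        # E-step: split each read across its compatible CDR3s,
        # weighted by the current abundance estimates.
        for compat in reads:
            total = sum(abundance[c] for c in compat)
            for c in compat:
                counts[c] += abundance[c] / total
        # M-step: the expected counts become the new abundance estimates.
        abundance = counts
    return abundance


# Two reads unique to A, one read unique to B, one read shared by A and B.
example = [{"A"}, {"A"}, {"A", "B"}, {"B"}]
result = em_abundance(example)
print(result)  # the shared read is split unevenly, so both counts are non-integers
```

In this toy input the shared read is mostly credited to A, because A already has more unambiguous support, so neither count is a whole number even though the counts still sum to the number of reads.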
The average_coverage is (sum over reads of read_length) / 500, where the reads are those used to assemble the contig. It divides by 500 instead of the actual contig length to reduce the coverage overestimation bias for short contigs.
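
The coverage formula above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `average_coverage` and the example read lengths are made up for this sketch, and only the fixed denominator of 500 comes from the explanation above.

```python
def average_coverage(read_lengths, denominator=500):
    """Sum of the lengths of the reads used for a contig, divided by a fixed 500.

    Hypothetical helper for illustration; not taken from the TRUST4 source.
    """
    return sum(read_lengths) / denominator


# Assumed example: 40 reads of 100 bp each were used to assemble one contig.
reads = [100] * 40
print(average_coverage(reads))  # 4000 / 500 = 8.0

# For a short contig, dividing by its true length would inflate the value.
short_contig_length = 250
print(sum(reads) / short_contig_length)  # 16.0, twice the fixed-denominator value
```

The second print shows the bias being avoided: the same reads on a 250 bp contig would report double the coverage if the true contig length were used as the denominator.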

Hope this helps.

yuyuleung commented 8 months ago

Dear Prof. Li,

Thank you so much for your detailed explanation. It is very helpful to me :).

Best wishes, Yuyu

liulab-dfci / TRUST4

Understanding of the output files #249