liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License
272 stars 47 forks source link

Stats differences between v1.0.2 and previous vesion #38

Closed wxiang-us closed 3 years ago

wxiang-us commented 3 years ago

Hello,

I've downloaded latest version v1.0.2 and tested it with dataset I processed via an earlier version downloaded in July, 2020. I found that in the tsv file, there are two additional columns 'cid' and 'cid_full_length', other than this, the count & frequency also change quite a bit; I also noticed there is no more 'partial' records. Please find example tsv files generated from the same sample here https://github.com/wxiang-us/data

In your release note, it mentioned "Improve the sensitivity of annotating full-length assemblies." I guess it's contributed to the stats difference. But if it is possible, could you please help me understand what's the major changes in v1.0.2 and how comparable the results are to previous version?

Also, I've generated pseudo fastqs to process via TRUST4. But got error message saying: 'The two mate-pair read files have different number of reads.' But the paired-fastqs I tested with did have same number of reads. Do you have certain requirement for read names in FASTQs? I did define pseudo names to keep track of my simulated data. I've also uploaded fastq example r1.fastq & r2.fastq here https://github.com/wxiang-us/data

Thanks again for your help!

mourisl commented 3 years ago

For the abundance estimation between the two versions, there were several bug fixes that might change abundance estimation. I think the most relevant fixing in your case might be that when the two mate read id have an unconventional suffix such as ".1/.2", TRUST4 might regard those as two different reads in older versions, hence overestimate the abundance by twice. If possible, can you share the file CATAATGGTCCTTATT_0450toassemble(1/2).fq, so I can double-check whether it could be caused by other bugs.

For the r1.fastq and r2.fastq, it seems in r2 file, the read length is 169bp while the quality score length is 150bp, hence causing parsing errors. In addition to that, in the fastq files, the header should start with "@" instead of ">", otherwise TRUST4 may use the read files as fasta format, and would use the quality score entries as reads.

wxiang-us commented 3 years ago

@mourisl Thanks for offering to help! Greatly appreciate! I've uploaded the files to https://github.com/wxiang-us/data, including both old version & v1.0.2 toassemble_(1/2).fq

mourisl commented 3 years ago

Thanks for sharing the files. It is indeed an undesired result from TRUST4. I just fixed the issue causing this abundance estimation error. Could you please git pull the new version and give it a try?

I believe you will still observe different abundance estimations in other data sets, and those could be due to that the older version triggers the issue.

wxiang-us commented 3 years ago

@mourisl Thanks for your prompt check. I saw your new commit were added to v1.0.2 https://github.com/liulab-dfci/TRUST4/commit/98013bec0fe8c869a68b701fe4179828a139b2ea I thus pull v1.0.2 and installed again, tested with the same input. The results are identical to previous v1.0.2 results. Should I pull from a different branch?

mourisl commented 3 years ago

I guess pull v1.0.2 will pull the code under the tag and is still the unfixed code. Can you try "git clone https://github.com/liulab-dfci/TRUST4" to get the most updated code?

wxiang-us commented 3 years ago

@mourisl Thank you! I installed the new one. The results indeed look different from the other two version - I've uploaded the files to the same place (with prefix '0409').

Could you please help me understand a bit more about your fix? Seems you changed procedure about 'outputCDR3File' and MergeOverlappedSeqContigs? Could you explain a bit more about the cid & cid_full_length columns as well? How to interpret different record with the same cid?

mourisl commented 3 years ago

When TRUST4 align a read to a consensus/contig, it will pick the first one contig if there are multiple equally good hits, and the first one usually have higher abundance than later contigs in the implementation. However, during extending contigs (scaffolding) with mate pair information, the order might be changed, so the alignment happens in the abundance estimation stage might priotize wrong assemblies. In the fix, I just added a sorting to reorder the assemblies to make sure the implicit order. "outputCDR3File" is for other features.

cid stands for consensus id. There could be many highly similar BCRs arising from somatic hypermuatations, so TRUST4 represent them as a consensus, which improves both time and memory efficiency. Therefore, each consensus can encode multiple CDR3, and you can see different record with the same cid. Though this feature is motivated by somatic hypermutation, it is essentially for representing similar sequence and might happen in TCR as well. If the consensus is full length (from V gene 5' beginning to part of constant gene), then the report file will denote this information. cid information are most useful if you are interested in the sequences outside of CDR3s.

wxiang-us commented 3 years ago

@mourisl Thanks for your explanation. So by the default parameters, TRUST4 did assemble full length sequences? I was in an impression that it mainly does CDR3 assembly.

mourisl commented 3 years ago

Yes. TRUST4 tries to assemble as long as possible, so some of the assemblies can reach full length.

wxiang-us commented 3 years ago

@mourisl Thank you! In which file I can get the full length sequence information? The tsv file only contains CDR3nt.

mourisl commented 3 years ago

The last two columns in the report file have the consensus/assembly id, you shall find the corresponding assemblies in the file TRUST_annot.fa file.

wxiang-us commented 3 years ago

Thanks again for your prompt reply! Is there a documentation mentioning how to interpret output files? particularly, what's the differences between raw.out & final.out?
what's the values in last four columns of cdr3.out mean? For example: 1.00 6972.02 92.11 0 In assembled_reads.fa, what the values in read name line mean? For example: -1 6530 6762 In annot.fa, what the values in read name line? the values in (), and the value before = ? For example:assemble0 782 5833.70 IGLV3-1901(290):(139-428):(0-289):97.24 IGLJ201(38):(428-463):(2-37):91.67,IGLJ301(38):(428-463):(2-37):91.67 IGLC2(462):(464-781):(0-294):97.23 CDR1(214-231):100.00=AGCCTCAGAAGCTATTAT CDR2(283-291):88.89=GGTAAAAAT CDR3(397-435):100.00=TGTAACTCCCGGGACACCAGTGATAACCATCTGGTCTTC

Thanks again!

mourisl commented 3 years ago

raw.out contains the assembled contig consensuses. final.out is for scaffolds by connecting contigs with mate pair information.

The last 4 columns of cdr3.out file means: the score of CDR3 (0 is partial, other values are based on the motif YYC...F/WGXG), the number of read pairs supporting this CDR3 in this consensus, the germline similarity based on the alignment of V, J gene segment in the CDR3 region and D gene alignment, last number indicates whether this consensus assembly is full length or not.

The format for the annot file is a bit complicated. In short, the field means: gene_name(reference_gene_length):(consensus_start-consensus_end):(reference_start-reference_length):similarity

The detail of the information can be found in the readme file: https://github.com/liulab-dfci/TRUST4#inputoutput

wxiang-us commented 3 years ago

@mourisl Thanks very much!

mourisl commented 3 years ago

You are welcome. Thanks for providing the files for TRUST4 debugging. That was very helpful!.