liulab-dfci / TRUST4

TCR and BCR assembly from RNA-seq data
MIT License

Quality control of TRUST4 and comparing with MiXCR and CellRangerVDJ #265

Open skhakoo opened 2 months ago

skhakoo commented 2 months ago

Hi, we recently ran TRUST4 on some 10x Genomics TCR data using the following command:

trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 input/YWI006-VDJ_S1_L001_R1_001.fastq.gz -2 input/VDJ/YWI006-VDJ_S1_L001_R2_001.fastq.gz --barcode input/VDJ/YWI006-VDJ_S1_L001_R1_001.fastq.gz --readFormat bc:0:15,r1:16:-1 -o /nethome/Documents/trust4/output/

We are currently benchmarking TRUST4 against MiXCR and CellRangerVDJ, so we are comparing the results across all three tools and have a few questions about the TRUST4 output. TRUST4 identifies a significantly greater number of unique clones than MiXCR and CellRangerVDJ.

First, what sort of pre-filtering does the TRUST4 algorithm apply to the sequences before producing the final output? If none is applied, what quality control measures should we take to ensure we keep only real hits?

Second, many of the clones in TRUST4 have only 1 read. Is it sensible to keep these, or should we be implementing some sort of read cutoff?

Additionally, there is a read column and a umi column in the final output. What exactly is the difference between them? The values in the two columns are the same.

Finally, when looking at the overlap of these clones between the different tools, we find that many clones (especially between TRUST4 and MiXCR) do not overlap (see plot below). Do you have any thoughts as to why this could be?

[image: plot of clone overlap between TRUST4, MiXCR, and CellRangerVDJ]

Thanks in advance for any responses :)

mourisl commented 2 months ago

Thank you for the detailed comparisons. Which output file are you checking?

TRUST4 does not apply drastic filtering for QC, so I think the most effective filter is the number of reads supporting a particular CDR3/cell. For a CDR3/cell with multiple reads, the output contig is the consensus, so sequencing errors have less impact. Those supported by a single read may contain sequencing errors, though I guess many of these are handled by the sequence collapsing. In sum, read support is probably the best way to QC TRUST4's output.
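As a rough illustration of filtering on read support, the sketch below keeps only the rows of a tab-separated table whose read-count column meets a minimum. The function name, the column names, the toy data, and the threshold of 2 are all assumptions made for the example; the actual column to filter on depends on which TRUST4 output file is being checked, and the right cutoff depends on the dataset.

```python
import csv
import io


def filter_by_read_support(tsv_text, count_column, min_reads):
    """Return the rows whose value in count_column is at least min_reads.

    tsv_text is the content of a tab-separated table with a header line.
    count_column is whichever column holds the supporting read count
    (an assumption here; check the header of the file you are using).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[count_column]) >= min_reads]


# Made-up toy table. The column names are illustrative only and are not
# guaranteed to match any particular TRUST4 output file.
toy_table = (
    "count\tCDR3aa\n"
    "12\tCASSLGQGNTEAFF\n"
    "1\tCASSPDRGYEQYF\n"
    "3\tCAVRDSNYQLIW\n"
)

# Hypothetical cutoff: drop entries supported by a single read.
kept = filter_by_read_support(toy_table, "count", min_reads=2)
kept_cdr3 = [row["CDR3aa"] for row in kept]
print(kept_cdr3)
```

Comparing the number of clones that remain at a few different cutoffs against the MiXCR and CellRangerVDJ counts is one way to judge how much of the excess comes from single-read entries.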

For the read count and UMI: since you did not specify a UMI region in the read, the UMI information is not used, which is why the two columns hold the same values. Looking at your command, I also want to confirm the read structure: is read1 a 16-bp barcode immediately followed by the sequenced fragment, with read2 consisting entirely of sequence, without barcode or UMI? Or does read1 also carry UMI information?
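To make the read structure in question concrete, here is a minimal sketch of how a specification such as bc:0:15,r1:16:-1 carves up read1, assuming 0-based inclusive coordinates with -1 meaning "through the end of the read" (consistent with bc:0:15 covering 16 bases). This is not TRUST4's own parser; the helper names and the example read are made up, and the second layout with a um field at 16:25 is purely hypothetical, shown only to illustrate where a UMI region would sit if read1 had one.

```python
def parse_read_format(spec):
    """Parse 'name:start:end,...' into {name: (start, end)}."""
    fields = {}
    for part in spec.split(","):
        name, start, end = part.split(":")
        fields[name] = (int(start), int(end))
    return fields


def extract(seq, start, end):
    """Slice seq using 0-based inclusive coordinates; end == -1 means to the last base."""
    return seq[start:] if end == -1 else seq[start:end + 1]


# Made-up read1: 16 placeholder barcode bases followed by 15 bases of sequence.
read1 = "A" * 16 + "CGT" * 5

# Layout from the command in this thread: barcode, then sequence, no UMI.
layout = parse_read_format("bc:0:15,r1:16:-1")
barcode = extract(read1, *layout["bc"])
fragment = extract(read1, *layout["r1"])

# Hypothetical layout in which a 10-base UMI follows the barcode.
umi_layout = parse_read_format("bc:0:15,um:16:25,r1:26:-1")
umi = extract(read1, *umi_layout["um"])
after_umi = extract(read1, *umi_layout["r1"])

print(len(barcode), len(fragment), len(umi), len(after_umi))
```

If read1 does contain a UMI, then under the first layout those UMI bases would be treated as part of the sequenced fragment rather than used for UMI counting.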

Thank you.