Quality control of TRUST4 and comparing with MiXCR and CellRangerVDJ

Hi, we recently ran TRUST4 on some 10x Genomics TCR data using the following command:

trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 input/YWI006-VDJ_S1_L001_R1_001.fastq.gz -2 input/VDJ/YWI006-VDJ_S1_L001_R2_001.fastq.gz --barcode input/VDJ/YWI006-VDJ_S1_L001_R1_001.fastq.gz --readFormat bc:0:15,r1:16:-1 -o /nethome/Documents/trust4/output/

We are currently benchmarking TRUST4 against MiXCR and CellRangerVDJ, so we are comparing the results across all three tools and have a few questions about the TRUST4 output. TRUST4 identifies a much larger number of unique clones than MiXCR and CellRangerVDJ:

CellRangerVDJ unique clones: 17
MiXCR v4 unique clones: 79
TRUST4 unique clones: 412

One possible reason is that some clones carry point mutations introduced during PCR amplification. To account for this, we calculated the Levenshtein distance between the concatenated alpha and beta CDR3 nucleotide sequences and collapsed clones within a distance of 1 or 2, for all three tools. TRUST4 still had a much larger number of unique clones:

CellRangerVDJ unique clones: 16
MiXCR v4 unique clones: 55
TRUST4 unique clones: 167
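
For readers who want to reproduce this kind of collapsing, here is a minimal Python sketch. It is not necessarily the procedure behind the numbers above: it assumes single-linkage grouping (any two sequences within the distance threshold end up in the same group), and the function names `levenshtein` and `collapse_clones` are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def collapse_clones(sequences, max_distance=2):
    """Group sequences by single linkage (an assumption, not a confirmed method)."""
    unique = sorted(set(sequences))
    parent = {s: s for s in unique}

    def find(s):
        # Follow parent links to the group representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(unique, 2):
        # A length difference above the threshold already rules the pair out.
        if abs(len(a) - len(b)) > max_distance:
            continue
        if levenshtein(a, b) <= max_distance:
            parent[find(a)] = find(b)

    groups = {}
    for s in unique:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())
```

Note that single linkage can chain distant sequences together through intermediates, so the number of groups depends on the grouping rule as well as on the threshold. The pairwise loop is quadratic, which is fine for a few hundred clones.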

So first, we were wondering what pre-filtering TRUST4 applies to the sequences before producing the final output and, if it applies none, what quality control measures we should take to make sure we keep only real hits.

Second, many of the clones in TRUST4 are supported by only 1 read. Is it sensible to keep these, or should we apply some sort of read cutoff?

Additionally, there is a read column and a UMI column in the final output, and we were wondering what exactly the difference is, because the values in the two columns are the same.

Finally, when looking at the overlap of these clones between the different tools, we find that many clones do not overlap, especially between TRUST4 and MiXCR (see plot below). Do you have any thoughts as to why this could be?
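
An exact-match overlap of this kind can be sketched as follows. This is only an illustration: `clone_overlap` is a hypothetical helper, and it assumes each clone is represented by the same kind of key for every tool (for example, the concatenated alpha and beta CDR3 nucleotide sequences).

```python
def clone_overlap(clones_a, clones_b):
    """Count shared and tool-specific clone keys and compute the Jaccard index."""
    set_a, set_b = set(clones_a), set(clones_b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "shared": len(shared),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        # Jaccard index: shared keys divided by all distinct keys.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

Because the match is exact, keys built differently for each tool (chain order, nucleotide versus amino acid) would show up as non-overlapping clones even when the underlying clonotype is the same.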

Thanks in advance for any responses :)

Thank you for the detailed comparisons. Which output file are you checking?

TRUST4 does not apply aggressive filtering for QC, so I think the most effective filter is the number of reads supporting a particular CDR3/cell. For a CDR3/cell with multiple reads, the output contig is the consensus, so sequencing errors have less impact. Those supported by a single read may contain sequencing errors, many of which I guess are handled by the sequence collapsing. In sum, read support is probably the best way to QC TRUST4's output.
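
As a rough illustration of a read-support filter, the sketch below counts how many clones would survive at several minimum-read thresholds, which may help in choosing a cutoff. The function `support_profile` and the thresholds are hypothetical, and the read counts are assumed to have been extracted already from whichever TRUST4 output file is being checked.

```python
def support_profile(read_counts, thresholds=(1, 2, 3, 5)):
    """Return the number of clones with at least t supporting reads, for each t."""
    return {t: sum(1 for count in read_counts if count >= t) for t in thresholds}


def filter_by_support(clones, min_reads=2):
    """Keep (clone_id, read_count) pairs that meet the minimum read support."""
    return [(clone_id, count) for clone_id, count in clones if count >= min_reads]
```

For example, `support_profile([1, 1, 2, 5, 10])` reports how many of five clones remain at each threshold, so a sharp drop between thresholds 1 and 2 would indicate that most clones are single-read.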

As for the read count and UMI columns: since you did not specify the UMI region in the read, the UMI information is not used. I also looked at your command and want to confirm the read structure: is read 1 a 16 bp barcode immediately followed by sequence, and read 2 all sequence with no barcode or UMI? Or does read 1 also contain UMI information?
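
To make the read layout concrete, here is a small Python sketch of how a specification such as bc:0:15,r1:16:-1 divides read 1. It assumes the intervals are 0-based and inclusive and that -1 means "to the end of the read". The helper names are hypothetical and this is not TRUST4's own parser.

```python
def parse_read_format(spec):
    """Parse 'name:start:end' fields separated by commas into a dict of intervals."""
    fields = {}
    for part in spec.split(","):
        name, start, end = part.split(":")
        fields[name] = (int(start), int(end))
    return fields


def slice_field(read, start, end):
    """Extract a 0-based inclusive interval; end == -1 means up to the read's end."""
    return read[start:] if end == -1 else read[start:end + 1]


def split_read(read, spec):
    """Apply every field in the specification to one read sequence."""
    return {name: slice_field(read, s, e) for name, (s, e) in parse_read_format(spec).items()}
```

Under these assumptions, bc:0:15 takes the first 16 bases as the barcode and r1:16:-1 treats everything after them as sequence, so any UMI located after the barcode would be read as part of the sequence.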

Thank you.

liulab-dfci / TRUST4

Quality control of TRUST4 and comparing with MiXCR and CellRangerVDJ #265