liulab-dfci / RIMA_pipeline


Quality control of TRUST4 and comparing with MiXCR and CellRangerVDJ #25

Closed skhakoo closed 4 months ago

skhakoo commented 4 months ago

Hi, we recently ran TRUST4 on some 10x Genomics TCR data using the following command:

trust4 -f hg38_bcrtcr.fa --ref human_IMGT+C.fa -1 input/YWI006-VDJ_S1_L001_R1_001.fastq.gz -2 input/VDJ/YWI006-VDJ_S1_L001_R2_001.fastq.gz --barcode input/VDJ/YWI006-VDJ_S1_L001_R1_001.fastq.gz --readFormat bc:0:15,r1:16:-1 -o /nethome/daniel.fonseca/Documents/trust4/output/

We are currently benchmarking TRUST4 against MiXCR and CellRangerVDJ, so we are comparing the results across all three tools and have a few questions regarding the TRUST4 output. TRUST4 identifies a significantly greater number of unique clones than MiXCR and CellRangerVDJ.

One possible reason is that some clones may carry point mutations introduced during PCR amplification, so we calculated the Levenshtein distance between the concatenated alpha and beta CDR3 nucleotide sequences. We collapsed clones with a distance of either 1 or 2 for all three tools. Even after collapsing, TRUST4 still had a much greater number of unique clones.
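The collapsing step described above can be sketched roughly as follows. This is only a minimal illustration under assumptions, not necessarily the exact procedure used here: the function names `levenshtein` and `collapse_clones` are hypothetical, clones are assumed to be keyed by their concatenated alpha+beta CDR3 nucleotide sequence, and the grouping is a greedy one in which each less abundant clone is merged into the first more abundant clone within the distance threshold.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Edit distance (substitutions, insertions, deletions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def collapse_clones(counts: Dict[str, int], max_distance: int = 1) -> Dict[str, int]:
    """Greedily merge clones into more abundant clones within max_distance.

    `counts` maps a concatenated alpha+beta CDR3 nucleotide sequence
    (an assumed representation) to its read or UMI count.
    """
    collapsed: Dict[str, int] = {}
    # Visit the most abundant clones first so they become the representatives.
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for representative in collapsed:
            if levenshtein(seq, representative) <= max_distance:
                collapsed[representative] += n
                break
        else:
            collapsed[seq] = n
    return collapsed


toy_counts = {"ACGTACGT": 10, "ACGTACGA": 1, "TTTTGGGG": 3}
toy_collapsed = collapse_clones(toy_counts, max_distance=1)
```

Note that a greedy scheme like this depends on the processing order, and the pairwise comparison is quadratic in the number of clones, so results may differ slightly from a clustering-based approach.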

First, we were wondering what sort of pre-filtering the TRUST4 algorithm applies to the sequences before producing the final output and, if there is none, what quality control measures we should take to ensure we keep only real hits.

Second, many of the clones in TRUST4 have only 1 read. Is it sensible to keep these, or should we implement some sort of read cutoff?

Additionally, there is a read column and a UMI column in the final output, and we were wondering what exactly the difference is, because the values in both of these columns are the same.

Finally, when looking at the overlap of these clones between the different tools, we find that many clones (especially between TRUST4 and MiXCR) do not overlap (see plot below). Do you have any thoughts as to why this could be?

image
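An overlap comparison of this kind can be sketched as below. This is a hedged illustration only: the function name `pairwise_overlap` is hypothetical, each tool's output is assumed to have already been reduced to a set of clone identifiers (for example, concatenated CDR3 nucleotide sequences), and the toy sets are made-up placeholders rather than real results.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def pairwise_overlap(clone_sets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], dict]:
    """Count shared and tool-specific clones for every pair of tools."""
    result = {}
    for name_a, name_b in combinations(sorted(clone_sets), 2):
        a, b = clone_sets[name_a], clone_sets[name_b]
        shared = len(a & b)
        union = len(a | b)
        result[(name_a, name_b)] = {
            "shared": shared,
            "only_first": len(a - b),
            "only_second": len(b - a),
            # Jaccard index: shared clones divided by all clones seen in either tool.
            "jaccard": shared / union if union else 0.0,
        }
    return result


# Placeholder clone identifiers, not real data.
toy_sets = {
    "TRUST4": {"cloneA", "cloneB", "cloneC"},
    "MiXCR": {"cloneB", "cloneC", "cloneD"},
    "CellRangerVDJ": {"cloneC"},
}
toy_overlap = pairwise_overlap(toy_sets)
```

Exact-match overlap like this is sensitive to how each tool reports the CDR3 (for instance, where it places the sequence boundaries), so identifiers may need to be normalised before comparison.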

Thanks in advance for your response :)