milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

High Rate of Unassigned Alignments in Clonotype Assembly #1250

Closed bshim181 closed 12 months ago

bshim181 commented 1 year ago

Hello,

I am trying to decide whether the new version of MiXCR is compatible with our TCR-seq pipeline. Our pipeline is based on the paper "RNase H-dependent PCR-enabled T Cell Receptor sequencing (rhTCRseq) for Highly Specific and Efficient Targeted Sequencing of T Cell Receptor mRNA for Single-Cell and Repertoire Analysis".

The read structure looks like this:

[image: read structure]

This is the command I used to run MiXCR:

mixcr analyze generic-amplicon-with-umi \
    --species hsa \
    --library imgt \
    --rna \
    --rigid-left-alignment-boundary \
    --floating-right-alignment-boundary C \
    --tag-pattern '^(R1:*)\^(UMI:N{7})' \
    ${R1_file} \
    ${R2_file} \
    /mixcr_OUTPUT/${filename_woExt}/${filename_woExt}

The problem is that the number of unique clonotypes in the output differs significantly from that of our existing pipeline, which uses the old MiXCR (v2).

[image: comparison of unique clonotype output]

I am aware that the new version performs UMI correction and builds pre-consensus sequences. It also clusters CDR3 sequences based on nucleotide mismatch thresholds (2 nt or 1 indel).

But the difference I am seeing is very noticeable, and I am worried that my setup might be incorrect. I am also confused by the high rate of unassigned alignments in clonotype assembly, which might have caused the overall decrease in the number of unique clonotypes identified. What might have caused this high rate of unassigned alignments? It was visible in all 4 samples on which I ran tests.

Successfully aligned reads: 97.60% [OK]
Off target (non TCR/IG) reads: 0.61% [OK]
Reads with no V or J hits: 1.77% [OK]
Reads with no barcode: 0.0% [OK]
Alignments that do not cover CDR3: 0.50% [OK]
Tag groups that do not cover CDR3: 0.017% [OK]
Barcode collisions in clonotype assembly: 20.29% [ALERT]
Unassigned alignments in clonotype assembly: 82.59% [ALERT]
Reads used in clonotypes: 16.48% [ALERT]
Alignments dropped due to low sequence quality: 0.0% [OK]
Alignments clustered in PCR error correction: 0.0% [OK]
Clonotypes clustered in PCR error correction: 0.0% [OK]
Clones dropped in post-filtering: 0.0% [OK]
Alignments dropped in clones post-filtering: 0.0% [OK]
Reads dropped in tags error correction and filtering: 1.51% [OK]
UMIs artificial diversity eliminated: 11.85% [OK]
Reads dropped in UMI error correction and whitelist: 0.0% [OK]
Reads dropped in tags filtering: 1.51% [OK]

mizraelson commented 1 year ago

Hi, the command seems fine, but I think the issue lies in the length of your UMI. 7 nt is too short to create sufficient diversity, so most likely multiple CDR3s were assigned to the same UMI, and it wasn't possible to build a single consensus. Can you please share the assemble report to confirm?

bshim181 commented 1 year ago

Hello, this is the assemble report. There seems to be a high number of assembling feature sequences in groups with zero pre-clonotypes: 178768

Analysis time: 1.07m
Final clonotype count: 3104
Reads used in clonotypes, percent of total: 40694 (16.48%)
Average number of reads per clonotype: 13.11
Reads dropped due to the lack of a clone sequence, percent of total: 1239 (0.5%)
Reads dropped due to a too short clonal sequence, percent of total: 0 (0%)
Reads dropped due to low quality, percent of total: 0 (0%)
Reads dropped due to failed mapping, percent of total: 410 (0.17%)
Reads dropped with low quality clones, percent of total: 194 (0.08%)
Aligned reads processed: 41298
Reads used in clonotypes before clustering, percent of total: 40694 (16.48%)
Number of reads used as a core, percent of used: 40645 (99.88%)
Mapped low quality reads, percent of used: 49 (0.12%)
Reads clustered in PCR error correction, percent of used: 0 (0%)
Reads pre-clustered due to the similar VJC-lists, percent of used: 0 (0%)
Clonotypes dropped as low quality: 28
Clonotypes eliminated by PCR error correction: 0
Clonotypes pre-clustered due to the similar VJC-lists: 0
Clones dropped in post filtering: 0 (0%)
Reads dropped in post filtering: 0.0 (0%)
Alignments filtered by tag prefix: 0 (0%)
TRA chains: 1870 (60.24%)
TRA non-functional: 188 (10.05%)
TRB chains: 1234 (39.76%)
TRB non-functional: 23 (1.86%)
Pre-clone assembler report:
  Number of input groups: 5747
  Number of input groups with no assembling feature: 1
  Number of input alignments: 237344
  Number of alignments with assembling feature: 236105 (99.48%)
  Number of output pre-clones: 3920
  Number of pre-clonotypes per group:
    0: + 3111 (54.14%) = 3111 (54.14%)
    1: + 1469 (25.57%) = 4580 (79.71%)
    2: + 1047 (18.22%) = 5627 (97.93%)
    3: + 119 (2.07%) = 5746 (100%)
  Number of assembling feature sequences in groups with zero pre-clonotypes: 178768
  Number of dropped pre-clones by tag suffix conflict: 0
  Number of dropped alignments by tag suffix conflict: 0
  Number of core alignments: 41259 (17.38%)
  Discarded core alignments: 194846 (472.25%)
  Empirically assigned alignments: 39 (0.02%)
  Empirical assignment conflicts: 0 (0%)
  Tag+VJ-gene empirically assigned alignments: 39 (0.02%)
  VJ-gene empirically assigned alignments: 0 (0%)
  Tag empirically assigned alignments: 0 (0%)
  Number of ambiguous groups: 1166
  Number of ambiguous V-genes: 87
  Number of ambiguous J-genes: 47
  Number of ambiguous tag+V/J-gene combinations: 134
  Ignored non-productive alignments: 0 (0%)
  Unassigned alignments: 196040 (82.6%)
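
The percentages in the pre-clone assembler report can be re-derived from its raw counts, which helps in reading odd-looking values such as 472.25%. The sketch below only repeats the arithmetic. The choice of denominators is inferred from the numbers matching, not taken from MiXCR documentation, and the helper name `pct` is made up for this example.

```python
# Counts copied from the pre-clone assembler report above.
input_alignments = 237_344
with_assembling_feature = 236_105
core_alignments = 41_259
discarded_core_alignments = 194_846
unassigned_alignments = 196_040
groups_with_zero_preclones = 3_111
groups_in_histogram = 5_746


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Assumed denominator: number of input alignments.
print("with assembling feature:", pct(with_assembling_feature, input_alignments))
print("core alignments:", pct(core_alignments, input_alignments))
print("unassigned alignments:", pct(unassigned_alignments, input_alignments))

# Assumed denominator: number of core alignments, which is why the
# value can exceed 100%.
print("discarded core:", pct(discarded_core_alignments, core_alignments))

# Share of UMI groups that produced no pre-clonotype at all.
print("groups with zero pre-clonotypes:",
      pct(groups_with_zero_preclones, groups_in_histogram))
```

Under these assumptions the recomputed values agree with the report: more than half of the UMI groups yield no pre-clonotype, and the alignments in those groups account for most of the unassigned alignments.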

mizraelson commented 1 year ago

Yes, that looks like an issue with the UMIs. If you can share a fastq file, it's easy to export alignments with UMIs together with lists of their CDR3s. But in general, 7 nt is very short; around 12 nucleotides is usually recommended. With this data I would recommend analyzing it without UMIs.
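
To put the UMI length in perspective, here is a rough back-of-the-envelope sketch comparing 7 nt and 12 nt UMIs. It assumes UMIs are drawn uniformly at random, which is a simplification, and uses a hypothetical 10,000 tagged molecules; that figure is not taken from this dataset. The function names are invented for the example.

```python
def umi_space(length):
    """Number of distinct UMIs of a given length over A/C/G/T."""
    return 4 ** length


def expected_shared_fraction(n_molecules, length):
    """Expected fraction of molecules whose UMI is also carried by at
    least one other molecule, assuming uniformly random UMIs."""
    space = umi_space(length)
    return 1.0 - (1.0 - 1.0 / space) ** (n_molecules - 1)


n_molecules = 10_000  # hypothetical, for illustration only

for length in (7, 12):
    print(
        f"{length} nt: {umi_space(length)} possible UMIs, "
        f"shared fraction ~ {expected_shared_fraction(n_molecules, length):.4f}"
    )
```

With only 4^7 = 16,384 possible 7 nt UMIs, a large share of molecules is expected to share a UMI with an unrelated molecule under this model, whereas a 12 nt UMI (4^12 = 16,777,216 possibilities) makes such collisions rare at the same input size.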

bshim181 commented 1 year ago

Is there a parameter to turn off the UMI-based pre-consensus? Would it be a single parameter for analyze, or would I have to run each step separately? I am guessing I should use the generic-amplicon preset?

mizraelson commented 1 year ago

Yes, you can use the preset without UMIs and use the tag pattern to trim the first seven nucleotides to facilitate alignment: --tag-pattern ^(R1:*)\^N{7}(R2:*) if you want to use part of the sequence from R2. Does R2 cover part of the V gene?
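
As an illustration of what the `N{7}(R2:*)` part of that pattern does to each R2 read, the sketch below drops the first seven bases (the UMI positions) and keeps the rest, along with the matching quality string. MiXCR performs this itself when given the tag pattern; the function `trim_r2_prefix` is hypothetical and exists only to show the effect.

```python
def trim_r2_prefix(seq, qual, n=7):
    """Drop the first `n` bases of an R2 read and its quality string.

    Returns None when the read is shorter than `n`, since such a read
    could not match a pattern that requires `n` leading nucleotides.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have equal length")
    if len(seq) < n:
        return None
    return seq[n:], qual[n:]


# Toy read: 7 nt UMI followed by the payload that would be aligned.
read_seq = "ACGTACG" + "TTGGCCAA"
read_qual = "IIIIIII" + "FFFFFFFF"
print(trim_r2_prefix(read_seq, read_qual))
```

The UMI bases are discarded rather than captured as a tag, so no UMI grouping takes place and only the remaining R2 sequence is passed on for alignment.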

bshim181 commented 1 year ago

I believe R2 overlaps very little of the V gene (18 bp), but I will specify the tag pattern and try to include it in the analysis. Thank you!

mizraelson commented 1 year ago

Of course! Let me know if you have any other questions.

bshim181 commented 1 year ago

Another question: if I still wanted to make use of the UMIs, is it possible to loosen the threshold for finding a UMI-based consensus? For example, if 10 or more CDR3 sequences share the same UMI, they would be discarded, but with fewer than that they would still be included in the clonotype assembly.

For example, I know that in TRUST4, if there are multiple CDR3s for a single cell (UMI), it regards the most abundant CDR3 as the true CDR3 for a chain and the less abundant CDR3s as secondary.

From my understanding, if I specify the generic-amplicon preset, clustering is based only on gene feature similarity (which I specified as CDR3), and no UMI-based consensus is built. I still hope to make use of the UMI sequences present in the reads to a certain degree without having to sacrifice so many alignments in the process.
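
The rule proposed above can be sketched as a standalone post-processing step. This is only an illustration of the idea (keep the most abundant CDR3 per UMI, and discard UMIs that carry too many distinct CDR3s). It is not a MiXCR option and not TRUST4's actual implementation; the function name, the input format of `(umi, cdr3)` pairs, and the default threshold of 10 are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def pick_cdr3_per_umi(records, max_distinct=10):
    """Choose one CDR3 per UMI from (umi, cdr3) pairs.

    UMIs with `max_distinct` or more distinct CDR3s are discarded as
    likely collisions; otherwise the most abundant CDR3 is kept.
    """
    by_umi = defaultdict(Counter)
    for umi, cdr3 in records:
        by_umi[umi][cdr3] += 1

    chosen = {}
    for umi, counts in by_umi.items():
        if len(counts) >= max_distinct:
            continue  # too many distinct CDR3s under one UMI
        chosen[umi] = counts.most_common(1)[0][0]
    return chosen


# Toy input: one UMI with a clear majority CDR3, one heavily collided UMI.
records = [
    ("AAAAAAA", "CASSLGTDTQYF"),
    ("AAAAAAA", "CASSLGTDTQYF"),
    ("AAAAAAA", "CASSPGQGNEQFF"),
    ("CCCCCCC", "CAVRDSNYQLIW"),
    ("CCCCCCC", "CASSLGTDTQYF"),
    ("CCCCCCC", "CASSPGQGNEQFF"),
]
print(pick_cdr3_per_umi(records, max_distinct=3))
```

With a short UMI, the majority CDR3 under a collided UMI may still come from several unrelated molecules, so a rule like this trades discarded alignments for a risk of merging distinct molecules into one count.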

mizraelson commented 12 months ago

Answered in #1256