milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

Fixing UMI Duplication #1256

Closed bshim181 closed 1 year ago

bshim181 commented 1 year ago

Hello,

I am running MiXCR with the presets generic-amplicon-with-umi and generic-amplicon. Since our wet-lab protocol uses a 7 bp UMI, there seems to be a high level of duplicate reads that get filtered out during the pre-clone assembly step (only 20% of the aligned reads are used for clonotype assembly).

As a result, when I run the same dataset with the two presets, generic-amplicon-with-umi and generic-amplicon, there is a huge difference in the number of clonotypes identified (2,000 vs. 10,000 clonotypes). I was wondering if there is a way to mitigate this behavior or find a middle ground by manually adjusting the pre-clone assembler parameters. Thank you.

bshim181 commented 1 year ago

Another question: how does MiXCR handle reads with the same UMI (possibly duplicates due to sequencing errors) but different CDR3 sequences (different clonotypes)? Does it select the CDR3 sequence that is most abundant (by read count)?

mizraelson commented 1 year ago

Hi,

Firstly, MiXCR performs tag correction to handle sequencing errors in UMIs and other barcodes. This step is called mixcr refineTagsAndSort.

During pre-clone assembly, MiXCR uses a sophisticated algorithm to manage multiple consensuses within a single UMI group, rather than just selecting the most abundant one. In brief, MiXCR initially isolates consensuses which have a share higher than minRecordSharePerConsensus. It then sets aside the reads that belong to the initial consensuses and attempts to isolate another consensus which has a share not less than minRecursiveRecordShare. This process is repeated until the share of the next consensus falls below this value, the number of consensuses reaches maxConsensuses, or the maxIterations limit is reached. There's an added layer of complexity that considers the sequencing quality at each iteration. Practically speaking, if you have 10 reads and 3 consensuses equally distributed across them, MiXCR will assemble all 3. If you have 10 different consensuses across 10 reads (one read each), MiXCR won't assemble any.
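The iterative selection described above can be illustrated with a toy sketch. This is not MiXCR's implementation: it assumes reads are compared by exact sequence identity, ignores sequencing quality, measures every share against the total number of reads in the UMI group, and uses made-up threshold values (0.2) that are neither MiXCR defaults nor the values suggested in this thread. The function name pick_consensuses is hypothetical.

```python
from collections import Counter


def pick_consensuses(reads,
                     min_record_share_per_consensus=0.2,  # illustrative value, not a MiXCR default
                     min_recursive_record_share=0.2,      # illustrative value, not a MiXCR default
                     max_consensuses=6,
                     max_iterations=6):
    """Toy model of iterative consensus picking inside one UMI group.

    Assumptions (simplifications, not MiXCR behaviour):
    - each read is represented by its sequence and matched exactly;
    - shares are computed against the total read count of the group;
    - sequencing quality is ignored.
    """
    total = len(reads)
    remaining = Counter(reads)
    consensuses = []
    for _ in range(max_iterations):
        if not remaining or len(consensuses) >= max_consensuses:
            break
        # Most abundant sequence among the reads not yet set aside.
        seq, count = remaining.most_common(1)[0]
        share = count / total
        if not consensuses:
            accepted = share > min_record_share_per_consensus
        else:
            accepted = share >= min_recursive_record_share
        if not accepted:
            break
        consensuses.append(seq)
        del remaining[seq]  # set aside the reads of the accepted consensus
    return consensuses


# 10 reads split roughly evenly across 3 sequences: all 3 are kept.
print(pick_consensuses(["A"] * 4 + ["B"] * 3 + ["C"] * 3))  # ['A', 'B', 'C']

# 10 reads, each with a different sequence: nothing passes the threshold.
print(pick_consensuses([str(i) for i in range(10)]))  # []
```

With these illustrative thresholds the sketch reproduces the two cases mentioned above; lowering the thresholds lets smaller consensuses through, which is the knob the -Massemble overrides adjust.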

Without examining your data, it's difficult to provide precise optimization recommendations. However, I suggest you try the following set:

mixcr analyze generic-amplicon-with-umi \
--species hsa \
--library imgt \
--rna \
--rigid-left-alignment-boundary \
--floating-right-alignment-boundary C \
--tag-pattern '^(R1:*)\^(UMI:N{7})' \
-Massemble.consensusAssemblerParameters.assembler.maxIterations=6 \
-Massemble.consensusAssemblerParameters.assembler.minRecordSharePerConsensus=0.02 \
-Massemble.consensusAssemblerParameters.assembler.minRecursiveRecordShare=0.1 \
-Massemble.consensusAssemblerParameters.assembler.maxConsensuses=6 \
Input_R1.fastq.gz \
Input_R2.fastq.gz \
output

Please let me know how this works out.

Sincerely, Mark