milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com

N bases #1799

Closed bshim181 closed 1 month ago

bshim181 commented 2 months ago

I have noticed that in the report files, if there is an N base in the gene feature sequences, it is translated to the amino acid X. Is there a way to handle those N bases, for example by replacing them based on the reference?
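
For illustration, the N-to-X behaviour described above can be reproduced with a minimal translation sketch. This is not MiXCR's code: the `translate` function and the example sequence are hypothetical, and the sketch assumes the standard genetic code and that any codon containing a non-ACGT base is reported as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt_seq: str) -> str:
    """Translate in frame 0; codons with an ambiguous base become 'X'."""
    nt_seq = nt_seq.upper()
    usable = len(nt_seq) - len(nt_seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(nt_seq[i:i + 3], "X")  # assumption: unknown codon -> X
        for i in range(0, usable, 3)
    )


print(translate("TGTGCGAGA"))  # CAR
print(translate("TGTGCNAGA"))  # CXR
```

A single N is enough to turn the whole codon into X here, which is why one ambiguous nucleotide call shows up as an X in the amino acid column.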

Screenshot 2024-09-23 at 9 17 38 AM

mizraelson commented 2 months ago

Hi, what command do you run to analyze the data?

bshim181 commented 2 months ago

I ran the analyze command with the rnaseq-full-length preset, using MiXCR version 4.3.2, I believe. Could updating to a newer version of MiXCR solve the issue?

Also, if updating to a new MiXCR version is hard to do (we use a predefined workflow), is there a way to modify the parameters to handle this?

bshim181 commented 2 months ago

From looking at the alignment files, it seems that alignment gaps lead to these ambiguous N bases.

Screenshot 2024-09-26 at 9 48 53 PM

mizraelson commented 2 months ago

Not exactly. In the example above, there is no ambiguity, but rather a single nucleotide deletion in FR3, which will shift the reading frame, rendering the clone non-productive.

The appearance of “N” occurs during the assembleContigs step, when MiXCR extends the initially assembled CDR3 clones to cover more regions of the sequence. This is where ambiguity can arise. You can discard such sequences by adding the following to the analyze command:

-MassembleContigs.parameters.discardAmbiguousNucleotideCalls=true
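
As a rough sketch of how this override could be attached to an analyze call from a scripted workflow, the snippet below assembles the command line as a list of arguments. The helper `build_analyze_command`, the file names, the output prefix and the `--species hsa` choice are illustrative assumptions; only the preset name and the -M override come from this thread. The exact argument layout should be checked against the MiXCR documentation for the version in use.

```python
import shlex


def build_analyze_command(preset, species, inputs, output_prefix, overrides):
    """Assemble a `mixcr analyze` argument list (hypothetical helper)."""
    cmd = ["mixcr", "analyze", preset, "--species", species]
    # Each override is passed as -M<key>=<value>, as suggested in the thread.
    for key, value in overrides.items():
        cmd.append(f"-M{key}={value}")
    cmd.extend(inputs)
    cmd.append(output_prefix)
    return cmd


cmd = build_analyze_command(
    preset="rnaseq-full-length",                           # preset named in the thread
    species="hsa",                                         # assumption: human data
    inputs=["sample_R1.fastq.gz", "sample_R2.fastq.gz"],   # hypothetical files
    output_prefix="results/sample",                        # hypothetical prefix
    overrides={
        "assembleContigs.parameters.discardAmbiguousNucleotideCalls": "true",
    },
)

# Print the command for inspection instead of executing it.
print(shlex.join(cmd))
```

Printing the command first makes it easy to review before passing the list to `subprocess.run`.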

bshim181 commented 1 month ago

Regarding the image I sent above, I looked at an example where N bases appeared in the sequence.

Screenshot 2024-10-01 at 11 44 12 AM

This is the sequence I looked at; I believe it is in the FR3 region, with two Ns in the sequence.

I see two different pools of reads. Out of a total of 21 reads that map to this clone, about half have this variation.

At the two N positions, I see a deletion at the first N position and a mismatch at the second N position (reference=G, query=C).

image

For the other half of the reads, I see a mismatch at the first N position and a match to the reference at the second position.

Screenshot 2024-10-01 at 11 46 14 AM

Rather than replacing these bases with N, is it possible to output all possible sequences with the variants? We are also interested in mutations within VDJ sequences, and this read evidence might point toward potentially biologically relevant targets.

mizraelson commented 1 month ago

Did you try using: -MassembleContigs.parameters.discardAmbiguousNucleotideCalls=true ? Do you still see Ns in the sequences?

Regarding the first case: a deletion of an A nucleotide in FR3 will lead to a frameshift in the translation of CDR3, FR4 and the C gene, and this clone will not be functional.

bshim181 commented 1 month ago

I have tried using -MassembleContigs.parameters.discardAmbiguousNucleotideCalls=true, and it does discard the ambiguous nucleotides and replaces them with the reference sequence.

The possibility I am considering is that the variants captured in these reads are mutations rather than sequencing errors. I was therefore wondering whether there is a way to output all possible variants at those N positions (rather than having them replaced with an ambiguous base).

mizraelson commented 1 month ago

I see. Generally speaking, there is an algorithm behind assembleContigs that splits a clone if there is enough data to support both variants, which is the output you’re looking for. This algorithm considers the shares of each variant, the Phred quality of the nucleotides, their location on the read, and the surrounding context (for example, if you have NN, there might be multiple possible resolutions) among other things. In some cases, there isn’t enough data to determine if the clone should be split, and MiXCR will then place an N. Several parameters guide this process, but the main ones are:

-MassembleContigs.parameters.branchingMinimalQualityShare=0.1
-MassembleContigs.parameters.branchingMinimalSumQuality=60
-MassembleContigs.parameters.outputMinimalQualityShare=0.75

These are the default values for MiXCR v4.7 with the rna-seq preset. You can find explanations for all parameters on our website. I recommend trying the latest version first and adjusting the parameters if needed (generally, the lower the thresholds, the more likely MiXCR will split a clone into two).

That said, based on our experience, the default parameters work best, as they have been empirically evaluated on hundreds of different datasets.
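
To make the role of the three thresholds more concrete, here is a deliberately simplified toy model of a per-position decision. It is not MiXCR's algorithm: the function `resolve_position`, the way the thresholds are combined, and the example quality sums are all assumptions made for illustration, and the real assembleContigs step also takes read position and surrounding context into account, as described above.

```python
def resolve_position(quality_by_variant,
                     branching_min_quality_share=0.1,
                     branching_min_sum_quality=60,
                     output_min_quality_share=0.75):
    """Toy decision for one position: split the clone, call a base, or emit N.

    quality_by_variant maps a variant (e.g. "G") to the summed Phred quality
    of the reads supporting it. The threshold semantics are assumed here.
    """
    total = sum(quality_by_variant.values())
    if total == 0:
        return ("ambiguous", ["N"])

    # Assumption: a variant is "significant" if it passes both branching thresholds.
    significant = [
        variant for variant, quality in quality_by_variant.items()
        if quality >= branching_min_sum_quality
        and quality / total >= branching_min_quality_share
    ]
    if len(significant) >= 2:
        return ("split", sorted(significant))

    # Otherwise, call the best variant only if it dominates the position.
    best, best_quality = max(quality_by_variant.items(), key=lambda kv: kv[1])
    if best_quality / total >= output_min_quality_share:
        return ("call", [best])
    return ("ambiguous", ["N"])


# Hypothetical quality sums, not taken from the data in this thread.
print(resolve_position({"G": 300, "C": 280}))  # ('split', ['C', 'G'])
print(resolve_position({"G": 300, "C": 20}))   # ('call', ['G'])
print(resolve_position({"G": 100, "C": 50}))   # ('ambiguous', ['N'])
print(resolve_position({"G": 100, "C": 50}, branching_min_sum_quality=40))
# ('split', ['C', 'G'])
```

In this toy model, lowering `branching_min_sum_quality` turns the third case from an N into a split, which mirrors the remark that lower thresholds make a split more likely.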