milaboratory / mixcr

MiXCR is an ultimate software platform for analysis of Next-Generation Sequencing (NGS) data for immune profiling.
https://mixcr.com
Other
335 stars 79 forks source link

Artificial gaps in reported amino acid sequences #1611

Closed pursellta closed 7 months ago

pursellta commented 7 months ago

When looking at the output of exportClones from bulk IgG sequences, I have quite a few sequences which are productive (i.e. no stop, no frame shift, etc) and complete/non-gapped nt sequences, but the amino acid translations have _ reported. I am using v4.6.0, a custom reference, and assemble contigs by VDJRegion.

This is the code I ran: mixcr exportClones --filter-out-of-frames --filter-stops -nMutations "{FR1Begin:FR3End}" -nMutationsRate "{FR1Begin:FR3End}" -aaMutations "{FR1Begin:FR3End}" -aaMutationsRate "{FR1Begin:FR3End}" 5RACEbulk.TermSpleen.allelic.clns 5RACEbulk.TermSpleen.allelic.tsv

An example of a concerning output is:

Target_sequence GAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCCCTCACCTGCACTGTGTCCGGGGGCTCAGTCAGCAACATTTACTATTGGCACTGGATCCGCCAGCCCCCGGGGAGAGGGTTGCAGTGGATGGGGTATTGGACAGGCAGCACAAACTACAACCCAGCCTTCCAGGGCCGCATCTCCATCTCTCCTGATACGGCCAGAACCCAGTTGTCCCTGCAACTGAATTCCGTGACCGCCGAGGACACGGCCTTCTATTACTGTGCAAGAGGGTTGTGGAGTAGCGGGGGCTGGTTCGACTCCTGGGGCCAGGGCACTCTGGTCACCGTCTCCTCCG

Specifically concerning are these feature translations:

aaSeqFR1 EVQLQESGPGLV_ALADPLPHLHCV aaSeqCDR1 RGLSQ_NIYY aaSeqFR2 WHWIRQPP_ERVAVDGV aaSeqCDR2 LDRST aaSeqFR3 NYNPAFQGRISISPDTARTQLSLQLNSVTAEDTAFYY aaSeqCDR3 CARGLWSSGGWFDSW aaSeqFR4 GQGTLVTVSS

When I translate the target sequence, I get the following:

rf 1 Target_sequence EVQLQESGPGLVKPSQTLSLTCTVSGGSVSNIYYWHWIRQPPGRGLQWMGYWTGSTNYNP AFQGRISISPDTARTQLSLQLNSVTAEDTAFYYCARGLWSSGGWFDSWGQGTLVTVSS

And when I put the nucleotide sequence into IgBLAST this is the output I get which is what I would expect from MiXCR:

aaSeqFR1 VQLQESGPGLVKPSQTLSLTCTVS aaSeqCDR1 GGSVSNIYY aaSeqFR2 WHWIRQPPGRGLQWMG aaSeqCDR2 YWTGST aaSeqFR3 NYNPAFQGRISISPDTARTQLSLQLNSVTAEDTAFYYC aaSeqCDR3 ARGLWSSGGWFDS aaSeqFR4 WGQGTLVTVSS

mizraelson commented 7 months ago

Hi, it's hard to tell without looking at the alignment of the specific clone, but I think I know what is going on. So this is a tricky thing. My guess is there are some nucleotide deletions in the sequence. We know that all FRs should end with a complete codon thus should have a strict AA border. If it doesn't, it is not clear where to assign the last AA during the translation. Thats why we translate each FR and CDR from both ends moving towards the center and in a normal scenario this returns an accurate translation. If one nucleotide was deleted or inserted there will be a wrong AA sequence for this region, when exported separately. So this is done in order to correctly define AA borders of the segments ending with a complete codon. The workaround here is to add an export column with the VRegion or the VDJRegion for example. These regions are translated from one end, eliminating the need to distinguish segment borders, and return the correct AA receptor sequence. Use -aaFeature VRegion or -aaFeature VDJRegion for this purpose.

pursellta commented 7 months ago

Ok, thank you!