oxfordmmm / gnomonicus

Python code to integrate results of tb-pipeline and provide an antibiogram, mutations and variants
Other
5 stars 0 forks source link

InvalidMutationException #6

Closed alantsangmb closed 1 year ago

alantsangmb commented 1 year ago

I have a data causing the stop of gnomonicus due to an invalid mutation present in the final.vcf produced by clockwork. I believe the error is due to the invalid deletion detected in the file, so it is likely not a program bug. I have tried running gnomonicus using the cortex.vcf and samtools.vcf from the same data. Interestingly, gnomonicus can be completed without error using the cortex.vcf. And I found that the invalid deletion at position 3448496 is also present in the cortex.vcf. All the three vcf are attached for your reference. Thank you so much.

The error using the final.vcf is listed below:

100%|█████████████████████████████████████| 596/596 [00:00<00:00, 294267.83it/s] 100%|█████████████████████████████████████████| 451/451 [01:39<00:00, 4.55it/s] 48%|██████████████████▉ | 299/617 [00:00<00:00, 6226.37it/s] Traceback (most recent call last): File "/usr/local/bin/gnomonicus", line 84, in effects, metadata = populateEffects(options.output_dir, resistanceCatalogue, mutations, referenceGenes, vcfStem) File "/usr/local/lib/python3.10/dist-packages/gnomonicus/gnomonicus.py", line 451, in populateEffects raise InvalidMutationException(gene, mutation) gnomonicus.gnomonicus.InvalidMutationException: Rv3083@-7_del_tggagccatgaaccagcatttcgacgtcctgatcatcggcgccggcctatccggcatcgggacggcctgtcacgtgacggccgagttccccgacaagacaatcgccctcctggaacgacgggagcgcctgggcggcacctgggacttgttccgctacccgggagttcgttcggactccgacatgttcaccttcggctacaagttccgcccgtggcgcgacgtgaaggtgctcgccgacggcgcgtcgatccggcagtacatcgccgacaccgccacggagttcggcgtcgacgagaagattcactacggcctgaaggtcaacaccgccgagtggtcgagccggcagtgccgttggaccgtcgcgggcgtgcacgaggcgaccggcgaaacccggacctacacctgcgattacctcatcagctgcaccggctactacaactacgacgcgggttatctgccggacttccccggcgtgcaccggttcggcggccggtgcgtgcacccgcagcactggcccgaagacctcgattattccggcaagaaggtcgtcgtcatcggcagcggcgcaacggcggtcactttggttccggcgatggccggctccaaccccggcagtgccgcgcacgtgacgatgctgcagcgatccccgtcgtacatcttctcgctgccggcggtcgacaagatctccgaagtcctgggccgcttcctgccggatcgctgggtctacgagtttggccgcaggcgcaacatcgccatccagcgaaagctctaccaggcctgccggcgctggcccaagctgatgcggcgattgctgctgtgggaggtacgacgccgcctcggccgctccgtggacatgagcaacttcaccccgaactacctgccgtgggacgagcggttgtgcgccgtgcccaacggcgatctgtttaagacgctggcctcgggcgcggcgtcggtggtgaccgatcagatcgagaccttcaccgagaagggcatcctgtgcaagtccggccgggagatcgaggccgacatcatcgtcaccgcgaccggtctgaacatccagatgctgggcgggatgcgactcatcgtggacggcgccgaataccagctgccggagaagatgacctataagggtgtgctgctggaaaacgcccccaatctggcctggatcatcggctacaccaacgcgtcatggaccctgaagtccgacatcgccggcgcctacctgtgccggctgctgcggcacatggccgacaacggctacacggtggcaacgccgcgcgatgcgcaggactgcgcgctggacgttggcatgttcgaccagctgaactccggctatgtgaagcgcggccaggacatcatgccgcgccagggctccaagcatccgtggagggtgctcatgcactacgagaaggacgccaagatcctgctcgaagaccccatcgatgacggcgtgctgcacttcgccgcagcggcccaagaccacgcggcggcctgagcatcatgaacctgcgcaaaaacgtcatccggtccgtattacgtggtgcccggccactgttcgcttcccgccggctgggtattgccggccgtcgagtcctgctggcgacgctgacggccggcgcgcgcgcccccaagggcacccgctttcagcgcgtcagcatcgccggtgtcccggtccagcgggtgcaacccccccatgcggcaaccagcgggacgctgatctacctgcacggcggtgcctacgccctgggcagcgcccggggctaccgcggcctggccgcccagctcgcggcggcggccggaatgacggcgctggtccccgactacacccgcgcaccgcacgcccactatccagtggccctcgaagagatggctgcggtgtacacccgcttgctcgacgacgggctcgacccgaaaacgaccgtcatcgccggtgattcggctggcggagggttgaccctggcgctggccatggcgctgcgcgatcgcggcatccaggccccggccgcactcggcctgatctgcccgtgggccgatctcgccgtcgacatcgaagcgacgcgaccggcgctgcgcgatccgctcattcttccgtcgatgtgcaccgaatgggcgccgcgctacgt is not a valid mutation!

JeremyWesthead commented 1 year ago

Hi,

I can't see the VCF files for this, but I think the issue is that this is denoting a deletion which crosses the end of the gene. Rv3083 has a length of 1488, and this deletion is length 2123. As it crosses the gene end, this error is the expected behavior as it does not make sense to denote a gene mutation which is past the end of the gene (GARC does not currently support non-coding and non-promoter gene mutations.

I'm not entirely sure why this works for one VCF rather than the other though. However, gnomonicus is only designed to work with minos VCF files currently, which I believe is the final.vcf

alantsangmb commented 1 year ago

Sorry for missing the attachment previously. Hope the attachment is successful created this time.

cortex.vcf.gz final.vcf.gz samtools.vcf.gz

iqbal-lab commented 1 year ago

Random ideas

JeremyWesthead commented 1 year ago

The VCFs are all upper case, and regardless, internally gumpy converts to lower case when parsing so that shouldn't be an issue.

My apologies, I've just double checked the code - its been a few months since I last worked on this, and at one point that was the behavior. gumpy does actually handle this gracefully. It checks that the deleted bases match the reference bases within the gene, then stops checking once it hits the end of the gene. This should therefore work fine for deletions past the gene end.

Comparing the VCFs, the final.vcf denotes this exact deletion whereas the cortex.vcf actually denotes an extended version of this. I'll investigate if I can find a reason why this changes the behavior as it probably should not

martinghunt commented 1 year ago

It looks to me like indels around there meaning there's more than one way to represent the same variants(s).

Strictly speaking, it's not exactly a deletion. Cortex has the alt allele CAGGGTCCATGA. Minos has instead split into 3 separate deletions, all of which have alt allele length 1. There's a few bp between those 3 variants, so they are distinct. The end coord in the ref (3451398 if I got my numbers correct) is the same for the cortex variant and for the last of the 3 minos variants.

As usual, indels making this hard because there's often several ways to represent the same variation!

JeremyWesthead commented 1 year ago

I've done some more digging, and the reason it was giving different results was due to how gumpy parsed the VCF.

In the minos VCF, the deletion is given very simply as a ref and an alt differing by just the deletion (resulting in a gene position of -7 for the deletion)

In the cortex VCF, the ref and the alt differ by the deletion and 3 SNPs. Internally, gumpy splits this such that the position of the deletion minimizes the number of SNPs caused. The position which it found to be the best was actually different positioning of the deletion - rather placing it at gene position 5.

Because the gene indexing goes ..., -2, -1, 1, 2, ..., it skips 0. When checking against the sequence, gumpy was not accounting for this when iterating it. This means that the cortex VCF's parsed mutation had the intended behavior of correctly checking the sequence, but the minos had an off-by-1 indexing issue checking against the sequence.

I have now fixed and released this patch for gumpy, and updated the version requirement for gnomonicus

alantsangmb commented 1 year ago

Thank you so much! gnomonicus run successfully with the minos VCF using the latest release of the docker image.