oxfordmmm / gnomonicus

Python code to integrate results of tb-pipeline and provide an antibiogram, mutations and variants

insertions in reverse complement genes #13

Closed philipwfowler closed 1 year ago

philipwfowler commented 1 year ago

I think there is a bug that only affects parsing insertions in reverse-complement genes. I've attached a VCF file containing an insertion and a deletion in both forward and reverse-complement genes. They are all near the start of each gene to make life a bit easier.

insertion in forward, rpoB

We aim to insert three bases after the 4th base (759810, g) and before the 5th base (759811, c), so we encode this in the VCF as

NC_000962.3 759810  .   G   GTCG

and this is reported correctly as 759810_ins_tcg in variants and 4_ins_tcg in mutations.

insertion in reverse, pncA

To insert three bases after the 4th base (2289238, c) in a reverse-complemented gene, we have to insert three bases after the 5th base (2289237, g), since the VCF is written on the forward strand, i.e.

NC_000962.3 2289237 .   C   CTCG

and this is reported correctly as 2289237_ins_tcg in variants, but in mutations we get 5_ins_cga rather than the expected 4_ins_cga.
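The expected behaviour for the two insertion cases can be sketched as below. This is only an illustration of the coordinate arithmetic, not the gnomonicus implementation: the function names are hypothetical, and the gene first-base indices are inferred from the positions quoted in this issue (rpoB 4th base at 759810, pncA 4th base at 2289238). It assumes a simple insertion that lies inside the gene.

```python
# Illustrative sketch only; these names are hypothetical and not part of gnomonicus.
_COMPLEMENT = str.maketrans("acgt", "tgca")


def revcomp(bases: str) -> str:
    """Return the reverse complement of a nucleotide string, in lowercase."""
    return bases.lower().translate(_COMPLEMENT)[::-1]


def insertion_mutation(pos: int, ref: str, alt: str, first_base: int, reverse: bool) -> str:
    """Convert a simple VCF insertion (ALT = REF + inserted bases) to a gene-level mutation.

    first_base is the genome index of the gene's 1st base; for a reverse-complement
    gene this is its highest genome index. Both values used below are assumptions
    inferred from the issue text.
    """
    inserted = alt[len(ref):].lower()
    anchor = pos + len(ref) - 1  # genome index the new bases are inserted after
    if not reverse:
        gene_pos = anchor - first_base + 1
        return f"{gene_pos}_ins_{inserted}"
    # On the forward strand the new bases sit between anchor and anchor + 1.
    # Reading along the reverse strand, they therefore follow anchor + 1.
    gene_pos = first_base - (anchor + 1) + 1
    return f"{gene_pos}_ins_{revcomp(inserted)}"


# rpoB (forward): 4th base at 759810 implies the 1st base is 759807
print(insertion_mutation(759810, "G", "GTCG", first_base=759807, reverse=False))  # 4_ins_tcg
# pncA (reverse): 4th base at 2289238 implies the 1st base is 2289241
print(insertion_mutation(2289237, "C", "CTCG", first_base=2289241, reverse=True))  # 4_ins_cga
```

The extra `+ 1` in the reverse branch is the step that distinguishes the expected 4_ins_cga from the reported 5_ins_cga.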

deletion in forward, gyrA

To delete the 4th and 5th bases, we specify the 3rd base and then drop the 4th and 5th in the ALT column. Hence for gyrA this is

NC_000962.3 7305    .   ACA   A

and this is reported correctly as 7306_del_ca in variants but incorrectly in mutations as 5_del_ca rather than the expected 4_del_ca.

deletion in reverse gene, katG

We aim to delete the 4th and 5th bases. But because the VCF describes the forward strand, we need to structure this as starting at the 6th base (which is upstream and a c on the reverse strand) and deleting the "next" two bases, i.e.

NC_000962.3 2156106 .   GGG   G

This is correctly reported as 2156107_del_gg in variants and as 4_del_cc in mutations.
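For reference, the reverse-strand deletion arithmetic described above can be sketched as follows. Again this is a hedged illustration with hypothetical names, not the gnomonicus code; the katG first-base index (2156111) is an assumption inferred from the issue stating that the 6th base is at 2156106.

```python
# Illustrative sketch only; these names are hypothetical and not part of gnomonicus.
_COMPLEMENT = str.maketrans("acgt", "tgca")


def revcomp(bases: str) -> str:
    """Return the reverse complement of a nucleotide string, in lowercase."""
    return bases.lower().translate(_COMPLEMENT)[::-1]


def deletion_variant_and_mutation(pos: int, ref: str, alt: str, first_base: int, reverse: bool):
    """Convert a simple VCF deletion (ALT is a prefix of REF) to (variant, mutation).

    first_base is the genome index of the gene's 1st base; for a reverse-complement
    gene this is its highest genome index.
    """
    deleted = ref[len(alt):].lower()
    first_deleted = pos + len(alt)     # lowest genome index that is removed
    last_deleted = pos + len(ref) - 1  # highest genome index that is removed
    variant = f"{first_deleted}_del_{deleted}"
    if reverse:
        # Read along the reverse strand, the deletion starts at the highest
        # genome index, and the bases are reverse complemented.
        gene_pos = first_base - last_deleted + 1
        mutation = f"{gene_pos}_del_{revcomp(deleted)}"
    else:
        gene_pos = first_deleted - first_base + 1
        mutation = f"{gene_pos}_del_{deleted}"
    return variant, mutation


# katG (reverse): 6th base at 2156106 implies the 1st base is 2156111
print(deletion_variant_and_mutation(2156106, "GGG", "G", first_base=2156111, reverse=True))
# ('2156107_del_gg', '4_del_cc')
```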

This can be tested with the attached VCF and catalogue.

test_004.zip

JeremyWesthead commented 1 year ago

Issue with revcomp insertions should be fixed by this commit