oxfordmmm / gnomonicus

Python code to integrate results of tb-pipeline and provide an antibiogram, mutations and variants
Other
4 stars 0 forks source link

minor alleles incorrectly classified as dominant for `1/1` calls #26

Closed philipwfowler closed 1 year ago

philipwfowler commented 1 year ago

Through a comparison of ~1600 samples from South Africa, focussing on the gene pncA, I've found that a row in the VCF with a MIN_FRS filter fail but was called as 1/1 is recorded as a dominant variant/mutation (e.g. Q141P) rather than a minor variant (e.g. Q141P:0.68).

Using the VCF below if I issue

$ awk '$2>2288681 && $2<2289241 {print $0}' site.10.subj.AG01753383.lab.AG01753383.iso.1.v0.12.4.per_sample.vcf
NC_000962.3 2288820 .   T   G   .   MIN_FRS .   GT:DP:ALLELE_DP:FRS:COV_TOTAL:COV:GT_CONF:GT_CONF_PERCENTILE    1/1:98:44,98:0.6901:142:44,98:336.55:4.14
NC_000962.3 2289067 .   A   G   .   MIN_FRS .   GT:DP:ALLELE_DP:FRS:COV_TOTAL:COV:GT_CONF:GT_CONF_PERCENTILE    0/0:159:159,24:0.8689:183:159,24:841.05:38.46

I can see there are two rows for pncA, both with MIN_FRS filter fails. One is 0/0 and the other 1/1. Then if I do

$ gnomonicus --genome_object ~/packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3.gbk --catalogue_file ~/packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3_WHO-UCN-GTB-PCI-2021.7_v1.0_GARC1_RUS.csv --csvs all --output_dir . --progress --vcf_file site.10.subj.AG01753383.lab.AG01753383.iso.1.v0.12.4.per_sample.vcf --minor_populations minor_alleles.txt
$ grep pncA site.10.subj.AG01753383.lab.AG01753383.iso.1.v0.12.4.per_sample.mutations.csv
site.10.subj.AG01753383.lab.AG01753383.iso.1.v0.12.4.per_sample,pncA,Q141P,cag,ccg,,,141.0,True,,None,141.0,P,1
site.10.subj.AG01753383.lab.AG01753383.iso.1.v0.12.4.per_sample,pncA,S59P:0.131,tcc,zzz,,,59.0,True,,None,59.0,P,1

I see one of these is reported as a minor allele, but the other (the 1/1) is reported as a dominant mutation. If I try without the --minor_populations flag then neither row is reported, as expected.

$ gnomonicus --genome_object ~/packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3.gbk --catalogue_file ~/packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3_WHO-UCN-GTB-PCI-2021.7_v1.0_GARC1_RUS.csv --csvs all --output_dir . --progress --vcf_file site.10.subj.AG01753383.lab.AG01753383.iso.1.v0.12.4.per_sample.vcf
$ grep pncA site.10.subj.AG01753383.lab.AG01753383.iso.1.v0.12.4.per_sample.mutations.csv

Hence it looks like there is a bug when (i) the minor populations are being investigated and (ii) for variants which are called 1/1.

site.10.subj.AG01753383.lab.AG01753383.iso.1.v0.12.4.per_sample.vcf.gz minor_alleles.txt

JeremyWesthead commented 1 year ago

I think this is due to the VCF specifying that it is a 1/1 call, despite having a MIN_FRS filter fail. When the minor populations flag is used, it also implicitly uses the ignore_filter flag to ensure that such rows make it to a point where the minor populations are found. However, as this has a non-wild-type call if you ignore the filter fail - it results in it being reported as an actual call.

I'll do some digging - hopefully it shouldn't be too difficult to fix