oxfordmmm / gnomonicus

Python code to integrate results of tb-pipeline and provide an antibiogram, mutations and variants
Other
5 stars 0 forks source link

fails if a minor allele has equal numbers of reads in REF and ALT #18

Closed philipwfowler closed 1 year ago

philipwfowler commented 1 year ago

site.10.subj.YA00099926.lab.YA00099926.iso.1.v0.12.4.per_sample.vcf.zip

If I try and process the above VCF via

gnomonicus --vcf_file site.10.subj.YA00099926.lab.YA00099926.iso.1.v0.12.4.per_sample.vcf --genome_object ~/packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3.gbk --catalogue_file ~/packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3_WHO-UCN-GTB-PCI-2021.7_v1.0_GARC1_RUS.csv  --csvs all --json --output_dir . --minor_populations ../minor_alleles.txt

then I get

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/gnomonicus", line 118, in <module>
    variants = populateVariants(vcfStem, options.output_dir, diff, make_variants_csv, catalogue=resistanceCatalogue)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/gnomonicus/gnomonicus_lib.py", line 131, in populateVariants
    variants = pd.concat([variants, minority_population_variants(diff, catalogue)])
  File "/home/ubuntu/.local/lib/python3.10/site-packages/gnomonicus/gnomonicus_lib.py", line 399, in minority_population_variants
    assert added, f"The index of the VCF evidence could not be determined! {variant_} --> {vcf}"
AssertionError: The index of the VCF evidence could not be determined! 1472242c>t:1.0 --> {'GT': (0, 0), 'DP': 9.0, 'ALLELE_DP': (9.0, 9.0), 'FRS': 0.5, 'COV_TOTAL': 18, 'COV': (9, 9), 'GT_CONF': 0.0, 'GT_CONF_PERCENTILE': 0.02, 'POS': 1472242, 'REF': 'c', 'ALTS': ('t',)}

I've tried manually altering a row so the n_reads are different and it gets further before hitting another row and then hitting the same assertion.

minor_alleles.txt

JeremyWesthead commented 1 year ago

This was not actually caused by an equal number of REF/ALT reads. It was a bug due to using DP to provide the total number of reads (which is as it is defined in the VCF spec). This sample uses DP as the Mean read depth of called allele (ie the ALLELE_DP of the called allele) - causing issues. Work around for this is to use sum(vcf['COV']) instead of DP

This has a corresponding gumpy fix as DP was also used during VCF parsing, so this will not immediately be available on PyPi

martinghunt commented 1 year ago

Sorry, this is my fault for not following the VCF spec for DP in minos. It's complicated.

The COV entry has the total reads mapped to the variant site. But because the method is using a tool that maps reads multi-mapped to a graph, and we want to handle indels properly, we don't actually know DP defined as "read depth at this position" (unless the variant site is a simple SNP). We can only calculate mean depth per allele. And for non-snps, you can't rely on COV being the total depth, all we know is it's the total reads intersecting the site (alleles could be 10s/100s/1000s bp long).

To make things more complicated, a single read can be mapped to >1 allele (eg the alleles are long, 2 of them share a common prefix, and the read starts before the site and only overlaps the start of the alleles).

I'm up for changing the the format of the VCF, am open to suggestions to how, but not sure what that would be.