oxfordmmm / gnomonicus

Python code to integrate results of tb-pipeline and provide an antibiogram, mutations and variants

edge case when VCF contains no valid rows #22

Closed philipwfowler closed 1 year ago

philipwfowler commented 1 year ago

This is very much an edge case and only minor, but in testing on the 4,124 South African VCF files, one has only three rows: one is wildtype 0/0 and the other two fail a filter, so the sample is identical to the H37Rv reference. This could happen more often than we think in practice, because labs may deliberately sequence H37Rv as a control. At present gnomonicus writes out no CSV files (which is ok) and the JSON contains the following:

{
  "meta": {
    "status": "success",
    "workflow_name": "gnomonicus",
    "workflow_version": "2.0.0",
    "workflow_task": "resistance_prediction",
    "guid": "site.10.subj.UH01321968.lab.UH01321968.iso.1.v0.12.4.per_sample",
    "UTC-datetime-completed": "2023-07-12T07:57:57.657225",
    "time_taken_s": 4.025347709655762,
    "reference": "NC_000962",
    "catalogue_file": "/home/ubuntu/packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3_WHO-UCN-GTB-PCI-2021.7_v1.0_GARC1_RUS.csv",
    "reference_file": "/home/ubuntu/packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3.gbk",
    "vcf_file": "dat/CRyPTIC2/V2/10/UH01321968/UH01321968/1/per_sample/site.10.subj.UH01321968.lab.UH01321968.iso.1.v0.12.4.per_sample.vcf",
    "catalogue_type": "RUS",
    "catalogue_name": "WHO-UCN-GTB-PCI-2021.7",
    "catalogue_version": 1.0
  },
  "data": {
    "variants": [],
    "antibiogram": {
      "RIF": "S",
      "INH": "S",
      "EMB": "S",
      "PZA": "S",
      "LEV": "S",
      "MXF": "S",
      "LZD": "S",
      "DLM": "S",
      "AMI": "S",
      "STM": "S",
      "ETH": "S",
      "KAN": "S",
      "CAP": "S"
    }
  }
}

I think this is all ok, but should we have empty "mutations" and "effects" blocks within "data"? i.e.

  "data": {
    "variants": [],
    "mutations": [],
    "effects": [],
    "antibiogram": {
      "RIF": "S",
      "INH": "S",
      "EMB": "S",
      "PZA": "S",
...
    }
  }
}

site.10.subj.UH01321968.lab.UH01321968.iso.1.v0.12.4.per_sample.vcf.zip
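
For anyone wanting to spot this kind of sample up front, here is a minimal sketch that counts the VCF rows which both pass the filter and carry a non-reference genotype. It is not how gnomonicus reads VCFs; `count_variant_rows` is a hypothetical helper, and it assumes a single-sample VCF where the row-level FILTER column is `PASS` or `.` for valid rows and the genotype sits in the `GT` field.

```python
def count_variant_rows(vcf_text: str) -> int:
    """Count rows that pass FILTER and have a non-reference genotype.

    Hypothetical helper: assumes a single-sample VCF and that a row is
    valid when its FILTER column is "PASS" or ".".
    """
    count = 0
    for line in vcf_text.splitlines():
        # Skip blank lines and header lines (both "##" and "#CHROM").
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns, so no genotype to inspect
        filt, fmt, sample = fields[6], fields[8], fields[9]
        if filt not in ("PASS", "."):
            continue  # row fails a filter
        # Pair FORMAT keys with the sample's values to find the genotype.
        genotype = dict(zip(fmt.split(":"), sample.split(":"))).get("GT", "./.")
        alleles = genotype.replace("|", "/").split("/")
        # Any allele other than reference (0) or missing (.) is a variant call.
        if any(allele not in ("0", ".") for allele in alleles):
            count += 1
    return count
```

A result of zero would flag a sample like the one attached above, which is indistinguishable from the reference.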

JeremyWesthead commented 1 year ago

Currently the mutations/effects are optional fields, only populated when there are values to go in them. If a sample has exclusively intergene variation, no gene-centric mutations exist, and no effects exist, so the fields aren't populated. I can change this to default to empty lists instead, as it may make parsing easier.
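
As a rough illustration of the defaulting described above, the sketch below fills in the missing blocks on an already-loaded report so that downstream code can iterate without checking for key presence. `with_default_blocks` is a hypothetical name, not part of gnomonicus, and it assumes the shape proposed in this issue, where "variants", "mutations" and "effects" are all lists under "data".

```python
import json


def with_default_blocks(report: dict) -> dict:
    """Ensure the list-valued blocks under "data" exist, defaulting to [].

    Hypothetical helper: modifies `report` in place and returns it.
    Existing values are left untouched.
    """
    data = report.setdefault("data", {})
    for key in ("variants", "mutations", "effects"):
        data.setdefault(key, [])
    return report


# Example with a report shaped like the one in this issue (abbreviated).
raw = '{"data": {"variants": [], "antibiogram": {"RIF": "S", "INH": "S"}}}'
report = with_default_blocks(json.loads(raw))

# Consumers can now loop over every block unconditionally.
for mutation in report["data"]["mutations"]:
    print(mutation)
```

Until the output changes, a consumer can get the same effect with `report["data"].get("mutations", [])` at each point of use.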