oxfordmmm / gnomonicus

Python code to integrate results of tb-pipeline and provide an antibiogram, mutations and variants

No entry found in the catalogue #5

Closed alantsangmb closed 1 year ago

alantsangmb commented 1 year ago

Hello, thank you for developing gnomonicus. I am trying to use gnomonicus to process the VCF files produced by clockwork and run them against the catalogue file "NC_000962.3_WHO-UCH-GTB-PCI-2021.7_v1.0_GARC1_RUS.csv". Most of the test cases complete successfully; however, for one sample I encounter the following error, reporting that there is no entry found in the catalogue for katG@N660HINH.

I have checked the data with another variant caller and there is an A1978C (Asn660His) mutation in the katG gene. Does gnomonicus stop and exit the analysis whenever a mutation is not found in the catalogue? Thank you.

100%|███████████████████████████████████| 1650/1650 [00:00<00:00, 358766.28it/s]
100%|███████████████████████████████████████| 1079/1079 [03:59<00:00, 4.50it/s]
 76%|███████████████████████████▎ | 1317/1737 [00:00<00:00, 11020.29it/s]
Traceback (most recent call last):
  File "/usr/local/bin/gnomonicus", line 84, in <module>
    effects, metadata = populateEffects(options.output_dir, resistanceCatalogue, mutations, referenceGenes, vcfStem)
  File "/usr/local/lib/python3.10/dist-packages/gnomonicus/gnomonicus.py", line 455, in populateEffects
    prediction = resistanceCatalogue.predict(gene+'@'+mutation)
  File "/usr/local/lib/python3.10/dist-packages/piezo/catalogue.py", line 33, in predict
    return predict(self.catalogue, mutation=mutation, verbose=verbose)
  File "/usr/local/lib/python3.10/dist-packages/piezo/catalogue.py", line 130, in predict
    result = predict_GARC1(catalogue,mutation,verbose)
  File "/usr/local/lib/python3.10/dist-packages/piezo/grammar_GARC1.py", line 176, in predict_GARC1
    raise ValueError("No entry found in the catalogue for "+gene_mutation+compound)
ValueError: No entry found in the catalogue for katG@N660HINH

JeremyWesthead commented 1 year ago

Hi,

I think this is actually an issue with the catalogue. I have just checked and the latest version of that catalogue is (erroneously) missing some of the default rules. The missing rule in question here, which should be present, is katG@*?. The AMR catalogues for use with this should cover generic cases with default predictions for genes which confer resistance, so this behavior is actually a feature for finding such catalogue issues rather than a bug.
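As a rough illustration of how a gap like this could be spotted before a run, here is a minimal sketch. It assumes the catalogue rules are already available as plain gene@mutation strings. The helper name genes_missing_default and the example rules are hypothetical, and nothing here is part of piezo or gnomonicus.

```python
def genes_missing_default(rules, default="*?"):
    """Return genes that have at least one rule but no `gene@<default>` rule.

    `rules` is an iterable of strings such as "katG@S315T" or "rpoB@*?".
    Hypothetical helper for illustration; not part of piezo or gnomonicus.
    """
    genes = set()
    with_default = set()
    for rule in rules:
        gene, _, mutation = rule.partition("@")
        genes.add(gene)
        if mutation == default:
            with_default.add(gene)
    return sorted(genes - with_default)


# Illustrative rules only, not real catalogue contents
example_rules = ["katG@S315T", "rpoB@S450L", "rpoB@*?", "rpoB@*="]
print(genes_missing_default(example_rules))  # ['katG']
```

A gene reported by this check would trigger the ValueError above for any mutation that is not listed explicitly.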

I have made this fix within the catalogue parsing, but it may take a few days to be published onto the tuberculosis_amr_catalogues repo. Could you try this run again with this version of the catalogue to check if this fixes it?

alantsangmb commented 1 year ago

This version of the catalogue fixes it. Thank you so much!

JeremyWesthead commented 1 year ago

@philipwfowler has since found an error in the catalogue version which I sent. For a currently unknown reason, some mutations such as katG@371_del_g,R were removed. I'm currently looking into why and fixing it, but I advise you not to use this catalogue for anything in the meantime.

JeremyWesthead commented 1 year ago

An updated version of the WHO catalogue can now be found in the tuberculosis_amr_catalogues repo here. This should fix the issue, but let me know if you find any other problems.

alantsangmb commented 1 year ago

Hi, I noticed the effect was assigned as "U" for some mutations (e.g. rpsL L81R, gid L74R, katG W161L, and embB W1089G) in these drug-related genes, but these mutations are not listed in the WHO Mutation Catalogue. I would like to make sure I understand it correctly: they are assigned as "U" because of the rules rpsL@?, gid@?, katG@? and embB@?, but they are in fact not included in the WHO Mutation Catalogue. I ask because I would like to differentiate between mutations that are identified as "uncertain significance" in the WHO Mutation Catalogue and mutations that are not classified in the catalogue at all.

I would also like to ask what is the meaning of the rules having an equal sign ("="), e.g. rpsL@*=. What are some examples of mutations that match this pattern/rule?

Thank you in advance, and sorry for the stupid questions.

JeremyWesthead commented 1 year ago

Hi,

The WHO catalogue (and other catalogues) consist of specific mutations which confer phenotypes, such as rpoB@S450L. They also include default rules so that any mutation compared to the catalogue is able to retrieve a prediction. These are based on previous literature and logic - originating during the CRyPTIC project, and specify that for mutations other than ones explicitly specified:

In the case of the WHO catalogue, the publication refers to a set of expert rules which further extend these default rules; some of these are specific and some are more general. For example, rpoB@S431? is an expert rule, but so is pncA@-126_del_c. Differentiating between the mutations within the catalogue and those covered by the default/expert rules is therefore an ambiguous task and is not recommended. The default rules also never specify R, and so should only give uncertainty at most (which would otherwise be a logical conclusion).

Rules of the form gene@*= denote any synonymous amino acid mutation in this gene. These are therefore default rules which usually denote S. There is a priority handling, so that such default rules are only used if there is not a more specific mutation in the catalogue. An example of this is fabG1@L203L, which confers resistance. It would otherwise be covered by fabG1@*=, but because the specific entry takes priority it produces a prediction of R as expected.
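To make the priority handling concrete, here is a simplified sketch of the idea. It is not piezo's actual implementation. The rule table is a toy, toy_predict is a hypothetical function, and it only handles single amino acid substitutions. It assumes that ? matches non-synonymous changes only and = matches synonymous changes only.

```python
import re

# Toy rule table (rule -> prediction); illustrative, not real catalogue contents
TOY_RULES = {
    "fabG1@L203L": "R",  # specific entry
    "fabG1@*=": "S",     # default: any synonymous mutation in fabG1
    "fabG1@*?": "U",     # default: any non-synonymous mutation in fabG1
}

# Single amino acid substitution such as L203L or N660H ("!" = stop codon)
SUBSTITUTION = re.compile(r"^([A-Z!])(-?\d+)([A-Z!])$")


def toy_predict(rules, gene_mutation):
    """Return the prediction of the most specific matching rule."""
    # 1. An exact entry always wins over any wildcard rule
    if gene_mutation in rules:
        return rules[gene_mutation]

    gene, _, mutation = gene_mutation.partition("@")
    match = SUBSTITUTION.match(mutation)
    if match is None:
        raise ValueError("Toy matcher only handles substitutions: " + gene_mutation)
    ref, pos, alt = match.groups()

    # Assumption: "=" covers synonymous changes, "?" non-synonymous ones
    wildcard = "=" if ref == alt else "?"

    # 2. Position-specific wildcard (e.g. rpoB@S431?), then 3. gene-wide default
    for rule in (f"{gene}@{ref}{pos}{wildcard}", f"{gene}@*{wildcard}"):
        if rule in rules:
            return rules[rule]

    # 4. No rule at all, mirroring the error seen earlier in this thread
    raise ValueError("No entry found in the catalogue for " + gene_mutation)


print(toy_predict(TOY_RULES, "fabG1@L203L"))  # R (specific entry)
print(toy_predict(TOY_RULES, "fabG1@A10A"))   # S (falls back to fabG1@*=)
print(toy_predict(TOY_RULES, "fabG1@A10V"))   # U (falls back to fabG1@*?)
```

Calling toy_predict(TOY_RULES, "katG@N660H") raises ValueError, because the toy table has no katG rules at all.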

A more detailed breakdown of GARC (the nomenclature used for mutations) can be found here

alantsangmb commented 1 year ago

Thank you so much for explaining in detail.

Sorry that I missed an asterisk in my previous question to denote the wildcard expression. So the mutations in these genes will be labelled as uncertain by gnomonicus because they match the patterns rpsL@*?, gid@*?, katG@*? and embB@*?, even if the specific non-synonymous mutations are not listed in the WHO Mutation Catalogue?

philipwfowler commented 1 year ago

Yes, that is correct. The idea is that any mutation in a resistance gene that is not in the catalogue automatically gets returned with an "unknown" phenotype. In fact, most of the rows in version 1 of the WHO catalogue are unknown ("3"), and the version we have parsed and made available on GitHub drops these since they will be picked up by these general rules.

alantsangmb commented 1 year ago

Thank you so much