phac-nml / staramr

Scans genome contigs against the ResFinder, PlasmidFinder, and PointFinder databases.
Apache License 2.0

CGE phenotype not populated for E. faecium pbp5 when all essential mutations are present #210

Open lerminin opened 5 months ago

lerminin commented 5 months ago

Hello,

I have two Enterococcus faecium assemblies that I ran through staramr v0.10.0 using the default databases with --pointfinder-organism enterococcus_faecium --pid-threshold 90. Assembly1 has all 19 pbp5 mutations that contribute to ampicillin resistance according to the CGE key. Assembly2 has 18/19 mutations and is missing the pbp5 (P667S) mutation, which, as I understand it, is a mandatory mutation listed in the complex pbp5 mutations file.

The detailed summary output for assembly1 (all 19/19 pbp5 mutations) shows the pbp5 mutations aggregated into a single line, with the CGE Predicted Phenotype left blank:

| Isolate ID | Data | Data Type | Predicted Phenotype | CGE Predicted Phenotype | %Identity | %Overlap | HSP Length/Total Length | Contig | Start | End | Accession |
|---|---|---|---|---|---|---|---|---|---|---|---|
| assembly1 | pbp5 (A216S), pbp5 (A499T), pbp5 (A68T), pbp5 (D204G), pbp5 (E100Q), pbp5 (E525D), pbp5 (E629V), pbp5 (E85D), pbp5 (G66E), pbp5 (K144Q), pbp5 (L177I), pbp5 (M485A), pbp5 (N496K), pbp5 (P667S), pbp5 (R34Q), pbp5 (S27G), pbp5 (T172A), pbp5 (T324A), pbp5 (V24A) | Resistance | ampicillin | | 94.56 | 100.15 | 2040/2037 | contig_104 | 44961 | 42922 | |

The detailed summary output for assembly2 (18/19 pbp5 mutations, missing the mandatory P667S) lists the pbp5 mutations separately, one per line, with the CGE Predicted Phenotype populated:

| Isolate ID | Data | Data Type | Predicted Phenotype | CGE Predicted Phenotype | %Identity | %Overlap | HSP Length/Total Length | Contig | Start | End | Accession |
|---|---|---|---|---|---|---|---|---|---|---|---|
| assembly2 | pbp5 (A216S) | Resistance | unknown[pbp5 (A216S)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (A499T) | Resistance | unknown[pbp5 (A499T)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (A68T) | Resistance | unknown[pbp5 (A68T)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (D204G) | Resistance | unknown[pbp5 (D204G)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (E100Q) | Resistance | unknown[pbp5 (E100Q)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (E525D) | Resistance | unknown[pbp5 (E525D)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (E629V) | Resistance | unknown[pbp5 (E629V)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (E85D) | Resistance | unknown[pbp5 (E85D)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (G66E) | Resistance | unknown[pbp5 (G66E)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (K144Q) | Resistance | unknown[pbp5 (K144Q)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (L177I) | Resistance | unknown[pbp5 (L177I)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (M485A) | Resistance | unknown[pbp5 (M485A)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (N496K) | Resistance | unknown[pbp5 (N496K)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (R34Q) | Resistance | unknown[pbp5 (R34Q)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (S27G) | Resistance | unknown[pbp5 (S27G)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (T172A) | Resistance | unknown[pbp5 (T172A)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (T324A) | Resistance | unknown[pbp5 (T324A)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |
| assembly2 | pbp5 (V24A) | Resistance | unknown[pbp5 (V24A)] | Ampicillin | 94.52 | 95.83 | 1952/2037 | contig_119 | 2403 | 4354 | |

It's not clear to me why the CGE phenotype column for assembly1 is left blank when according to the CGE key these 19 mutations are required for ampicillin resistance, and yet the CGE phenotype column is populated "Ampicillin" for assembly2 when one of the essential mutations is missing. It seems the Predicted Phenotype column is calling them as I would expect: "ampicillin" for assembly1, and "unknown[pbp5 (mutation)]" for assembly 2.

emarinier commented 5 months ago

StarAMR has a step that explicitly looks for complex mutations (as defined in the complex mutations file) and, when there is an appropriate match, replaces the set of individually reported point mutations with a single complex mutation.
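As a rough illustration of that replacement step (not StarAMR's actual code), the sketch below collapses individually reported mutations into one entry when every mandatory mutation of a rule is present. The `ComplexRule` type, its field names, and the made-up `geneX` rule are all hypothetical; the real rules live in the complex mutations file and may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplexRule:
    """Hypothetical rule; not the real complex mutations file format."""
    related: frozenset    # every mutation the rule may absorb
    mandatory: frozenset  # subset that must all be present
    phenotype: str


def collapse(observed, rule):
    """Return (entries, phenotype) after trying to apply one rule.

    If every mandatory mutation was observed, all observed mutations
    related to the rule are merged into a single comma-joined entry.
    Otherwise the observed mutations are returned unchanged and the
    phenotype is None.
    """
    observed = list(observed)
    if not rule.mandatory.issubset(observed):
        return observed, None
    merged = sorted(m for m in observed if m in rule.related)
    rest = [m for m in observed if m not in rule.related]
    return [", ".join(merged)] + rest, rule.phenotype


# Made-up rule for illustration only.
rule = ComplexRule(
    related=frozenset({"geneX (A1B)", "geneX (C2D)", "geneX (E3F)"}),
    mandatory=frozenset({"geneX (A1B)", "geneX (C2D)"}),
    phenotype="drugY",
)

# All mandatory mutations present: collapsed into one entry.
full = collapse(["geneX (A1B)", "geneX (C2D)", "geneX (E3F)"], rule)
# One mandatory mutation missing: left as individual entries.
partial = collapse(["geneX (A1B)", "geneX (E3F)"], rule)
```

In this toy model, `full` plays the role of assembly1 (one aggregated line) and `partial` plays the role of assembly2 (mutations left listed individually).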

Individual Enterococcus faecium pbp5 Point Mutations

In your example with 18 Enterococcus faecium pbp5 point mutations, there aren't enough mutations to perform one of the replacements defined in the complex mutations file. So the 18 pbp5 point mutations remain listed individually, and information associated with each of them is pulled from the CGE Pointfinder Enterococcus faecium database and provided to the user of StarAMR.

The CGE Pointfinder database treats these 18 mutations the same as every other point mutation (listing Ampicillin as the conferred resistance), but expects the user to read the entry in the Notes column for these mutations, which reads:

The nineteen pbp5 mutations must be present simultaneously for resistance phenotype.

We show this note from CGE in StarAMR's pointfinder.tsv output file, which would look something like this (showing 1 non-header row instead of many):

| Isolate ID | Gene | Predicted Phenotype | CGE Predicted Phenotype | Type | Position | Mutation | %Identity | %Overlap | HSP Length/Total Length | Contig | Start | End | Pointfinder Position | CGE Notes | CGE Required Mutation | CGE Mechanism | CGE PMID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pbp5_19_failure | pbp5 (A216S) | unknown[pbp5 (A216S)] | Ampicillin | codon | 216 | GCA -> AGT (A -> S) | 98.38 | 100.00 | 2037/2037 | pbp5_1_AAK43724.1 | 1 | 2037 | A216S | The nineteen pbp5 mutations must be present simultaneously for resistance phenotype | | Target modification | 25182648 |

So in this case we're showing the CGE-predicted phenotype as ampicillin because that's what CGE reports for these exact mutations. We show the note that CGE provides alongside these predictions in StarAMR's pointfinder.tsv output file but, I believe, not in detailed_summary.tsv, to keep that file less cluttered. The detailed summary could be changed to show the CGE Notes column, though.
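Until the notes appear in the detailed summary, a user can pull them out of pointfinder.tsv directly. The following is a minimal sketch using only the Python standard library; it assumes the file is tab-separated with the column names shown in the example row above, and the in-memory sample is a cut-down stand-in rather than real output.

```python
import csv
import io


def rows_with_notes(tsv_file):
    """Yield (gene, CGE phenotype, note) for rows with a non-empty CGE Notes field."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        note = (row.get("CGE Notes") or "").strip()
        if note:
            yield row["Gene"], row["CGE Predicted Phenotype"], note


# Tiny stand-in for pointfinder.tsv, reduced to the columns used here.
sample = io.StringIO(
    "Isolate ID\tGene\tCGE Predicted Phenotype\tCGE Notes\n"
    "pbp5_19_failure\tpbp5 (A216S)\tAmpicillin\t"
    "The nineteen pbp5 mutations must be present simultaneously "
    "for resistance phenotype\n"
)

flagged = list(rows_with_notes(sample))
for gene, phenotype, note in flagged:
    print(f"{gene}: CGE says {phenotype}, but note: {note}")
```

For a real run, the `io.StringIO` object would be replaced by the opened pointfinder.tsv from the StarAMR output directory (the exact path depends on how StarAMR was invoked).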

Complex Enterococcus faecium pbp5 Point Mutations

In your example with 19 Enterococcus faecium pbp5 point mutations, there are enough mutations to perform one of the replacements, so the 19 pbp5 mutations are replaced with a single complex mutation. These complex mutations are defined in the complex mutations file. The resulting complex mutation does not list ampicillin as a CGE Predicted Phenotype because CGE does not define the complex mutation the same way we do. CGE describes the complex mutation as:

The nineteen pbp5 mutations must be present simultaneously for resistance phenotype

In this case, meaning exactly the following mutations:

With (to my interpretation) V586L being non-essential:

pbp5 pbp5 586 GTA V L Ampicillin 25182648 Target modification Not essential for resistance phenotype

Whereas our complex mutations file defines either:

or

as mandatory and then grabs all other related mutations and collapses them into a single complex mutation.

Importantly, the CGE Pointfinder database does NOT make the same predictions we use in the complex mutations file (3 vs 19), so it would be wrong to list ampicillin under the CGE Predicted Phenotype column in your example. Instead, we list ampicillin in the adjacent Predicted Phenotype column in StarAMR's detailed_summary.tsv output file.

Summary

StarAMR wraps CGE's databases (Pointfinder in this case) and tries to relay their information back to the user. The complex mutation functionality implemented by StarAMR is an attempt to alleviate some issues with how CGE reports this data by default (expecting users to interpret it themselves).

In your example with the complex mutation replacement, CGE is not predicting the mutations we've manually defined in the complex mutations file, so we report the phenotype under the Predicted Phenotype column instead of the CGE Predicted Phenotype column in the detailed_summary.tsv output file. There is arguably a case that if the exact 19 mutations specified by CGE are present, then in that specific case ampicillin should appear under the CGE Predicted Phenotype, but not in the case of the other predicted complex mutations. However, this is adding an exception to the exception, so I think it would have to be a priority.
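For anyone who wants that specific check today, it can be done on the user side against the Data column of detailed_summary.tsv. The sketch below is only an assumption-laden illustration: the set of 19 mutations is copied from the assembly1 row earlier in this thread (not verified against the CGE database), and the helper names are hypothetical.

```python
# The 19 pbp5 mutations reported for assembly1 earlier in this thread.
# Assumption: this matches the set CGE's note refers to.
CGE_PBP5_SET = frozenset(
    "A216S A499T A68T D204G E100Q E525D E629V E85D G66E K144Q "
    "L177I M485A N496K P667S R34Q S27G T172A T324A V24A".split()
)


def parse_data_field(data):
    """Turn 'pbp5 (A216S), pbp5 (A499T)' into {'A216S', 'A499T'}."""
    found = set()
    for item in data.split(","):
        item = item.strip()
        if item.startswith("pbp5 (") and item.endswith(")"):
            found.add(item[len("pbp5 ("):-1])
    return found


def has_full_cge_set(data):
    """True only when all 19 listed mutations appear in the Data field."""
    return CGE_PBP5_SET <= parse_data_field(data)


# Build Data strings resembling the two assemblies in this thread.
assembly1_like = ", ".join(f"pbp5 ({m})" for m in sorted(CGE_PBP5_SET))
assembly2_like = ", ".join(
    f"pbp5 ({m})" for m in sorted(CGE_PBP5_SET - {"P667S"})
)

print(has_full_cge_set(assembly1_like))
print(has_full_cge_set(assembly2_like))
```

The check uses a subset test rather than strict equality, so an extra pbp5 mutation in the Data field does not cause a miss; whether that is the right behavior is a judgment call.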

In your example with the incomplete complex mutation, CGE reports these mutations like any other, showing Ampicillin under the Resistance column, but expects the user to read the Notes column. We show CGE's notes in StarAMR's pointfinder.tsv output file but not in the detailed_summary.tsv output file. It would be possible to hard-code these mutations to never be reported unless a complex set of them is found, but again, that means writing yet another exception to CGE's Pointfinder data (which has many), so I think it would have to be considered a priority to make such a change.

There are of course solutions to these problems, but a risk with writing many specific exceptions and special cases is that it makes it harder to update the databases used by StarAMR when CGE updates the databases on their end. It's ultimately a matter of priorities, though.