CGE phenotype not populated for E. faecium pbp5 when all essential mutations are present

Hello,

I have two Enterococcus faecium assemblies that I have run through staramr v0.10.0 using the default databases with --pointfinder-organism enterococcus_faecium --pid-threshold 90. Assembly1 has all 19 pbp5 mutations that contribute to ampicillin resistance according to the CGE key. Assembly2 has 18/19 mutations and is missing the pbp5 (P667S) mutation, which from my understanding is a mandatory mutation listed in the complex pbp5 mutations file.

The detailed summary output for assembly 1 (all 19/19 pbp5 mutations) shows the pbp5 mutations aggregated onto a single line, with the CGE Predicted Phenotype left blank:

Isolate ID	Data	Data Type	Predicted Phenotype	CGE Predicted Phenotype	%Identity	%Overlap	HSP Length/Total Length	Contig	Start	End	Accession
assembly1	pbp5 (A216S), pbp5 (A499T), pbp5 (A68T), pbp5 (D204G), pbp5 (E100Q), pbp5 (E525D), pbp5 (E629V), pbp5 (E85D), pbp5 (G66E), pbp5 (K144Q), pbp5 (L177I), pbp5 (M485A), pbp5 (N496K), pbp5 (P667S), pbp5 (R34Q), pbp5 (S27G), pbp5 (T172A), pbp5 (T324A), pbp5 (V24A)	Resistance	ampicillin		94.56	100.15	2040/2037	contig_104	44961	42922

The detailed summary output for assembly 2 (18/19 pbp5 mutations, missing the mandatory P667S) lists the pbp5 mutations separately, one per line, with the CGE Predicted Phenotype populated:

Isolate ID	Data	Data Type	Predicted Phenotype	CGE Predicted Phenotype	%Identity	%Overlap	HSP Length/Total Length	Contig	Start	End
assembly2	pbp5 (A216S)	Resistance	unknown[pbp5 (A216S)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (A499T)	Resistance	unknown[pbp5 (A499T)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (A68T)	Resistance	unknown[pbp5 (A68T)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (D204G)	Resistance	unknown[pbp5 (D204G)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (E100Q)	Resistance	unknown[pbp5 (E100Q)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (E525D)	Resistance	unknown[pbp5 (E525D)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (E629V)	Resistance	unknown[pbp5 (E629V)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (E85D)	Resistance	unknown[pbp5 (E85D)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (G66E)	Resistance	unknown[pbp5 (G66E)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (K144Q)	Resistance	unknown[pbp5 (K144Q)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (L177I)	Resistance	unknown[pbp5 (L177I)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (M485A)	Resistance	unknown[pbp5 (M485A)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (N496K)	Resistance	unknown[pbp5 (N496K)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (R34Q)	Resistance	unknown[pbp5 (R34Q)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (S27G)	Resistance	unknown[pbp5 (S27G)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (T172A)	Resistance	unknown[pbp5 (T172A)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (T324A)	Resistance	unknown[pbp5 (T324A)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354
assembly2	pbp5 (V24A)	Resistance	unknown[pbp5 (V24A)]	Ampicillin	94.52	95.83	1952/2037	contig_119	2403	4354

It's not clear to me why the CGE phenotype column for assembly1 is left blank when according to the CGE key these 19 mutations are required for ampicillin resistance, and yet the CGE phenotype column is populated "Ampicillin" for assembly2 when one of the essential mutations is missing. It seems the Predicted Phenotype column is calling them as I would expect: "ampicillin" for assembly1, and "unknown[pbp5 (mutation)]" for assembly 2.

What's happening is we have a process that explicitly looks for complex mutations (as defined in the complex mutations file) that replaces a set of individually reported point mutations with a single complex mutation if there's an appropriate match.

Individual Enterococcus faecium pbp5 Point Mutations

In your example with 18 Enterococcus faecium pbp5 point mutations, there aren't enough mutations to perform one of the replacements defined in the complex mutations file. So the 18 pbp5 point mutations remain listed individually, and information associated with each of them is pulled from the CGE Pointfinder Enterococcus faecium database and provided to the user of StarAMR.

The CGE Pointfinder database treats these 18 mutations the same as every other point mutation (listing Ampicillin as the conferred resistance), but expects the user to read the entry in the Notes column for these mutations, which reads:

The nineteen pbp5 mutations must be present simultaneously for resistance phenotype.

We show this note from CGE in StarAMR's pointfinder.tsv output file, which would look something like this (showing 1 non-header row instead of many):

Isolate ID	Gene	Predicted Phenotype	CGE Predicted Phenotype	Type	Position	Mutation	%Identity	%Overlap	HSP Length/Total Length	Contig	Start	End	Pointfinder Position	CGE Notes	CGE Required Mutation	CGE Mechanism	CGE PMID
pbp5_19_failure	pbp5 (A216S)	unknown[pbp5 (A216S)]	Ampicillin	codon	216	GCA -> AGT (A -> S)	98.38	100.00	2037/2037	pbp5_1_AAK43724.1	1	2037	A216S	The nineteen pbp5 mutations must be present simultaneously for resistance phenotype		Target modification	25182648

So in this case we're showing the CGE-predicted phenotype as ampicillin because that's what CGE reports for these exact mutations. We show the note that CGE provides alongside these predictions in StarAMR's pointfinder.tsv output file but, I believe, not in detailed_summary.tsv, to keep that file less cluttered. The detailed summary could be changed to show the CGE Notes column though.
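Until then, the notes can be pulled out of pointfinder.tsv directly. The following is a minimal sketch, not part of StarAMR: the helper name `cge_notes_by_gene` is hypothetical, and the sample table is an abbreviated stand-in that keeps only four of the columns shown above.

```python
import csv
import io
from typing import Dict, TextIO


def cge_notes_by_gene(handle: TextIO) -> Dict[str, str]:
    """Map each reported mutation (the Gene column) to its non-empty CGE note."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Gene"]: row["CGE Notes"] for row in reader if row.get("CGE Notes")}


# Abbreviated stand-in for pointfinder.tsv (assumption: only four columns kept).
sample = "\n".join(
    "\t".join(fields)
    for fields in [
        ("Isolate ID", "Gene", "CGE Predicted Phenotype", "CGE Notes"),
        (
            "assembly2",
            "pbp5 (A216S)",
            "Ampicillin",
            "The nineteen pbp5 mutations must be present simultaneously "
            "for resistance phenotype",
        ),
    ]
)

for mutation, note in cge_notes_by_gene(io.StringIO(sample)).items():
    print(f"{mutation}: {note}")
```

For a real run, the same function should work on an open file handle for the pointfinder.tsv in your StarAMR output directory.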

Complex Enterococcus faecium pbp5 Point Mutations

In your example with 19 Enterococcus faecium pbp5 point mutations, there are enough mutations to perform one of the replacements, so the 19 pbp5 mutations are replaced with a single complex mutation. These complex mutations are defined in the complex mutations file. The resulting complex mutation does not list ampicillin as a CGE Predicted Phenotype because CGE does not define the complex mutation the same way we do. CGE describes the complex mutation as:

The nineteen pbp5 mutations must be present simultaneously for resistance phenotype

In this case, meaning exactly the following mutations:

V24A
S27G
R34Q
G66E
A68T
E85D
E100Q
K144Q
T172A
L177I
D204G
A216S
T324A
M485A,T
N496K
A499T,I
E525D
E629V
P667S

With (to my interpretation) V586L being non-essential:

pbp5 pbp5 586 GTA V L Ampicillin 25182648 Target modification Not essential for resistance phenotype

Whereas our complex mutations file defines either:

M485A
E629V
P667S

M485T
E629V
P667S

as mandatory, and then grabs all other related mutations and collapses them into a single complex mutation.
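As a rough illustration of that replacement step (a simplified sketch, not StarAMR's actual code; `collapse_pbp5` and the constant names are hypothetical), the logic looks something like this, using the mutations from the two assemblies in the question:

```python
from typing import List, Set

# Either mandatory combination is enough to trigger the replacement.
MANDATORY_SETS: List[Set[str]] = [
    {"M485A", "E629V", "P667S"},
    {"M485T", "E629V", "P667S"},
]

# The 19 pbp5 mutations reported for assembly1 in the question.
ASSEMBLY1 = [
    "A216S", "A499T", "A68T", "D204G", "E100Q", "E525D", "E629V",
    "E85D", "G66E", "K144Q", "L177I", "M485A", "N496K", "P667S",
    "R34Q", "S27G", "T172A", "T324A", "V24A",
]
# assembly2 carries the same mutations except P667S.
ASSEMBLY2 = [m for m in ASSEMBLY1 if m != "P667S"]


def collapse_pbp5(observed: List[str]) -> List[str]:
    """Return one combined entry if any mandatory set is fully present;
    otherwise return the mutations individually, one entry each."""
    present = set(observed)
    if any(required <= present for required in MANDATORY_SETS):
        return [", ".join(f"pbp5 ({m})" for m in sorted(present))]
    return [f"pbp5 ({m})" for m in observed]


print(len(collapse_pbp5(ASSEMBLY1)))  # single complex mutation
print(len(collapse_pbp5(ASSEMBLY2)))  # still listed one per mutation
```

Because assembly2 lacks P667S, neither mandatory set is satisfied and its 18 mutations stay on separate rows.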

Importantly, the CGE Pointfinder database does NOT make the same predictions we use in the complex mutations file (3 vs 19), so it would be wrong to list ampicillin under the CGE Predicted Phenotype column in your example. Instead, we list ampicillin in the adjacent Predicted Phenotype column in StarAMR's detailed_summary.tsv output file.

Summary

StarAMR is trying to wrap around CGE's databases (Pointfinder in this case) and relay this information back to the user of StarAMR. The complex mutation functionality implemented by StarAMR is an attempt to alleviate some issues with how CGE reports this data by default (expecting users to interpret it themselves).

In your example with the complex mutation replacement, CGE is not predicting the mutations we've manually defined in the complex mutations file, so we report ampicillin under the Predicted Phenotype column instead of the CGE Predicted Phenotype column in the detailed_summary.tsv output file. There is maybe an argument that if the exact 19 mutations specified by CGE are present, then in that specific case ampicillin should appear under the CGE Predicted Phenotype, but not in the case of the other predicted complex mutations. However, this would add an exception to the exception, so I think it would have to be a priority before we made that change.
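For what it's worth, the check for that special case would be small. The sketch below is hypothetical and not implemented in StarAMR; it assumes that entries such as M485A,T and A499T,I in CGE's list mean "either substitution at that position", and it uses N496K as reported in the assembly1 output row.

```python
from typing import Iterable, List, Set

# One set of accepted substitutions per position in CGE's nineteen-mutation
# rule (assumption: comma-separated entries are alternatives).
CGE_NINETEEN: List[Set[str]] = [
    {"V24A"}, {"S27G"}, {"R34Q"}, {"G66E"}, {"A68T"}, {"E85D"},
    {"E100Q"}, {"K144Q"}, {"T172A"}, {"L177I"}, {"D204G"}, {"A216S"},
    {"T324A"}, {"M485A", "M485T"}, {"N496K"}, {"A499T", "A499I"},
    {"E525D"}, {"E629V"}, {"P667S"},
]


def has_all_nineteen(observed: Iterable[str]) -> bool:
    """True only if every one of the nineteen positions is covered."""
    present = set(observed)
    return all(alternatives & present for alternatives in CGE_NINETEEN)


assembly1 = [
    "A216S", "A499T", "A68T", "D204G", "E100Q", "E525D", "E629V",
    "E85D", "G66E", "K144Q", "L177I", "M485A", "N496K", "P667S",
    "R34Q", "S27G", "T172A", "T324A", "V24A",
]
assembly2 = [m for m in assembly1 if m != "P667S"]

print(has_all_nineteen(assembly1))
print(has_all_nineteen(assembly2))
```

The cost is less in the check itself than in maintaining a hand-written list like `CGE_NINETEEN` whenever CGE updates its database.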

In your example with the incomplete complex mutation, CGE reports these mutations like any other, showing Ampicillin under the Resistance column, but expects the user to read the Notes column. We show CGE's notes in StarAMR's pointfinder.tsv output file but not in the detailed_summary.tsv output file. It would be possible to hard-code these mutations so they are never reported unless a complete complex set of them is found, but again, that means writing yet another exception to CGE's Pointfinder data (which has many), so I think it would have to be considered a priority to make such a change.

There are of course solutions to these problems, but a risk with writing many specific exceptions and special cases is that it makes it harder to update the databases used by StarAMR when CGE updates the databases on their end. It's ultimately a matter of priorities though.
