tseemann / abricate

:mag_right: :pill: Mass screening of contigs for antimicrobial and virulence genes
GNU General Public License v2.0

Add some sort of SCORE column #124

Open tseemann opened 4 years ago

tseemann commented 4 years ago

Hi

It is about getting Abricate output to record the BLAST bit scores. Currently we are engaging with the European Food Safety Authority (EFSA) on guidelines for WGS-based reports, and they want to put some sort of scoring into the report. Since I know the scores must be generated as part of compiling the Abricate end results, I was wondering if they could be retrieved and included in the final results table.

I ask this as we and others use Abricate extensively to report virulence and resistance genes.

Thanks for your time in advance

Tom Dunlop

dswan commented 4 years ago

I would like to second this request. I have previously made some local changes to Abricate to support FEEDAP/EFSA requirements.

The relevant EFSA (current) doc is:

https://efsa.onlinelibrary.wiley.com/doi/pdf/10.2903/j.efsa.2018.5206

Relevant sections:

2.2.2 WGS search for AMR genes

WGS should be interrogated for the presence of genes coding for or contributing to resistance to antimicrobials relevant to their use in humans and animals (CIAs or HIAs). For this purpose, a comparison against up‐to‐date databases should be performed (e.g. CARD, ARG‐ANNOT, ResFinder). The outcome of the analysis should be presented as a table focusing on complete genes coding for resistance to antimicrobials. The table should include at least the gene identification, function of the encoded protein, percentage of identity and e‐value.

2.4.1. Bacteria

For bacterial strains belonging to a species not included in the QPS list, WGS analysis should be used to identify genes coding for known virulence factors. For this purpose, comparison against specific up-to-date databases (e.g. VFDB, PAI DB, virDB, CGE) should be performed. The outcome of the analysis should be presented as a table focusing on complete genes encoding recognised virulence factors (e.g. toxins, invasion and adhesion factors) known to exist in the species or related species to which the strain belongs. The table should include at least the gene identification, function of the encoded protein, percentage of identity and e-value.

There is a consultation document that extends the WGS reporting requirements; I'm not sure that any of these impact Abricate, however.
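To make the table requirement concrete, here is a minimal sketch of pulling gene identification, percent identity, e-value and bit score out of BLAST tabular output. It assumes the default 12-column BLAST+ `-outfmt 6` layout, which is not necessarily what Abricate uses internally, and it assumes the contig is the query and the database gene is the subject. The function name `parse_blast6` and the example row are hypothetical, for illustration only.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(text):
    """Return one dict per hit with the fields an EFSA-style table asks for.

    Assumption: the contig is the query (qseqid) and the database gene
    is the subject (sseqid). Swap the two if your search runs the other way.
    """
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for fields in reader:
        # Skip blank lines and comment lines (as produced by -outfmt 7).
        if not fields or fields[0].startswith("#"):
            continue
        rec = dict(zip(BLAST6_COLUMNS, fields))
        rows.append({
            "contig": rec["qseqid"],
            "gene": rec["sseqid"],
            "identity": float(rec["pident"]),
            "evalue": float(rec["evalue"]),
            "bitscore": float(rec["bitscore"]),
        })
    return rows


# Made-up example row, not real output from any run.
EXAMPLE = "contig1\tblaTEM-1\t100.000\t861\t0\t0\t1200\t2060\t1\t861\t0.0\t1591\n"

if __name__ == "__main__":
    for hit in parse_blast6(EXAMPLE):
        print(hit["contig"], hit["gene"], hit["identity"],
              hit["evalue"], hit["bitscore"], sep="\t")
```

The "function of the encoded protein" column is not part of BLAST output and would have to come from the database annotation.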

andersgs commented 4 years ago

This is very interesting @dswan. Can I ask how the e-value is used? How would you change your interpretation if you had 100% identity and coverage but the e-value did not meet expectations? How do you account for changing e-value with changing databases? Do e-values get included in reports to epidemiologists and clinicians?

I should say, I am not overly enthusiastic about including this information. I fear it would be misleading and confusing. Moreover, I don't think our context is right for these measures. These scores were designed to find homology by identity, whereas we are interested in a very different question: we want to know if a gene in that DB is found in our genome of interest. The e-value and bit score, in particular, were designed with large, mainly random, and diverse sequence databases in mind. Meanwhile, the AMR DBs are relatively small and far from random, with lots of relatively similar sequences (in other words, we are likely overestimating the "independent" size of our database). To me, that makes the e-value rather meaningless in this context, because it is about how likely we would be to find an equal match in a random DB of equal size. More importantly, we are probably grossly overestimating the e-values in this particular context, and thus leading to overconfidence.
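The database-size dependence can be seen in a few lines. This is a minimal sketch using the standard relation between bit score and expected hits, E = m * n * 2^(-S'), where m and n are the query and database lengths and S' is the bit score. BLAST itself uses effective lengths with edge corrections, so real values will differ; the function name and the numbers below are hypothetical.

```python
def evalue_from_bitscore(bitscore, query_len, db_len):
    """Approximate e-value from a bit score: E = m * n * 2^(-S').

    Assumption: raw lengths are used for m and n; BLAST applies
    effective-length corrections that this sketch ignores.
    """
    return query_len * db_len * 2.0 ** (-bitscore)


if __name__ == "__main__":
    bitscore = 40.0    # hypothetical alignment score
    query_len = 1000   # hypothetical query length in bases

    # Same alignment, same bit score, two hypothetical database sizes.
    small_db = evalue_from_bitscore(bitscore, query_len, db_len=1_000_000)
    large_db = evalue_from_bitscore(bitscore, query_len, db_len=1_000_000_000)

    print(f"small DB e-value: {small_db:.3g}")
    print(f"large DB e-value: {large_db:.3g}")
```

The bit score stays fixed for a given alignment, while the e-value scales linearly with the size of the search space, so the same hit reports a different e-value whenever the database grows or shrinks.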

In this context, I think percent identity and coverage are the ideal metrics. Not only are they easy to understand, but we can also say with confidence whether there is an exact match to a gene included in the DB in our genome of interest or not (or whether we have some close approximation).

I think to make the e-value and bit score meaningful in this context, we really should have a curated DB of all known bacterial genes annotated with different functions and species of origin, and then we would filter the output according to the function of interest.

dswan commented 4 years ago

@andersgs I don't disagree with any of your points. I don't, however, write the EFSA guidance; I merely abide by it for regulatory approval. The public consultation rounds are an opportunity for scientists to engage with the process of setting European policy. I can comply with the requirements laid out without using Abricate, but many people will turn to the tool to do precisely this.