Closed ivagljiva closed 6 months ago
I am pinging @Kekananen here since this is relevant to a project we are working on :)
@ivagljiva There seems to be a minor issue in the anvio/__init__.py
file where the key wasn't added to the dictionary. I fixed it on my local version by slapping this in:
'include-stray-KOs': (
['include-stray-KOs'],
{'default': False,
'action': 'store_true',
'help': "Attempts to rescue the stray KO"}
),
Could be something on my end with the wonky way I'm running it, but just incase thought I would let you know. Besides that, it's running like a champ! Thanks for getting this implemented in.
Oh, that's weird @Kekananen! I added this parameter to the dictionary very early on and I still see it in the code:
but at any rate, you fixed it on your local in the right way :) I'm glad it's running!
This is a lot of work, Iva. I've been going through it on and off in various sits today. Thank you very much for your diligence in extensive documentation but in anvi'o docs and function headers and for your clean code :) All of this looks very good to me and ready to go!
This PR implements a strategy for annotating KOs that do not come with a pre-defined bit score threshold from KEGG, as well as a strategy for using these annotations for metabolic pathway prediction with
anvi-estimate-metabolism
. It also introduces a new default KEGG data snapshot for anvio-dev.Background on 'stray KOs' and why we want to annotate them
The KOs without bit score thresholds are those that have fewer than 3 non-redundant sequences in the gene family, and hence cannot undergo the rigorous process that KEGG uses to estimate bit score thresholds (as described in Aramaki et al). We've decided to call them "stray KOs" because they live separately in our KEGG data directory from the "regular" KOs, they are a little bit lonely in their very small gene families (often a "family" of 1), their lack of bit score thresholds can lead your analyses astray (see what we did there) with false positive annotations if you are not careful about how you annotate them, and most importantly because we got tired of saying "KOs without bit score thresholds" every time. Henceforth, we'll refer to them as "stray KOs", or simply "strays".
Because stray KOs have no thresholds to help us distinguish between strong and weak hits to their profile HMMs, anvi'o never tried to annotate them before, to avoid the risk of introducing false positive annotations from weak matches. Many of these strays represent functions coming from macroorganisms that anvi'o doesn't really work with, so we lived with this strategy of avoidance for several years. However, many of the strays are coming from microbial organisms, and represent important functions that scientists are interested in finding -- including several pathogenicity factors like toxins (e.g., K1688, K10919, K10961), enzymes for biosynthesis of antibiotics (e.g. K24557, K24558, K14369), and enzymes for diverse metabolic pathways (e.g. K20171, K21168). All of these examples are also coming from pathways in KEGG MODULE, which means that without annotating the stray KOs, those pathways can never be identified as complete with
anvi-estimate-metabolism
, even if the function truly exists in the genome(s) of interest.So clearly, there is some benefit to being able to annotate these gene families. However, it must be done with care. If you simply annotate every match to a stray KO HMM without having some way to filter out the weak hits (as some annotation tools do), you can risk introducing garbage (false positive) annotations.
How can we annotate them? By estimating a threshold for filtering out weak hits
The best way to avoid false positive annotations when working with these stray KOs is to estimate a rather conservative bit score threshold, so that we can annotate them just like we do the "regular" KOs. We investigated a couple of strategies for estimating such a threshold, but in the end decided to go with a very simple algorithm: run
hmmscan
of each stray KO model against a positive set of gene sequences that were used to create the model, and take the minimum bit score of those hits as the threshold. The idea is that if we created the HMM from a set of orthologous sequences, we should at least be able to annotate those sequences with the estimated threshold, and then our heuristic for adding back decent hits can enable the annotation of slightly more distant homologs that the very strictly-defined model doesn't exactly match to.We implemented this strategy as part of
anvi-setup-kegg-data
. However, when we tested it, we found that some of the resulting thresholds were not conservative enough, leading to some garbage annotations (i.e., a bacterial genome was getting annotated with a stray KO built from tomato genes). We realized that this was happening because not all of the gene sequences assigned to a given KO were used to create the KOfam model, because the KEGG ORTHOLOGY and KEGG GENES databases are updated more often than KOfam. So we came up with the following amended strategy for estimating bit score thresholds:muscle
and runhmmbuild
to create a new HMM model that incorporates all of the associated KEGG GENES (these new models are named with "_anvio_version" after the KO accession to distinguish them from the KEGG-provided KOfams)hmmscan
against the set of KEGG GENES used to create itAfter this is done, the thresholds and their associated models (KEGG version or anvi'o version) are written to files in the
orphan_data
folder of the KEGG data directory, where they can be used downstream byanvi-run-kegg-kofams
.This strategy has worked well in our tests -- no garbage annotations are included downstream. :)
New flag:
--include-stray-KOs
We've implemented a new flag for working with the stray KOs, and it works with all three main programs of the metabolism suite. Here is what happens if you use
--include-stray-KOs
with each of these programs:anvi-setup-kegg-data
: the aforementioned strategy is run for each stray KO, to create new HMMs (if necessary) and estimate their bit score thresholds. The data is stored in the KEGG data directory. Most people won't have to run this for themselves, since I intend to include these data in the KEGG snapshots that are downloaded by default usage of this programanvi-run-kegg-kofams
: after annotating the regular KOs, this program uses the stray KO models and their estimated thresholds to include stray KO annotations (if any) in the contigs database. Note that our heuristic for relaxing the bit score threshold applies to stray KOs, too (unless you turn it off explicitly)anvi-estimate-metabolism
: if you use this flag, the annotations from stray KOs will contribute to calculations of pathway completeness and copy number. If an annotation comes from an anvi'o version of the KO model, that will be noted in thewarnings
column of long-format output. However, you won't see any warnings about missing KOfam profiles (since we are now including the ones that were missing).Please keep in mind that the default behavior of these programs (ie, without the flag) remains the same :) No flag, no stray KOs. So people who don't care about these KO families can carry on as they did before.
New flag for
anvi-estimate-metabolism
:--ignore-unknown-KOs
Just because you ran
anvi-run-kegg-kofams
with the--include-stray-KOs
flag doesn't mean you have to use any resulting annotations downstream foranvi-estimate-metabolism
. If you annotate stray KOs, you can leave those annotations out of the metabolism estimates by simply not including the flag. Sometimes (depending on what stray KOs are annotated, if any) this can lead toanvi-estimate-metabolism
not recognizing some KO accessions in the annotations and stop with an error message (because if it isn't told to use stray KOs, it doesn't load their info and a sanity check may be triggered as a result). To get around the error message while still ignoring stray KO annotations, simply add the new flag--ignore-unknown-KOs
.Note that there is no sanity check for doing things the other way around: you can run
anvi-estimate-metabolism
with the--include-stray-KOs
flag even if you haven't annotated the database with stray KOs. More on this in the next section.Caveats and consequences
In tests, we've noticed that generally very few stray KOs are annotated. In my opinion, this is evidence that the estimated thresholds are quite conservative: that is, since the models are built from a very limited set of genes (often specific to a single microbial species), you are most likely to add annotations if your genome of interest includes a gene was was used to create a stray KO model.
One caveat, then, is that if you annotate using these stray KO models and you don't get an annotation to a function of interest, it doesn't necessarily mean that this function is not present -- perhaps we were just too conservative to annotate it. It is difficult to prove absence of function in biology in general, but especially with this conservative strategy. And it's good to keep this in mind for metabolism analysis as well, since that depends on the annotations.
When you look at the metabolism estimation results that were generated with the
--include-stray-KOs
flag, note that any anvi'o versions of KO models will not be indicated in most data columns (especiallyenzyme_hits_in_module
), but instead will be mentioned in the warnings column (ie, "used 'K10919_anvio_version' model to annotate K10919").Also,
anvi-estimate-metabolism
can only take advantage of stray KO annotations if you actually annotated them. It is possible to annotate your database without including stray KOs and later try to runanvi-estimate-metabolism --include-stray-KOs
(there is no sanity check against it), but doing so won't change the metabolism estimation results except to remove the warnings about missing profiles from the warnings column (which could end up being misleading if you've come to expect that an unannotated KO without a warning is truly not present in the genome).The new KEGG snapshot including stray KOs
We generated a new KEGG snapshot,
v2024-03-09
, which is the new default snapshot if you runanvi-setup-kegg-data
without any arguments. It contains KOfam and KEGG MODULES data, but not the reaction network data (an update to this codebase is coming soon, and Sam told me that including this data was not necessary). It also contains the models and estimated thresholds for 1,302 stray KOs.Only a small fraction of stray KOs got new "anvi'o versions" of their HMMs in this snapshot. There are 340 new HMMs, and the remaining 962 stray KOs use their original KEGG KOfam models. The models (both new and original) can be found in the
orphan_data/anvio_hmm_profiles_for_stray_KOs.hmm
file and their estimated bitscore thresholds can be found in theorphan_data/estimated_thresholds_for_stray_kos.txt
file.481 KEGG modules are included in the MODULES.db in this snapshot, which has the hash value
23910d68b4f2
. Of these, 55 modules contain at least one stray KO (only 30 with stray KOs using new 'anvio_version' HMMs) and could therefore be affected by using the--include-stray-KOs
flag.Anyone using anvio-dev who wants to try out this new feature would have to run
anvi-setup-kegg-data
to get this latest snapshot (those who already have an existing KEGG data directory should probably use--kegg-data-dir
to avoid messing up any existing projects that use the previous version of KEGG).A couple of examples using
--include-stray-KOs
Here, I copy the results of some of my tests, to both demonstrate that the new strategy works and to show people how to use it.
I tested this on a reference S. aurantiaca genome [from NCBI(https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_900109545.1/), which I suspected would include functions for the KEGG module M00848 (Aurachin biosynthesis), a module that involves several stray KOs. Here is how I annotated this genome:
And here is the relevant output of
anvi-run-kegg-kofams
, which indicates that 10 stray KOs were annotated in this genome:I then estimated metabolism with and without using the
--include-stray-KOs
flag, and compared the two results:The comparison showed the following:
That is, only two modules had differences in the
anvi-estimate-metabolism
output. The module M00815 was different only in thewarnings
column (it lost warnings about missing KOfam profiles when using--include-stray-KOs
), and as expected, the module M00848 went from a completeness score of ~0.0 to a score of ~0.3 when the stray KO annotations were included.That test didn't involve any stray KOs using new HMMs, so I also tested on a [Vibrio cholerae genome from NCBI](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006745.1/), which should contain M00852, the Vibrio cholerae pathogenicity signature module and a module that involves several stray KOs with anvi'o-generated HMMs.
Indeed, 6 annotations from stray KOs are found by
anvi-run-kegg-kofams
and all of them are annotated with anvi'o versions of the family's HMM:Comparing the metabolism output with and without including the stray KOs like this:
shows that 2 modules have differences only in the
warnings
column -- M00848 (Aurachin biosynthesis) and M00576 (ETEC pathogenicity signature) -- while only M00852 has a difference in completeness score. M00852 goes from a score of 0.64 (without the stray KO annotations) to a score of 1.0 (with strays included).