merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
432 stars 145 forks source link

Enabling annotation of KEGG Orthologs without bit score thresholds ('stray KOs') #2237

Closed ivagljiva closed 6 months ago

ivagljiva commented 6 months ago

This PR implements a strategy for annotating KOs that do not come with a pre-defined bit score threshold from KEGG, as well as a strategy for using these annotations for metabolic pathway prediction with anvi-estimate-metabolism. It also introduces a new default KEGG data snapshot for anvio-dev.

Background on 'stray KOs' and why we want to annotate them

The KOs without bit score thresholds are those that have fewer than 3 non-redundant sequences in the gene family, and hence cannot undergo the rigorous process that KEGG uses to estimate bit score thresholds (as described in Aramaki et al). We've decided to call them "stray KOs" because they live separately in our KEGG data directory from the "regular" KOs, they are a little bit lonely in their very small gene families (often a "family" of 1), their lack of bit score thresholds can lead your analyses astray (see what we did there) with false positive annotations if you are not careful about how you annotate them, and most importantly because we got tired of saying "KOs without bit score thresholds" every time. Henceforth, we'll refer to them as "stray KOs", or simply "strays".

Because stray KOs have no thresholds to help us distinguish between strong and weak hits to their profile HMMs, anvi'o never tried to annotate them before, to avoid the risk of introducing false positive annotations from weak matches. Many of these strays represent functions coming from macroorganisms that anvi'o doesn't really work with, so we lived with this strategy of avoidance for several years. However, many of the strays are coming from microbial organisms, and represent important functions that scientists are interested in finding -- including several pathogenicity factors like toxins (e.g., K1688, K10919, K10961), enzymes for biosynthesis of antibiotics (e.g. K24557, K24558, K14369), and enzymes for diverse metabolic pathways (e.g. K20171, K21168). All of these examples are also coming from pathways in KEGG MODULE, which means that without annotating the stray KOs, those pathways can never be identified as complete with anvi-estimate-metabolism, even if the function truly exists in the genome(s) of interest.

So clearly, there is some benefit to being able to annotate these gene families. However, it must be done with care. If you simply annotate every match to a stray KO HMM without having some way to filter out the weak hits (as some annotation tools do), you can risk introducing garbage (false positive) annotations.

How can we annotate them? By estimating a threshold for filtering out weak hits

The best way to avoid false positive annotations when working with these stray KOs is to estimate a rather conservative bit score threshold, so that we can annotate them just like we do the "regular" KOs. We investigated a couple of strategies for estimating such a threshold, but in the end decided to go with a very simple algorithm: run hmmscan of each stray KO model against a positive set of gene sequences that were used to create the model, and take the minimum bit score of those hits as the threshold. The idea is that if we created the HMM from a set of orthologous sequences, we should at least be able to annotate those sequences with the estimated threshold, and then our heuristic for adding back decent hits can enable the annotation of slightly more distant homologs that the very strictly-defined model doesn't exactly match to.

We implemented this strategy as part of anvi-setup-kegg-data. However, when we tested it, we found that some of the resulting thresholds were not conservative enough, leading to some garbage annotations (i.e., a bacterial genome was getting annotated with a stray KO built from tomato genes). We realized that this was happening because not all of the gene sequences assigned to a given KO were used to create the KOfam model, because the KEGG ORTHOLOGY and KEGG GENES databases are updated more often than KOfam. So we came up with the following amended strategy for estimating bit score thresholds:

  1. obtain the set of KEGG GENE sequences associated with the stray KO
  2. if there is only one sequence in the set, use the KOfam model provided by KEGG (since the model was almost certainly created from that one sequence)
  3. if there is more than one sequence in the set, align them with muscle and run hmmbuild to create a new HMM model that incorporates all of the associated KEGG GENES (these new models are named with "_anvio_version" after the KO accession to distinguish them from the KEGG-provided KOfams)
  4. use the appropriate model (original or anvi'o version) to run hmmscan against the set of KEGG GENES used to create it
  5. take the minimum bit score of those hits as the threshold

After this is done, the thresholds and their associated models (KEGG version or anvi'o version) are written to files in the orphan_data folder of the KEGG data directory, where they can be used downstream by anvi-run-kegg-kofams.

This strategy has worked well in our tests -- no garbage annotations are included downstream. :)

New flag: --include-stray-KOs

We've implemented a new flag for working with the stray KOs, and it works with all three main programs of the metabolism suite. Here is what happens if you use --include-stray-KOs with each of these programs:

Please keep in mind that the default behavior of these programs (ie, without the flag) remains the same :) No flag, no stray KOs. So people who don't care about these KO families can carry on as they did before.

New flag for anvi-estimate-metabolism: --ignore-unknown-KOs

Just because you ran anvi-run-kegg-kofams with the --include-stray-KOs flag doesn't mean you have to use any resulting annotations downstream for anvi-estimate-metabolism. If you annotate stray KOs, you can leave those annotations out of the metabolism estimates by simply not including the flag. Sometimes (depending on what stray KOs are annotated, if any) this can lead to anvi-estimate-metabolism not recognizing some KO accessions in the annotations and stop with an error message (because if it isn't told to use stray KOs, it doesn't load their info and a sanity check may be triggered as a result). To get around the error message while still ignoring stray KO annotations, simply add the new flag --ignore-unknown-KOs.

Note that there is no sanity check for doing things the other way around: you can run anvi-estimate-metabolism with the --include-stray-KOs flag even if you haven't annotated the database with stray KOs. More on this in the next section.

Caveats and consequences

In tests, we've noticed that generally very few stray KOs are annotated. In my opinion, this is evidence that the estimated thresholds are quite conservative: that is, since the models are built from a very limited set of genes (often specific to a single microbial species), you are most likely to add annotations if your genome of interest includes a gene was was used to create a stray KO model.

One caveat, then, is that if you annotate using these stray KO models and you don't get an annotation to a function of interest, it doesn't necessarily mean that this function is not present -- perhaps we were just too conservative to annotate it. It is difficult to prove absence of function in biology in general, but especially with this conservative strategy. And it's good to keep this in mind for metabolism analysis as well, since that depends on the annotations.

When you look at the metabolism estimation results that were generated with the --include-stray-KOs flag, note that any anvi'o versions of KO models will not be indicated in most data columns (especially enzyme_hits_in_module), but instead will be mentioned in the warnings column (ie, "used 'K10919_anvio_version' model to annotate K10919").

Also, anvi-estimate-metabolism can only take advantage of stray KO annotations if you actually annotated them. It is possible to annotate your database without including stray KOs and later try to run anvi-estimate-metabolism --include-stray-KOs (there is no sanity check against it), but doing so won't change the metabolism estimation results except to remove the warnings about missing profiles from the warnings column (which could end up being misleading if you've come to expect that an unannotated KO without a warning is truly not present in the genome).

The new KEGG snapshot including stray KOs

We generated a new KEGG snapshot, v2024-03-09, which is the new default snapshot if you run anvi-setup-kegg-data without any arguments. It contains KOfam and KEGG MODULES data, but not the reaction network data (an update to this codebase is coming soon, and Sam told me that including this data was not necessary). It also contains the models and estimated thresholds for 1,302 stray KOs.

Only a small fraction of stray KOs got new "anvi'o versions" of their HMMs in this snapshot. There are 340 new HMMs, and the remaining 962 stray KOs use their original KEGG KOfam models. The models (both new and original) can be found in the orphan_data/anvio_hmm_profiles_for_stray_KOs.hmm file and their estimated bitscore thresholds can be found in the orphan_data/estimated_thresholds_for_stray_kos.txt file.

481 KEGG modules are included in the MODULES.db in this snapshot, which has the hash value 23910d68b4f2. Of these, 55 modules contain at least one stray KO (only 30 with stray KOs using new 'anvio_version' HMMs) and could therefore be affected by using the --include-stray-KOs flag.

Anyone using anvio-dev who wants to try out this new feature would have to run anvi-setup-kegg-data to get this latest snapshot (those who already have an existing KEGG data directory should probably use --kegg-data-dir to avoid messing up any existing projects that use the previous version of KEGG).

A couple of examples using --include-stray-KOs

Here, I copy the results of some of my tests, to both demonstrate that the new strategy works and to show people how to use it.

I tested this on a reference S. aurantiaca genome [from NCBI(https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_900109545.1/), which I suspected would include functions for the KEGG module M00848 (Aurachin biosynthesis), a module that involves several stray KOs. Here is how I annotated this genome:

anvi-script-reformat-fasta --prefix S_aurantiaca_GCF_900109545 --simplify-names ncbi_dataset/data/GCF_900109545.1/GCF_900109545.1_IMG-taxon_2693429895_annotated_assembly_genomic.fna -o S_aurantiaca_GCF_900109545.fa
anvi-gen-contigs-database -f S_aurantiaca_GCF_900109545.fa -o S_aurantiaca-CONTIGS.db -T 5
anvi-run-kegg-kofams -c S_aurantiaca-CONTIGS.db --kegg-data-dir ~/Lab/test-kegg/KEGG_2024-03-09_23910d68b4f2/ --include-stray-KOs -T 6

And here is the relevant output of anvi-run-kegg-kofams, which indicates that 10 stray KOs were annotated in this genome:

Number of hits in annotation dict  ...........: 11,319
Number of weak hits removed by Stray KO parser : 11,313
Number of annotations added for Stray KO .....: 6
(...)
Number of decent hits added back after relaxing bitscore threshold : 515
... of these, number of regular KOs is .......: 511
... of these, number of stray KOs is .........: 4

I then estimated metabolism with and without using the --include-stray-KOs flag, and compared the two results:

anvi-estimate-metabolism -c S_aurantiaca-CONTIGS.db --kegg-data-dir ~/Lab/test-kegg/KEGG_2024-03-09_23910d68b4f2/ --include-stray-KOs -O test_w_strays
anvi-estimate-metabolism -c S_aurantiaca-CONTIGS.db --kegg-data-dir ~/Lab/test-kegg/KEGG_2024-03-09_23910d68b4f2/ -O test_no_strays --ignore-unknown-KOs
diff test_no_strays_modules.txt test_w_strays_modules.txt

The comparison showed the following:

204,205c204,205
< M00815    S_aurantiaca_GCF_900109545  Validamycin A biosynthesis, sedoheptulopyranose-7P => validamycin A Pathway modules Biosynthesis of other secondary metabolites Biosynthesis of other antibiotics   "K19969 K20431 K20432 K20433 K20434 K20435 K20436 K20437 K20438"    0.6666666666666666  False   0.6666666666666666  False   1.0 K20432,K20433,K20434,K20435,K20436  1,1,1,1,1   K19969,K20432,K20433,K20434,K20435,K20436   97,6676,109,6673,6674,6675  K19969 is present in multiple modules: M00815/M00814,No KOfam profile for K20438
< M00848    S_aurantiaca_GCF_900109545  Aurachin biosynthesis, anthranilate => aurachin A   Pathway modules Biosynthesis of other secondary metabolites Biosynthesis of other antibiotics   "K09460 K02078+K14245+K14246+K22798 K22799 K22800 K21272 K21271 K22801 K22802"  0.0 False   0.03125 False   1.0 K02078  5   K02078,K02078,K02078,K02078,K02078  1269,3947,6168,3521,4402    No KOfam profile for K09460,No KOfam profile for K14245,No KOfam profile for K14246,No KOfam profile for K21271,No KOfam profile for K21272,No KOfam profile for K22798,No KOfam profile for K22799,No KOfam profile for K22800,No KOfam profile for K22801,No KOfam profile for K22802
---
> M00815    S_aurantiaca_GCF_900109545  Validamycin A biosynthesis, sedoheptulopyranose-7P => validamycin A Pathway modules Biosynthesis of other secondary metabolites Biosynthesis of other antibiotics   "K19969 K20431 K20432 K20433 K20434 K20435 K20436 K20437 K20438"    0.6666666666666666  False   0.6666666666666666  False   1.0 K20432,K20433,K20434,K20435,K20436  1,1,1,1,1   K19969,K20432,K20433,K20434,K20435,K20436   97,6676,109,6673,6674,6675  K19969 is present in multiple modules: M00815/M00814
> M00848    S_aurantiaca_GCF_900109545  Aurachin biosynthesis, anthranilate => aurachin A   Pathway modules Biosynthesis of other secondary metabolites Biosynthesis of other antibiotics   "K09460 K02078+K14245+K14246+K22798 K22799 K22800 K21272 K21271 K22801 K22802"  0.25    False   0.34375 False   1.0 K02078,K14246,K22798,K22799,K22801  5,1,1,1,1   K02078,K02078,K02078,K02078,K02078,K14246,K22798,K22799,K22801  1269,3947,6168,3521,4402,3519,3518,3522,5018    None

That is, only two modules had differences in the anvi-estimate-metabolism output. The module M00815 was different only in the warnings column (it lost warnings about missing KOfam profiles when using --include-stray-KOs), and as expected, the module M00848 went from a completeness score of ~0.0 to a score of ~0.3 when the stray KO annotations were included.

That test didn't involve any stray KOs using new HMMs, so I also tested on a [Vibrio cholerae genome from NCBI](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006745.1/), which should contain M00852, the Vibrio cholerae pathogenicity signature module and a module that involves several stray KOs with anvi'o-generated HMMs.

anvi-script-reformat-fasta --prefix V_cholerae_GCF_000006745 --simplify-names --seq-type NT ncbi_dataset/data/GCA_000006745.1/GCA_000006745.1_ASM674v1_genomic.fna -o V_cholerae_GCF_000006745.fa
anvi-gen-contigs-database -f V_cholerae_GCF_000006745.fa -o V_cholerae-CONTIGS.db -T 5
anvi-run-kegg-kofams -c V_cholerae-CONTIGS.db --kegg-data-dir ~/Lab/test-kegg/KEGG_2024-03-09_23910d68b4f2/ --include-stray-KOs -T 6
while read k; do grep $k ~/Lab/test-kegg/KEGG_2024-03-09_23910d68b4f2/orphan_data/estimated_thresholds_for_stray_kos.txt; done < <(sqlite3 V_cholerae-CONTIGS.db "select * from gene_functions where source='KOfam'" | cut -d '|' -f 3)

Indeed, 6 annotations from stray KOs are found by anvi-run-kegg-kofams and all of them are annotated with anvi'o versions of the family's HMM:

K10963_anvio_version    202.3   full    toxin coregulated pilus biosynthesis protein R
K10919_anvio_version    240.8   full    toxin coregulated pilus biosynthesis protein H
K10966_anvio_version    570.3   full    toxin coregulated pilus biosynthesis protein J [EC:3.4.23.43 2.1.1.-]
K10923_anvio_version    408.0   full    AraC family transcriptional regulator, TCP pilus virulence regulatory protein
K10961_anvio_version    1450.1  full    toxin coregulated pilus biosynthesis protein I
K10920_anvio_version    526.9   full    toxin coregulated pilus biosynthesis protein P

Comparing the metabolism output with and without including the stray KOs like this:

anvi-estimate-metabolism -c V_cholerae-CONTIGS.db --kegg-data-dir ~/Lab/test-kegg/KEGG_2024-03-09_23910d68b4f2/ --include-stray-KOs -O test_w_strays
anvi-estimate-metabolism -c V_cholerae-CONTIGS.db --kegg-data-dir ~/Lab/test-kegg/KEGG_2024-03-09_23910d68b4f2/ -O test_no_strays
diff test_no_strays_modules.txt test_w_strays_modules.txt
diff <(cut -f 1-16 test_no_strays_modules.txt ) <(cut -f 1-16 test_w_strays_modules.txt )

shows that 2 modules have differences only in the warnings column -- M00848 (Aurachin biosynthesis) and M00576 (ETEC pathogenicity signature) -- while only M00852 has a difference in completeness score. M00852 goes from a score of 0.64 (without the stray KO annotations) to a score of 1.0 (with strays included).

ivagljiva commented 6 months ago

I am pinging @Kekananen here since this is relevant to a project we are working on :)

Kekananen commented 6 months ago

@ivagljiva There seems to be a minor issue in the anvio/__init__.py file where the key wasn't added to the dictionary. I fixed it on my local version by slapping this in:

    'include-stray-KOs': (
            ['include-stray-KOs'],
            {'default': False,
             'action': 'store_true',
             'help': "Attempts to rescue the stray KO"}
                ),

Could be something on my end with the wonky way I'm running it, but just incase thought I would let you know. Besides that, it's running like a champ! Thanks for getting this implemented in.

ivagljiva commented 6 months ago

Oh, that's weird @Kekananen! I added this parameter to the dictionary very early on and I still see it in the code:

image

but at any rate, you fixed it on your local in the right way :) I'm glad it's running!

meren commented 6 months ago

This is a lot of work, Iva. I've been going through it on and off in various sits today. Thank you very much for your diligence in extensive documentation but in anvi'o docs and function headers and for your clean code :) All of this looks very good to me and ready to go!