[BUG] Genes with multiple annotations overinflate stepwise pathway copy number values

ivagljiva commented 8 months ago

Short description of the problem

The stepwise pathway copy number calculation doesn't take into account when the same gene call has more than one functional annotation, instead counting all annotations separately to get artificially high copy numbers.

anvi'o version

This applies to anvi'o v8 and dev.

Detailed description of the issue

When computing copy numbers for metabolic pathways, if there are multiple enzyme options for a given step and a gene is annotated with more than one of those options, we count the number of unique hits rather than the number of unique gene calls.

This bug is especially relevant to user-defined pathways using functional sources other than KOfams (because most KEGG Orthologs are specific enough that you won't get more than one KOfam annotation on a given gene - there are exceptions to this, but likely not enough to make a huge difference in the copy numbers overall -- therefore, copy numbers of KEGG Modules are likely fine).

To fix this, we need to de-duplicate the gene calls within our set of relevant annotations before doing the copy number calculation.

Note that the pathwise copy number calculation is fine because it considers only one alternative enzyme at a time, so we can't have double counting (the same gene should not be annotated twice with the same enzyme accession).

Files / commands to reproduce the issue

I discovered this when running the following tutorial section: https://anvio.org/tutorials/fmt-mag-metabolism/#user-defined-pathways

You can run this command to get copy numbers for the custom pathway in the A. muciniphila genome:

anvi-estimate-metabolism -c A_muciniphila-CONTIGS.db                          
    -u CUSTOM_PATHWAYS/  \                   
    --only-user-modules  \               
    -O custom  \     
    --output-modes modules,hits,module_steps,module_paths \
    --add-copy-number \

In the custom_module_steps.txt output file, you can see that step ID 3 ((K01190,K12111,COG3250)) has a copy number of 13, but when you look at the 13 hits to the enzymes for that step, you can see that several of the annotations have the same gene caller ID:

for a in K01190 K12111 COG3250; do grep $a custom_hits.txt ; done

enzyme	genome_name	gene_caller_id	contig	modules_with_enzyme	enzyme_definition
K01190	A_muciniphila	871	c_000000000001	MD0001.txt	beta-galactosidase [EC:3.2.1.23]
K01190	A_muciniphila	589	c_000000000001	MD0001.txt	beta-galactosidase [EC:3.2.1.23]
K01190	A_muciniphila	1838	c_000000000001	MD0001.txt	beta-galactosidase [EC:3.2.1.23]
K01190	A_muciniphila	1839	c_000000000001	MD0001.txt	beta-galactosidase [EC:3.2.1.23]
K01190	A_muciniphila	1840	c_000000000001	MD0001.txt	beta-galactosidase [EC:3.2.1.23]
K01190	A_muciniphila	346	c_000000000001	MD0001.txt	beta-galactosidase [EC:3.2.1.23]
COG3250	A_muciniphila	871	c_000000000001	MD0001.txt	Beta-galactosidase/beta-glucuronidase
COG3250	A_muciniphila	589	c_000000000001	MD0001.txt	Beta-galactosidase/beta-glucuronidase
COG3250	A_muciniphila	1838	c_000000000001	MD0001.txt	Beta-galactosidase/beta-glucuronidase
COG3250	A_muciniphila	1839	c_000000000001	MD0001.txt	Beta-galactosidase/beta-glucuronidase
COG3250	A_muciniphila	1840	c_000000000001	MD0001.txt	Beta-galactosidase/beta-glucuronidase
COG3250	A_muciniphila	917	c_000000000001	MD0001.txt	Beta-galactosidase/beta-glucuronidase
COG3250	A_muciniphila	346	c_000000000001	MD0001.txt	Beta-galactosidase/beta-glucuronidase

For instance, gcids 871, 589, 1838, 1839, 1840 and 346 are each annotated with both K01190 and COG3250 because these two enzyme annotations appear to be synonyms. Only one additional gene call is unduplicated: gcid 917 with a hit to only COG3250.

Instead, the calculation should take into account that these gcids are duplicated, and only count each one once. This means the true copy number should be 7; that is, the number of hits to unique gene calls annotated with either K01190, K12111 or COG3250.

ivagljiva commented 8 months ago

I started working on this in branch iss2173, and implemented a very simple fix, but it started getting complicated in the testing phase. I thought I'd record what is going on here :)

What I tried

Previously, we were counting the number of hits to each enzyme accession prior to calculating stepwise copy number, without considering whether any of those hits were to the same gene call. So I added a function that reduces the count for any duplicate annotation on a given gene. It identifies the list of enzyme annotations for each gene call, arbitrarily leaves the first enzyme alone, and removes one hit for each of the remaining enzyme accessions. This means that each gene call with more than one annotation is only counted once.

I tested on the same data in which I found this issue (see #2173 for details), and the new per-step copy numbers are what I expect. However, since the fix is simplistic and relies on the assumption that each annotation to the same gene is equivalent to each other within the metabolic pathway, there are some edge cases that we could be missing.

Edge cases I found

I therefore also tested it on all of the standard KEGG modules to identify whether we get copy number differences for any of those, to see if we can find some of the edge cases. There were 8 modules affected with slightly reduced per-step copy numbers after the fix (but only in one case did that actually contribute to a different overall stepwise copy number, since overall copy is determined by the minimum over all steps). It is rare for multiple KOs to be annotated on the same gene, so I went through the affected modules just to ensure nothing is going wrong. In the process, I found two edge cases as well as a scary bug.

Multiple annotations from enzymes that are not direct alternatives

For module M00023, in the step (((K01657+K01658,K13503,K13501,K01656) K00766),K13497) the enzymes K01658, K13497, and K00766 were all annotated for the same gene call. This is a bit unusual because these are not direct alternatives; in the path K01657+K01658 K00766, both K01658 and K00766 are required, so when the same gene fulfills both, things get a bit weird. Without the fix, we get a count of 2 for this step (1 from the former alternative and 1 from K13497), but with the dereplication, we get a count of 0. However, I would argue that the TRUE copy number of this step is, in fact, 1. We should count EITHER the K13497 hit, OR the K01658 and K00766, not both.

This also happens for module M00024. In the step ((K01850,K04092,K14187,K04093,K04516,K06208,K06209) (K01713,K04518,K05359),K14170) there are two annotations to the same gene, K14170 and K04518. This is again a complicated step, and the simple fix I implemented doesn't work well. K14170 fulfills the step as part of a one-reaction branch while K04518 requires extra help since its in a two-reaction branch of the pathway. We should get a step copy number of 1 because of K14170. Instead, we get a copy number of 0, probably because my fix drops the annotation for K14170 arbitrarily since it goes through K04518 first. This makes it clear that the choice of which enzyme hits to keep makes a difference in the calculation -- arbitrarily picking one doesn't work for indirect alternatives.

The implications of this edge case is that multiple annotations to the same gene can have different priority in complicated steps. The outermost enzymes (ie, K13497 or K14170 above) should be counted first, and then the hit counts to the other alternatives could be reduced. But when the alternatives have the same priority level and are not direct alternatives (as in K01658 and K00766 above), we should not reduce the hit count to either of them.

Multiple annotations from enzymes that contribute to different steps in the pathway

So far we've been looking within a given step. But with module M00846, I found a case where one gene had three annotations - K01719, K13542, and K02303. The first two were alternatives in the same step ((K01719,K13542,K13543)), while the last one was an option in a different step which did not include the other two ((K02302,(K00589,K02303,K02496,K13542,K13543)+(K02304,(K24866+K03794)))). In my test data, that last step had a copy number of 0 because none of the other enzymes were annotated, but IF they were, should we count the hit to K02303 even though it is used earlier?

EDIT: I realized later that K13542 is also in the second step, as a direct alternative to K02303. But I think this edge case still holds and that steps should be dereplicated independently.

If yes, I would have to implement a step-specific counting of enzyme hits (rather than the global counting of all enzymes in the pathway currently done before any copy numbers are calculated).

Note that these three enzymes are direct alternatives in a few of the other modules (M00121 and M00926 include both K01719 and K13542, M00925 includes both K02303 and K13542), but M00846 is the only pathway in which all three are present (and on different steps).

Bonus bug fix

For modules M00009 and M00011 (two versions of the same pathway), there are two alternative enzyme complexes (for succinyl-CoA synthetase) in one step: (K01902+K01903,K01899+K01900,K18118). It seems that a gene can be annotated with both K01900 and K01903 (two different versions of the beta subunit). So it is okay to not double-count those annotations in this case. However, I was a bit concerned because the original output file was erroneously listing the step copy number as 2. The equation for this step's copy number should be min(K01902,K01903) + min(K01899,K01900) + K18118. Of the enzymes in this step, only 3 were annotated in my test genome (with 2 annotations on the same gene), but the calculation min(1,1) + min(0,1) + 0 = 1, so the output was wrong.

Then I realized what was going on. K01899 was not annotated in the test genome, so we have an enzyme count of 0 for it. But when we compute the copy number of the clause K01899+K01900, we go through the following code:

min_step_count = None
                for s in and_splits:
                    s_count = self.get_step_copy_number(s, enzyme_hit_counts)
                    if min_step_count is not None:
                        min_step_count = s_count # make first step the minimum
                    min_step_count = min(min_step_count, s_count)

And since min_step_count = 0 after processing K01899, and the truth value of 0 is False, we set min_step_count to be equal to the hit count of K01900, which is 1 (and then the minimum of (1) is 1, leading us to the false step copy number of 1 + 1 + 0 = 2.

A very insidious bug, which affects any step with alternative enzymes in which the first alternative is unannotated in your data :(

SO. I changed the code to explicitly look for non-None values rather than truth values of False.

if min_step_count is None:
                        min_step_count = s_count # make first step the minimum

This change fixes the calculation, independently of the original issue with multiple annotations (which I checked by going back to the master branch, applying the same change, and testing again). In fact, this is perhaps an even more important bug, as it affects many more pathways. 😅 It was just pure luck that I noticed it while testing other stuff.

ivagljiva commented 8 months ago

I was able to address the second edge case by making the counting of enzyme hits happen on a per-step basis rather than for the module as a whole. When I tested this change, it didn't affect the output, so I think the copy number calculation is still good (which is expected, since it only uses the hit counts for the enzymes in a given step anyway).

But a solution to the first edge case remains elusive. I am now able to detect which enzymes are direct vs indirect alternatives, and the original solution works just find for the direct alternatives, but for the indirect alternatives, I cannot figure out which alternative(s) to keep and which to throw away. Not all edge cases will have an 'outermost' enzyme (or highest priority level enzyme, and it is also complicated to handle the 'AND' cases for enzymes like K01658 and K00766, since they are technically separated by a parentheses in the definition string.

I'm very close to giving up on the first edge case and simply using my original, basic solution.

meren commented 8 months ago

I can see that it would be very difficult to make reliable decisions about the which indirect alternative to keep. I think leaving the basic solution that as is is a good call :/

ivagljiva commented 8 months ago

Okay, I'm glad you agree. It is not ideal, but at least the result for these edge cases is a more conservative copy number rather than an inflated one. After discussions with Alex, I've decided to warn the user about the edge cases so they can adjust copy numbers manually if they need to :)

ivagljiva commented 8 months ago

I tested on the KEGG database with the new warning system in place, and here is an example of what the user sees for these edge cases:

WARNING
===============================================
The gene call 7 has multiple annotations to alternative enzymes within the same
step of a metabolic pathway (K04518, K14170), and these enzymes unfortunately
have a complex relationship. The affected module is M00024, and here is the step
in question: ((K01850,K04092,K14187,K04093,K04516,K06208,K06209)
(K01713,K04518,K05359),K14170). We arbitrarily kept only one of the annotations
to this gene in order to avoid inflating the step's copy number, but due to the
complex relationship between these alternatives, this could mean that the copy
number for this step is actually too low. Please heed this warning and double
check the stepwise copy number results for M00024 and other pathways containing
gene call 7.

I got four of these warnings for the current set of KEGG modules, as expected from the previous tests: M00009 and M00011 (same gene for these two), M00023, and M00024.

merenlab / anvio