merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
413 stars 142 forks source link

[BUG] Incorrect (inflated) stepwise copy number for modules with complex parenthetical clauses #2217

Closed ivagljiva closed 4 months ago

ivagljiva commented 4 months ago

The get_step_copy_number() function doesn't handle complex clauses correctly in certain modules

This issue is too specific to summarize quickly, but TL;DR is basically: we are combining copy numbers of sub-clauses incorrectly (adding them when it should be a more nuanced minimum() operation) for modules that have 3 subclauses including a parenthetical one as the middle clause.

Yeah. It makes certain copy numbers higher than they should be. See detailed description.

anvi'o version

This bug affects anvi'o v8. I will fix it in anvio-dev imminently, but of course it will affect any anvi-estimate-metabolism --add-copy-number output generated by anvio-dev between whenever I created this function (around April 2022) and whenever I close this issue.

Detailed description of the issue

I tested my algorithm on a dataset created by combining ~10 individual genomes into one synthetic metagenome (literally just concatenated fasta files), and using an older KEGG snapshot with modules db hash of 45b7cc2e4fdc. I noticed that for a particular KEGG module, M00083, the estimated stepwise copy number was much higher than it should be. Specifically, the combination of KEGG Ortholog annotations across all 10 individual genomes should have led to a stepwise copy number of 15, but the copy number reported by anvi-estimate-metabolism was 23.

This value of 23 was coming from the function get_step_copy_number() in kegg.py, which is run on the module's definition string and which recursively computes and combines copy numbers of sub-steps within the module to get an overall copy number (following the algorithm described here: https://anvio.org/help/main/programs/anvi-estimate-metabolism/#part-5-step-copy-number).

What is currently happening in the code?

After some debugging with my synthetic dataset, I realized the problem. In get_step_copy_number(), if the input step contains parentheses, what we do is this:

  1. establish a value, step_copy_num to hold the copy number of the step as a whole (which is initialized to 0)
  2. recursively call the function on the string between the outermost parentheses to get its copy number (variable sub_copy_num)
  3. recursively call the function on the string before the outermost parentheses (if any) to get its copy number (variable prev_copy)
  4. combine prev_copy with sub_copy_num according to whether there is a , or a ` before the parenthetical string, and add the resulting value tostep_copy_num(which should be 0 at this point, so effectivelystep_copy_num` will be set to the combined copy number value, if there is one. Otherwise it remains at its initial value of 0)
  5. recursively call the function on the string after the outermost parentheses (if any) to get its copy number (variable post_copy)
  6. combine post_copy with sub_copy_num according to whether there is a , or a ` before the parenthetical string, and add the resulting value tostep_copy_num` (which could either be 0, if there was no string before the parenthetical clause, or the overall copy number of the parenthetical clause and the string before the clause)

Why is it wrong?

The problem is this part:

add the resulting value to step_copy_num

this works for the OR case (commas), since we always add the component copy numbers for that case. However, the AND case has an addition of the min() value, not an overall min() operation. Like this:

if combo_element == ' ' or combo_element == '+': # AND
                    step_copy_num += min(sub_copy_num,post_copy)

I believe that I initially implemented the addition here because we don't want to do an overall min() if the step_copy_number remains initialized to 0 in the case when there is no previous substring in front of the parenthetical clause (since then the overall copy number would incorrectly be 0). However, when that is NOT the case, then step_copy_number represents the combined copy number of the previous clauses in the step definition and we should do an overall minimum instead, like this: step_copy_num = min(step_copy_num,post_copy).

The bug instead incorrectly turns this AND case into an OR case (and one that considers sub_copy_num twice: once within step_copy_num and once within the min() operation)

The example of M00083

For M00083, the calculation with my input data should have looked like this:

K00665,(K00667 K00668),K11533,((K00647,K09458) K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209)) K00665 + (min(K00667, K00668)) + K11533 + (min((K00647 + K09458), K00059 (K02372 + K01716 + K16363), (K00208 + K02371 + K10780 + K00209)) 0 + min(0,0) + 1 + min(9+13, 143, 13+2+3, 8+1+2+3) 1 + min(22, 143, 18, 14) 15

However, when this function was working on the complex clause ((K00647,K09458) K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209)), FIRST it handled the (K00647,K09458) part to get a copy number of 9+13=22 (a correct value).

And later it worked on the subsequent substring K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209), which is an input string that contains parentheses. And therefore it did the following:

How to fix it?

I can fix this bug by replacing the current faulty logic with the following:

  1. we initialize step_copy_num to None instead of 0
  2. we recursively obtain the copy number of the parenthetical clause
  3. if there is a previous clause, we recursively obtain prev_copy as before
  4. but now we DON'T add to step_copy_num (which should be None at this point anyway), instead we set step_copy_num to be the combined value of prev_copy and sub_copy_num (whether that is an addition or a min() value, it doesn't matter).
  5. if there is a subsequent clause, we recursively obtain post_copy as before
  6. now we split the logic into 2 x 2 cases.
    • IF there was a previous clause (step_copy_num is not None and represents the combined copy number of the previous two substrings), we can either:
    • add post_copy to that value (OR case)
    • take the overall minimum of that value and post_copy (AND case)
    • ELSE (step_copy_num is None), we can either:
    • we set step_copy_num equal to the added combo of sub_copy_num and post_copy (OR case)
    • we set step_copy_num equal to the min() of sub_copy_num and post_copy (AND case)

I will try that now, and re-run my tests to see if it fixes the problem.

What are the consequences of the bug?

This bug incorrectly inflates stepwise copy number. It should only affect the calculation of stepwise copy number for modules that have complex clauses involving all three of the following: a parenthetical clause, a 'pre' string before that clause, and a 'post' string before that clause. In Module M00083, that happens within the step ((K00647,K09458) K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209)).

I don't know yet how many other modules like this that could be affected, but I will run some comparisons after I fix the bug and include this information in the eventual PR.

ivagljiva commented 4 months ago

Now fixed, see PR :)