The get_step_copy_number() function doesn't handle complex clauses correctly in certain modules
This issue is too specific to summarize quickly, but TL;DR is basically: we are combining copy numbers of sub-clauses incorrectly (adding them when it should be a more nuanced minimum() operation) for modules that have 3 subclauses including a parenthetical one as the middle clause.
Yeah. It makes certain copy numbers higher than they should be. See detailed description.
anvi'o version
This bug affects anvi'o v8. I will fix it in anvio-dev imminently, but of course it will affect any anvi-estimate-metabolism --add-copy-number output generated by anvio-dev between whenever I created this function (around April 2022) and whenever I close this issue.
Detailed description of the issue
I tested my algorithm on a dataset created by combining ~10 individual genomes into one synthetic metagenome (literally just concatenated fasta files), and using an older KEGG snapshot with modules db hash of 45b7cc2e4fdc. I noticed that for a particular KEGG module, M00083, the estimated stepwise copy number was much higher than it should be. Specifically, the combination of KEGG Ortholog annotations across all 10 individual genomes should have led to a stepwise copy number of 15, but the copy number reported by anvi-estimate-metabolism was 23.
This value of 23 was coming from the function get_step_copy_number() in kegg.py, which is run on the module's definition string and which recursively computes and combines copy numbers of sub-steps within the module to get an overall copy number (following the algorithm described here: https://anvio.org/help/main/programs/anvi-estimate-metabolism/#part-5-step-copy-number).
What is currently happening in the code?
After some debugging with my synthetic dataset, I realized the problem. In get_step_copy_number(), if the input step contains parentheses, what we do is this:
establish a value, step_copy_num to hold the copy number of the step as a whole (which is initialized to 0)
recursively call the function on the string between the outermost parentheses to get its copy number (variable sub_copy_num)
recursively call the function on the string before the outermost parentheses (if any) to get its copy number (variable prev_copy)
combine prev_copy with sub_copy_num according to whether there is a , or a ` before the parenthetical string, and add the resulting value tostep_copy_num(which should be 0 at this point, so effectivelystep_copy_num` will be set to the combined copy number value, if there is one. Otherwise it remains at its initial value of 0)
recursively call the function on the string after the outermost parentheses (if any) to get its copy number (variable post_copy)
combine post_copy with sub_copy_num according to whether there is a , or a ` before the parenthetical string, and add the resulting value tostep_copy_num` (which could either be 0, if there was no string before the parenthetical clause, or the overall copy number of the parenthetical clause and the string before the clause)
Why is it wrong?
The problem is this part:
add the resulting value to step_copy_num
this works for the OR case (commas), since we always add the component copy numbers for that case. However, the AND case has an addition of the min() value, not an overall min() operation. Like this:
if combo_element == ' ' or combo_element == '+': # AND
step_copy_num += min(sub_copy_num,post_copy)
I believe that I initially implemented the addition here because we don't want to do an overall min() if the step_copy_number remains initialized to 0 in the case when there is no previous substring in front of the parenthetical clause (since then the overall copy number would incorrectly be 0). However, when that is NOT the case, then step_copy_number represents the combined copy number of the previous clauses in the step definition and we should do an overall minimum instead, like this: step_copy_num = min(step_copy_num,post_copy).
The bug instead incorrectly turns this AND case into an OR case (and one that considers sub_copy_num twice: once within step_copy_num and once within the min() operation)
The example of M00083
For M00083, the calculation with my input data should have looked like this:
However, when this function was working on the complex clause ((K00647,K09458) K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209)), FIRST it handled the (K00647,K09458) part to get a copy number of 9+13=22 (a correct value).
And later it worked on the subsequent substring K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209), which is an input string that contains parentheses. And therefore it did the following:
it first estimated on the parenthetical clause (K02372,K01716,K16363) (for which it correctly got a copy number of 18)
and then it estimated on the previous substring K00059 (for which it correctly got a copy number of 143)
and then it combined the two using a min() operation, since there is a space between the two clauses, to get a value of 18, which was added to the initial 0 of step_copy_num to make that variable hold the value 18
and then it estimated on the substring after the parenthetical clause, (K00208,K02371,K10780,K00209), for which it correctly got a copy number of 14
(this is where things break down) it combined the two using a min() operation to get 14, but then added that to the current value of step_copy_num to make that variable hold the value of 32
downstream, combining this copy number with the first parenthetical clause of (K00647,K09458) lead to a min(22,32) = 22, which is how the overall copy number ended up being 1 + 22 = 23
How to fix it?
I can fix this bug by replacing the current faulty logic with the following:
we initialize step_copy_num to None instead of 0
we recursively obtain the copy number of the parenthetical clause
if there is a previous clause, we recursively obtain prev_copy as before
but now we DON'T add to step_copy_num (which should be None at this point anyway), instead we set step_copy_num to be the combined value of prev_copy and sub_copy_num (whether that is an addition or a min() value, it doesn't matter).
if there is a subsequent clause, we recursively obtain post_copy as before
now we split the logic into 2 x 2 cases.
IF there was a previous clause (step_copy_num is not None and represents the combined copy number of the previous two substrings), we can either:
add post_copy to that value (OR case)
take the overall minimum of that value and post_copy (AND case)
ELSE (step_copy_num is None), we can either:
we set step_copy_num equal to the added combo of sub_copy_num and post_copy (OR case)
we set step_copy_num equal to the min() of sub_copy_num and post_copy (AND case)
I will try that now, and re-run my tests to see if it fixes the problem.
What are the consequences of the bug?
This bug incorrectly inflates stepwise copy number. It should only affect the calculation of stepwise copy number for modules that have complex clauses involving all three of the following: a parenthetical clause, a 'pre' string before that clause, and a 'post' string before that clause. In Module M00083, that happens within the step ((K00647,K09458) K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209)).
I don't know yet how many other modules like this that could be affected, but I will run some comparisons after I fix the bug and include this information in the eventual PR.
The
get_step_copy_number()
function doesn't handle complex clauses correctly in certain modulesThis issue is too specific to summarize quickly, but TL;DR is basically: we are combining copy numbers of sub-clauses incorrectly (adding them when it should be a more nuanced minimum() operation) for modules that have 3 subclauses including a parenthetical one as the middle clause.
Yeah. It makes certain copy numbers higher than they should be. See detailed description.
anvi'o version
This bug affects anvi'o
v8
. I will fix it inanvio-dev
imminently, but of course it will affect anyanvi-estimate-metabolism --add-copy-number
output generated byanvio-dev
between whenever I created this function (around April 2022) and whenever I close this issue.Detailed description of the issue
I tested my algorithm on a dataset created by combining ~10 individual genomes into one synthetic metagenome (literally just concatenated fasta files), and using an older KEGG snapshot with modules db hash of
45b7cc2e4fdc
. I noticed that for a particular KEGG module, M00083, the estimated stepwise copy number was much higher than it should be. Specifically, the combination of KEGG Ortholog annotations across all 10 individual genomes should have led to a stepwise copy number of 15, but the copy number reported byanvi-estimate-metabolism
was 23.This value of 23 was coming from the function
get_step_copy_number()
inkegg.py
, which is run on the module's definition string and which recursively computes and combines copy numbers of sub-steps within the module to get an overall copy number (following the algorithm described here: https://anvio.org/help/main/programs/anvi-estimate-metabolism/#part-5-step-copy-number).What is currently happening in the code?
After some debugging with my synthetic dataset, I realized the problem. In
get_step_copy_number()
, if the input step contains parentheses, what we do is this:step_copy_num
to hold the copy number of the step as a whole (which is initialized to 0)sub_copy_num
)prev_copy
)prev_copy
withsub_copy_num
according to whether there is a,
or a` before the parenthetical string, and add the resulting value to
step_copy_num(which should be 0 at this point, so effectively
step_copy_num` will be set to the combined copy number value, if there is one. Otherwise it remains at its initial value of 0)post_copy
)post_copy
withsub_copy_num
according to whether there is a,
or a` before the parenthetical string, and add the resulting value to
step_copy_num` (which could either be 0, if there was no string before the parenthetical clause, or the overall copy number of the parenthetical clause and the string before the clause)Why is it wrong?
The problem is this part:
this works for the OR case (commas), since we always add the component copy numbers for that case. However, the AND case has an addition of the min() value, not an overall min() operation. Like this:
I believe that I initially implemented the addition here because we don't want to do an overall min() if the step_copy_number remains initialized to 0 in the case when there is no previous substring in front of the parenthetical clause (since then the overall copy number would incorrectly be 0). However, when that is NOT the case, then
step_copy_number
represents the combined copy number of the previous clauses in the step definition and we should do an overall minimum instead, like this:step_copy_num = min(step_copy_num,post_copy)
.The bug instead incorrectly turns this AND case into an OR case (and one that considers
sub_copy_num
twice: once withinstep_copy_num
and once within themin()
operation)The example of M00083
For M00083, the calculation with my input data should have looked like this:
However, when this function was working on the complex clause
((K00647,K09458) K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209))
, FIRST it handled the(K00647,K09458)
part to get a copy number of 9+13=22 (a correct value).And later it worked on the subsequent substring
K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209)
, which is an input string that contains parentheses. And therefore it did the following:(K02372,K01716,K16363)
(for which it correctly got a copy number of 18)K00059
(for which it correctly got a copy number of 143)step_copy_num
to make that variable hold the value 18(K00208,K02371,K10780,K00209)
, for which it correctly got a copy number of 14step_copy_num
to make that variable hold the value of 32(K00647,K09458)
lead to a min(22,32) = 22, which is how the overall copy number ended up being 1 + 22 = 23How to fix it?
I can fix this bug by replacing the current faulty logic with the following:
step_copy_num
toNone
instead of 0prev_copy
as beforestep_copy_num
(which should be None at this point anyway), instead we setstep_copy_num
to be the combined value ofprev_copy
andsub_copy_num
(whether that is an addition or a min() value, it doesn't matter).post_copy
as beforestep_copy_num
is notNone
and represents the combined copy number of the previous two substrings), we can either:post_copy
to that value (OR case)post_copy
(AND case)step_copy_num
isNone
), we can either:step_copy_num
equal to the added combo ofsub_copy_num
andpost_copy
(OR case)step_copy_num
equal to the min() ofsub_copy_num
andpost_copy
(AND case)I will try that now, and re-run my tests to see if it fixes the problem.
What are the consequences of the bug?
This bug incorrectly inflates stepwise copy number. It should only affect the calculation of stepwise copy number for modules that have complex clauses involving all three of the following: a parenthetical clause, a 'pre' string before that clause, and a 'post' string before that clause. In Module M00083, that happens within the step
((K00647,K09458) K00059 (K02372,K01716,K16363) (K00208,K02371,K10780,K00209))
.I don't know yet how many other modules like this that could be affected, but I will run some comparisons after I fix the bug and include this information in the eventual PR.