In this PR, I implemented the fix described for the bug in issue #2217. I ran some tests to verify that it worked.
In a test of my initial synthetic metagenome dataset, I compared the pre-fix anvi-estimate-metabolism output to the post-fix output. That is, I ran the following:
And used diff to compare the outputs. Besides the warnings column (which includes warnings in random order and is therefore different across different runs of the program even with the same input data), the only columns with different data between the two outputs are the stepwise_copy_number (column 19) and per_step_copy_numbers:
And it seems clear that this fits into the expected case of a module that would be affected by this bug, since it has not one but TWO instances of interior parenthetical clauses preceded and followed by a substep (one is the substring K00330+(K00331+K00332+K00333,K00331+K13378,K13380)+K00334, and the other is the substring K00340+(K00341+K00342)+K00343. I am guessing this led to the extreme inflation of the copy number from 7 to 24.
Module M00149 also fits into this case:
K00241+(K00242,K18859,K18860)+K00239+K00240
I double checked the computation for M00149 manually. There were 17 hits to K00239, 19 hits to K00240, 12 to K00241, and 6 to K00242. This leads to a computation like this:
I happened to have a bunch of other synthetic metagenomes on hand, and I ran the same test for all of them to see how many modules had changed copy numbers post-fix:
# in branch `master`
for db in 02_CONTIGS/*.db; do \
file=$(basename $db);
name=${file%--contigs.db};
anvi-estimate-metabolism -c $db -O PRE_FIX_${name} --kegg-data-dir ~/Lab/test-kegg/KEGG_2020-12-23_45b7cc2e4fdc/ --add-copy-number; \
done
# in branch `iss2217` (this branch)
for db in 02_CONTIGS/*.db; do \
file=$(basename $db);
name=${file%--contigs.db};
anvi-estimate-metabolism -c $db -O POST_FIX_${name} --kegg-data-dir ~/Lab/test-kegg/KEGG_2020-12-23_45b7cc2e4fdc/ --add-copy-number; \
done
for db in 02_CONTIGS/*.db; do \
file=$(basename $db);
name=${file%--contigs.db};
diff <(cut -f 1,19- PRE_FIX_${name}_modules.txt ) <(cut -f 1,19- POST_FIX_${name}_modules.txt ) | grep -c '>'; \
diff <(cut -f 1,19- PRE_FIX_${name}_modules.txt ) <(cut -f 1,19- POST_FIX_${name}_modules.txt ) | grep '>' | cut -d ' ' -f 2 | cut -f 1 >> modules_affected.txt; \
done
sort -u modules_affected.txt
I could see that in each metagenome, between 1 and 3 modules were affected by the bug, and the output of the sort -u command shows that these include only M00083, M00144, and M00149 (as we saw in the initial test).
So the fix seems to be working, and moreover only 3 modules were affected by the bug in my synthetic dataset (which was made up of fake metagenomes generated from random genomes in GTDB) :)
In this PR, I implemented the fix described for the bug in issue #2217. I ran some tests to verify that it worked.
In a test of my initial synthetic metagenome dataset, I compared the pre-fix
anvi-estimate-metabolism
output to the post-fix output. That is, I ran the following:And used
diff
to compare the outputs. Besides the warnings column (which includes warnings in random order and is therefore different across different runs of the program even with the same input data), the only columns with different data between the two outputs are thestepwise_copy_number
(column 19) andper_step_copy_numbers
:The comparison shows that only 3 modules have different copy number after the fix:
This includes M00083, which (as discussed in the associated issue), now has the expected copy number of 15. :)
The module M00144 has the following definition:
And it seems clear that this fits into the expected case of a module that would be affected by this bug, since it has not one but TWO instances of interior parenthetical clauses preceded and followed by a substep (one is the substring
K00330+(K00331+K00332+K00333,K00331+K13378,K13380)+K00334
, and the other is the substringK00340+(K00341+K00342)+K00343
. I am guessing this led to the extreme inflation of the copy number from 7 to 24.Module M00149 also fits into this case:
I double checked the computation for M00149 manually. There were 17 hits to K00239, 19 hits to K00240, 12 to K00241, and 6 to K00242. This leads to a computation like this:
And the post-fix output is correct! 🎉
Larger test
I happened to have a bunch of other synthetic metagenomes on hand, and I ran the same test for all of them to see how many modules had changed copy numbers post-fix:
I could see that in each metagenome, between 1 and 3 modules were affected by the bug, and the output of the
sort -u
command shows that these include only M00083, M00144, and M00149 (as we saw in the initial test).So the fix seems to be working, and moreover only 3 modules were affected by the bug in my synthetic dataset (which was made up of fake metagenomes generated from random genomes in GTDB) :)