Fix bug in stepwise copy number calculation for issue #2217

In this PR, I implemented the fix described for the bug in issue #2217. I ran some tests to verify that it worked.

In a test of my initial synthetic metagenome dataset, I compared the pre-fix anvi-estimate-metabolism output to the post-fix output. That is, I ran the following:

# in branch `master`
anvi-estimate-metabolism -c 02_CONTIGS/SYNTH_META_0014-contigs.db -O PRE_FIX --kegg-data-dir ~/Lab/test-kegg/KEGG_2020-12-23_45b7cc2e4fdc/ --add-copy-number
# in branch `iss2217` (this branch)
anvi-estimate-metabolism -c 02_CONTIGS/SYNTH_META_0014-contigs.db -O POST_FIX --kegg-data-dir ~/Lab/test-kegg/KEGG_2020-12-23_45b7cc2e4fdc/ --add-copy-number

And used diff to compare the outputs. Besides the warnings column (which includes warnings in random order and is therefore different across different runs of the program even with the same input data), the only columns with different data between the two outputs are the stepwise_copy_number (column 19) and per_step_copy_numbers:

diff <(cut -f 1,19- PRE_FIX_modules.txt ) <(cut -f 1,19- POST_FIX_modules.txt )

The comparison shows that only 3 modules have different copy number after the fix:

75c75
< M00144    24  24
---
> M00144    7   7
77c77
< M00149    12  12
---
> M00149    6   6
90c90
< M00083    23  23
---
> M00083    15  15

This includes M00083, which (as discussed in the associated issue), now has the expected copy number of 15. :)

The module M00144 has the following definition:

K00330+(K00331+K00332+K00333,K00331+K13378,K13380)+K00334+K00335+K00336+K00337+K00338+K00339+K00340+(K00341+K00342)+K00343

And it seems clear that this fits into the expected case of a module that would be affected by this bug, since it has not one but TWO instances of interior parenthetical clauses preceded and followed by a substep (one is the substring K00330+(K00331+K00332+K00333,K00331+K13378,K13380)+K00334, and the other is the substring K00340+(K00341+K00342)+K00343. I am guessing this led to the extreme inflation of the copy number from 7 to 24.

Module M00149 also fits into this case:

K00241+(K00242,K18859,K18860)+K00239+K00240

I double checked the computation for M00149 manually. There were 17 hits to K00239, 19 hits to K00240, 12 to K00241, and 6 to K00242. This leads to a computation like this:

min(K00241, (K00242 + K18859 + K18860), K00239, K00240) min(12, (6+0+0), 17, 19) min(12,6,17,19) 6

And the post-fix output is correct! 🎉

Larger test

I happened to have a bunch of other synthetic metagenomes on hand, and I ran the same test for all of them to see how many modules had changed copy numbers post-fix:

# in branch `master`
for db in 02_CONTIGS/*.db; do \
  file=$(basename $db);
  name=${file%--contigs.db}; 
  anvi-estimate-metabolism -c $db -O PRE_FIX_${name} --kegg-data-dir ~/Lab/test-kegg/KEGG_2020-12-23_45b7cc2e4fdc/ --add-copy-number; \
done
# in branch `iss2217` (this branch)
for db in 02_CONTIGS/*.db; do \
  file=$(basename $db);
  name=${file%--contigs.db}; 
  anvi-estimate-metabolism -c $db -O POST_FIX_${name} --kegg-data-dir ~/Lab/test-kegg/KEGG_2020-12-23_45b7cc2e4fdc/ --add-copy-number; \
done

for db in 02_CONTIGS/*.db; do \
  file=$(basename $db);
  name=${file%--contigs.db}; 
  diff <(cut -f 1,19- PRE_FIX_${name}_modules.txt ) <(cut -f 1,19- POST_FIX_${name}_modules.txt ) | grep -c '>'; \
  diff <(cut -f 1,19- PRE_FIX_${name}_modules.txt ) <(cut -f 1,19- POST_FIX_${name}_modules.txt ) | grep '>' | cut -d ' ' -f 2 | cut -f 1 >> modules_affected.txt; \
done
sort -u modules_affected.txt

I could see that in each metagenome, between 1 and 3 modules were affected by the bug, and the output of the sort -u command shows that these include only M00083, M00144, and M00149 (as we saw in the initial test).

So the fix seems to be working, and moreover only 3 modules were affected by the bug in my synthetic dataset (which was made up of fake metagenomes generated from random genomes in GTDB) :)

merenlab / anvio

Fix bug in stepwise copy number calculation for issue #2217 #2218

Larger test