Open dvs opened 5 years ago
Thank you for reporting this detailed analysis of an issue. One possible fix would be to adjust the masking code to now allow for the corner extension and limit the band only around the viterbi alignment. Disadvantage is that an alignment can then not be global anymore. The best solution would be a banded mact that starts at some reliable point of the viterbi alignment. However we do not have the resources to develop something like this now.
One possible fix would be to adjust the masking code to now allow for the corner extension and limit the band only around the viterbi alignment.
You mean "to not allow for the corner extension" ? This would make new version incompatible with old results produced when extensions were allowed (could be considered as a regression). I also thought there is already a band around Viterbi alignment (FWD_BKW_PATHWITDH=40; // cell off path width around viterbi alignment).
The best solution would be a banded mact that starts at some reliable point of the viterbi alignment. However we do not have the resources to develop something like this now.
Perhaps, it's not difficult to detect such cases and fall back to Viterbi alignment with appropriate warning message? I mean that the partial hit loss can be detected by significant decrease of number of matched columns and wrong hit location can be detected even more easily. I think I can test this approach and make a pull request.
Also, User Guide What does the maximum accuracy alignment algorithm do? section warns about possible discrepancies between scores and alignments of MAC hits, do you agree that it would be useful if it warned about possible issues with the current MAC algorithm as well ?
I've submitted a pull request for the changes that address this issue. Note also that the described MAC hit split disappears when -mact 0.33 is used (instead of the default -mact 0.35). The proposed code improvement in this case restores Viterbi hit and also issues a warning that suggests user to lower -mact option value in order to avoid MAC hit degradation.
Expected Behavior
MAC realignment algorithm is enabled by default in hhsearch/hhblits/hhalign and it's expected to improve the quality of hits' alignments.
Current Behavior
In some cases due to MAC realignment step a significant part of the local hit is lost while the probability/p-value/e-value/scores are reported as if the whole hit were preserved. Moreover, the lost part of the hit might be then wrongly reported in some alternative hit slot with unrelated probability/p-value/e-value/scores.
Steps to Reproduce
See the test example in the attached ZIP archive where YP_005352862 protein query is searched in PF09848.8 template with request for no more than five alternative hits ("-alt 5" option). There are two HHalign output files for default mode (MAC is used, .hhr file) and "-norealign" mode (w/o MAC, -nr.hhr file).
MAC-report.zip
When MAC algorithm is disabled, i.e. "-norealign" option is used, the strongest hit covers all the template (1-359) with 259 matching columns and significant probability/score (99.1% and 152.1, respectively).
In the default mode, when MAC is enabled, the strongest hit is attributed with the same probability/score but it covers only part of the template: (2-163), 148 matching columns. Some fragment of the lost part of the hit is listed below as an alternative hit, see the hit #4: (261-357), 50 matching columns. However, the probability/score of the hit #4 are wrong, they correspond to the alternative hit #5 of the report where MAC is not used. It appears to be some kind of mix-up, see the discussion below.
Discussion
When MAC algorithm is enabled PosteriorDecoder::realign() method is called for each Viterbi local hit. This procedure performs the following steps for improving the hit's alignment:
It's implicitly assumed that the MAC local hit produced by macAlgorithm() and backtraceMAC() corresponds to the initial Viterbi hit and covers the same query/template region. However it's not necessarily true, as it follows from the test example, one Viterbi hit could be split into several local MAC hits. If this happens only the coordinates of the strongest MAC hit are reported by macAlgorithm().
Also during the work of MAC algorithm viterbi_matrix is masked in a special way which disables search in the regions that are in conflict with the initial Viterbi hit (and previously generated MAC hits) or diverge too far from the Viterbi hit path. But the mask allows the extension of the Viterbi hit: see PosteriorDecoder::maskViterbiAlignment() method where upper-left and down-right corners are not masked. This feature allows the algorithm to produce MAC hit which is completely outside of the Viterbi hit (i.e. lying both upstream or downstream in query/template space). It happened in the test example where for the hit #4 the second part of the strongest hit (261-357) was reported instead of the weak hit (3-47) that should have been reported. Note that even though this part was eventually reported it happened accidentally, its probability/scores are wrong, and unless "-smin 0" option were used this hit wouldn't be listed.
Unfortunately, there is no apparent way to fix these issues in the code. The second issue (MAC hit outside Viterbi hit's scope) can be easily detected in backtraceMAC() but at this point only the strongest hit is known, because it's the only output from macAlgorithm(). Another issue, MAC hit split, would also need careful consideration of several local MAC hits that might comprise one hit (but also could be in conflict with each other).