soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
515 stars 128 forks source link

Incosistent behavior of `hhconsensus -M X`? #298

Open alephreish opened 2 years ago

alephreish commented 2 years ago

Expected Behavior

hhconsensus seems to be the recommended first step in converting MSA in fasta to a3m that allows to skip reformat.pl and thus would be expected to generate identical output to a workflow that has reformat.pl as a first step.

Current Behavior

hhconsensus -M X generates a3m's that are different from those generated for alignments first converted to a3m with reformat.pl -M X

Steps to Reproduce (for bugs)

Not a minimal example, just an ad hoc alignment of 383 sequences: seq.txt derived from this dataset.

reformat.pl fas a3m seq.txt /dev/stdout -M 50 -v 0 | hhconsensus -maxres 65535 -i stdin -o reformat_hhcons.a3m
# - XX:XX:XX.XXX INFO: Reading HMM / multiple alignment from standard input ...
hhconsensus -maxres 65535 -i seq.txt -o hhcons.a3m -M 50
# - XX:XX:XX.XXX INFO: seq.txt is in A2M, A3M or FASTA format

The a3m file generated by hhconsensus differs from the one that was created by the reformat.pl | hhconsensus pipeline in the distribution of the gaps and consequently in the sequence of the consensus. Most noticeably, hhcons.a3m has a single sequence in the whole alignment that doesn't start with a gap despite the -M 50 argument:

# skip header and consensus sequence, discard deflines and count sequences not starting with a gap
tail -n+4 reformat_hhcons.a3m | grep -v '^>' | grep -vc ^-`
# 305
tail -n+4 hhcons.a3m | grep -v '^>' | grep -vc ^-
# 1

HH-suite Output (for bugs)

No errors, only the above messages

Context

This seems to be a bug, but I might be missing something. This is certainly not an intuitive behavior.

Your Environment

$ hhconsensus
HHconsensus 3.3.0