soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

Running hhblits with the same inputs twice yields different a3m alignments #299

Open tanhevg opened 2 years ago

tanhevg commented 2 years ago

Expected Behavior

I expect hhblits to produce the same alignments when it is run on the same input multiple times.

Current Behavior

With default settings, the alignments are identical. But as soon as the defaults are changed, even in a way that should make no difference, such as adding -e 0.001, the results of two runs begin to diverge. The more non-default options are added, the more the results diverge (see the alphafold example in the archive). Reading the input sequence from a file, rather than from standard input as in the steps below, may change the behaviour slightly, but the fundamental problem remains.

Steps to Reproduce (for bugs)

All the scripts and input files are in the attached archive. First, run hhblits with default parameters twice:

cat test.a3m | hhblits -i stdin -oa3m test1.a3m -o test1.hhr -cpu 8 -d /data/uniclust/uniclust30_2018_08/uniclust30_2018_08 > test1.out 2>test1.err
cat test.a3m | hhblits -i stdin -oa3m test2.a3m -o test2.hhr -cpu 8 -d /data/uniclust/uniclust30_2018_08/uniclust30_2018_08 > test2.out 2>test2.err

Observe that test1.a3m and test2.a3m contain an identical set of sequences (the diff is empty):

grep '^>' test1.a3m | awk -F '|' '{print $2}' | sort -u > test1_a3m_list.txt
grep '^>' test2.a3m | awk -F '|' '{print $2}' | sort -u > test2_a3m_list.txt
diff -q test1_a3m_list.txt test2_a3m_list.txt

Now, modify the hhblits command line slightly:

cat test.a3m | hhblits -i stdin -oa3m test1.a3m -o test1.hhr -cpu 8 -e 0.001 -d /data/uniclust/uniclust30_2018_08/uniclust30_2018_08 > test1.out 2>test1.err
cat test.a3m | hhblits -i stdin -oa3m test2.a3m -o test2.hhr -cpu 8 -e 0.001 -d /data/uniclust/uniclust30_2018_08/uniclust30_2018_08 > test2.out 2>test2.err

The same script for validating the results now shows a non-empty diff.
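
To see which sequences actually differ, rather than only whether a difference exists, the validation steps can be wrapped in a small helper. This is just a sketch: the function names a3m_ids and a3m_id_diff are made up for illustration, and it assumes the headers follow the ">db|ID|name" layout that the awk command above relies on.

```shell
#!/bin/sh
# Hypothetical helpers, not part of HH-suite.

# Print the sorted, unique sequence IDs of an a3m file.
# Assumes headers look like ">db|ID|name" (second '|'-separated field is the ID).
a3m_ids() {
    grep '^>' "$1" | awk -F '|' '{print $2}' | LC_ALL=C sort -u
}

# Compare the ID sets of two a3m files.
# Prints IDs only in the first file as "< ID" and only in the second as "> ID".
# Returns 0 if the sets match, 1 if they differ, 2 if a temp file cannot be created.
a3m_id_diff() {
    ids_a=$(mktemp) || return 2
    ids_b=$(mktemp) || { rm -f "$ids_a"; return 2; }
    a3m_ids "$1" > "$ids_a"
    a3m_ids "$2" > "$ids_b"
    diff "$ids_a" "$ids_b"
    status=$?
    rm -f "$ids_a" "$ids_b"
    return "$status"
}
```

After the two runs, `a3m_id_diff test1.a3m test2.a3m` would list the IDs that appear in only one of the alignments, and its exit status can be used in a script to flag non-reproducible runs.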

HH-suite Output (for bugs)

All standard output and error streams are included in the archive.

Context

I was actually playing with alphafold, and was curious whether the same features are used every time. Running hhblits twice on the same input with the alphafold settings yields results that differ even more.

There was a similar issue filed a while ago, #198, but there has been no activity for some time, so I decided to raise another one.

Your Environment

KK666-AI commented 2 years ago

How can the results of hhblits be made reproducible? Is there any solution?