soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0

hhalignment: did not find [x] match states in sequence [i] #277

Open johnlees opened 2 years ago

johnlees commented 2 years ago


I am trying to run alphafold (using the docker image in the current readme), but an error is encountered when running the HHblits step. Apologies if this is actually an alphafold issue – do let me know and I will close this issue and post there.

I am not sure what this error message means/what it implies – if you could help me understand further I would be happy to look into this in more detail if I can.

Expected Behavior

hhblits completes alignment

Current Behavior

Error hit at hh-suite/src/hhalignment.cpp:3539

The error lines are (full output in gist below):

I0719 14:43:07.016577 139929070653184 run_docker.py:180] E0719 13:43:07.009330 139663816972096 hhblits.py:141] - 13:43:03.185 ERROR: Error in /tmp/hh-suite/src/hhalignment.cpp:3539: MergeMasterSlave:
I0719 14:43:07.016653 139929070653184 run_docker.py:180] E0719 13:43:07.009389 139663816972096 hhblits.py:141] - 13:43:03.185 ERROR:    did not find 372 match states in sequence 1 of ERR550514_1169578. Sequence:
I0719 14:43:07.016733 139929070653184 run_docker.py:180] E0719 13:43:07.009447 139663816972096 hhblits.py:141] RSSISRGRTTRPYRTALWLRMAMTSWVHASSTFSLLLALHLTTVGAEQCIEQLGECGAPDSDALEVFHDDEPPANLVSLLQREVLLHRGAGLSVGRHHEQAAAGSNHSRSSTGHAAHEPRGSQSGPKRAKSGATPTSEPIIVRTDLDRTLAGKLADIAKVVHGSMAEMGSVVAGYEDRHLSHVFALVELSSGRHGATGATRTVGARWHVLGSGLVVSLVALVSCFVAFCRHKDQVTKEGEEGSMPVSLPPDTVVSLFQERIIGQPADTALELPGGMHLSYGELAGQVEGLASRIRSAGVGEAAPGVVATLFPEGTTVEHIVCALAVLHAGAVWLPLDPTLSQERLSAALADSGTRLVIT
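
When many runs fail like this, it can help to pull the key fields out of the log automatically. The following is a minimal sketch in Python, assuming the message layout of the two ERROR lines quoted above; the function name `parse_merge_error` is hypothetical and not part of HH-suite or AlphaFold.

```python
import re

# Patterns are based only on the two ERROR lines quoted above; other
# HH-suite versions may word the message differently (assumption).
LOCATION_RE = re.compile(
    r"ERROR: Error in (?P<file>\S+):(?P<line>\d+): (?P<func>\w+):"
)
DETAIL_RE = re.compile(
    r"did not find (?P<count>\d+) match states in sequence "
    r"(?P<index>\d+) of (?P<name>\S+)\. Sequence:"
)


def parse_merge_error(log_text):
    """Return a dict describing the first such error in log_text, or None."""
    loc = LOCATION_RE.search(log_text)
    detail = DETAIL_RE.search(log_text)
    if not (loc and detail):
        return None
    return {
        "source_file": loc.group("file"),
        "source_line": int(loc.group("line")),
        "function": loc.group("func"),
        "match_states": int(detail.group("count")),
        "sequence_index": int(detail.group("index")),
        "sequence_name": detail.group("name"),
    }
```

Applied to a batch of logs, this makes it easy to list which database entries trigger the error for which queries.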

Steps to Reproduce (for bugs)

  1. Set up alphafold following the readme
  2. Run with
    python3 docker/run_docker.py --fasta_paths=seqs/dltA.fa --max_template_date=2021-07-19

    dltA.fa:

    >SP_2176
    MSNKPIADMIETIEHFAQTQPSYPVYNVLGQEHTYGDLKADSDSLAAVIDQLGLPEKSPVVVFGGQEYEMLATFVALTKSGHAYIPIDSHSALERVSAILEVAEPSLIIAISAFPLEQVSTPMINLAQVQEAFAQGNNYEITHPVKGDDNYYIIFTSGTTGKPKGVQISHDNLLSFTNWMITDKEFATPSRPQMLAQPPYSFDLSVMYWAPTLALGGTLFTLPSVITQDFKQLFAAIFSLPIAIWTSTPSFADMAMLSEYFNSEKMPGITHFYFDGEELTVKTAQKLRERFPNARIINAYGPTEATVALSAVAVTDEMLATLKRLPIGYTKADSPTFIIDEEGNKLPNGEQGEIIVSGPAVSKGYMNNPEKTAEAFFEFEDLPAYHTGDVGTMTDEGLLLYGGRMDFQIKFNGYRIELEDVSQNLNKSRFIESAVAVPRYNKDHKVQNLLAYVILKDGVREQFERDIDITKAIKEDLTDIMMSYMMPSKFLYRDSLPLTPNGKIDIKGLINEVNKR

HH-suite Output (for bugs)

https://gist.github.com/johnlees/535f1012fbbded1ffaa499f40cbd4bdf

Context

Running via alphafold docker image and databases


johnlees commented 2 years ago

Also to note, this does work on other input sequences

milot-mirdita commented 2 years ago

Do you have the full command line call that was passed to HHblits? AlphaFold should print it (logging.info('Launching subprocess "%s"', ' '.join(cmd))) somewhere. I can't reproduce the issue with my local BFD. Did you check that all extracted files have the same hashes as posted on the https://bfd.mmseqs.com website? This might stem from a corrupted database.

johnlees commented 2 years ago

Thanks for the reply and for trying to reproduce. Unfortunately, I have little more information from my end.

Do you have the full command line call that was passed to HHblits? AlphaFold should print it (logging.info('Launching subprocess "%s"', ' '.join(cmd))) somewhere.

Sorry that I forgot to include this line in the gist:

/usr/bin/hhblits -i /mnt/fasta_path_0/dltA.fa -cpu 4 -oa3m /tmp/tmp_7lf6_vd/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /mnt/bfd_database_path/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /mnt/uniclust30_database_path/uniclust30_2018_08

I can't reproduce the issue with my local BFD. Did you check that all extracted files have the same hashes as posted on the https://bfd.mmseqs.com website? This might stem from a corrupted database.

for i in $(ls /media/mirrored-hdd/jlees/alphafold/bfd/); do openssl md5 /media/mirrored-hdd/jlees/alphafold/bfd/$i; done
MD5(/media/mirrored-hdd/jlees/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata)= 2dc0f09adabbcf1965ed578e0b2ab07e
MD5(/media/mirrored-hdd/jlees/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex)= 476941cf4a964d96fb3b68a82fe734d1
MD5(/media/mirrored-hdd/jlees/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata)= 4bb63ac9c3a3dd088cf654df1f548d53
MD5(/media/mirrored-hdd/jlees/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex)= 26d48869efdb50d036e2fb9056a0ae9d
MD5(/media/mirrored-hdd/jlees/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata)= 9bd2da8a8adbcc30801f0221d0dc1987
MD5(/media/mirrored-hdd/jlees/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex)= 799f308b20627088129847709f1abed6

which appear to be correct:

bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
2dc0f09adabbcf1965ed578e0b2ab07e
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
476941cf4a964d96fb3b68a82fe734d1
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
4bb63ac9c3a3dd088cf654df1f548d53
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
26d48869efdb50d036e2fb9056a0ae9d
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
9bd2da8a8adbcc30801f0221d0dc1987
bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
799f308b20627088129847709f1abed6

Tried running again, and the same result. A number of other input sequences have now worked without issue.
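
The same comparison can be scripted so that mismatches are reported instead of checked by eye. This is a minimal sketch in Python, assuming the expected hashes are supplied as a plain dictionary; the helper names `md5_of_file` and `find_mismatches` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 so very large ffdata files are not read at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, expected):
    """Return {filename: actual_md5 or None} for files that are missing or differ.

    `expected` maps a file name to its published MD5 hex digest.
    """
    mismatches = {}
    for name, want in expected.items():
        path = Path(directory) / name
        got = md5_of_file(path) if path.is_file() else None
        if got != want.lower():
            mismatches[name] = got
    return mismatches
```

For example, `find_mismatches("/path/to/bfd", {"bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex": "799f308b20627088129847709f1abed6"})` returns an empty dictionary when the file matches the hash listed above (the directory path is a placeholder).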

milot-mirdita commented 2 years ago

Can you check the uniclust too:

67c2a154110092270969ed2a971140bf  uniclust30_2018_08_cs219.ffindex
8823ce08c282d631ddcc380ff33db61a  uniclust30_2018_08_hhm.ffindex
e0d1eb872ac322280f46f12f10441c45  uniclust30_2018_08_a3m.ffindex
a0b4fc3328c89696f32b912563d51c10  uniclust30_2018_08_cs219.ffdata
2eb69c983e61337d42f7f63576728a1f  uniclust30_2018_08_hhm.ffdata
f506b0aa3c64db05c3e436bb26730275  uniclust30_2018_08_a3m.ffdata

milot-mirdita commented 2 years ago

I reproduced your issue. Weirdly, it only happens with this specific combination of BFD and UC 2018_08. Using either database alone, or a newer UC version, does not result in this error.

johnlees commented 2 years ago

Thanks for looking into this further, and glad it ended up being reproducible. I guess for me the easiest solution will be to update the uniclust DB?

milot-mirdita commented 2 years ago

That's probably the easiest solution for now. We will need some time to investigate the root cause.

zhoujingyu13687306871 commented 2 years ago

Hi! Is there a better solution to this problem at present? I get the same error message when I run my job. @milot-mirdita

yuzhiguo07 commented 2 years ago

Same error here. Is there a better solution to this problem?

gahdritz commented 2 years ago

I just ran into it too. Is there a specific newer version of UniClust that does the job?

tomgoddard commented 2 years ago

I have also seen several instances of this hhblits error with different sequences using AlphaFold. I doubt the problem is with UniClust, and updating that database will likely just make the errors occur with different sequences.

Samuel-gwb commented 2 years ago

This also happened with one protein sequence I'm working on, while there is no similar problem with other sequences.

DS-unib commented 2 years ago

I second this problem with the 2.2.0 version:

ERROR: Error in /tmp/hh-suite/src/hhalignment.cpp:3539: MergeMasterSlave:

jkosinski commented 2 years ago

I also have this issue:

Expected Behavior

hhblits does not crash on this sequence:

>Q13838_DX39B_HUMAN
MAENDVDNELLDYEDDEVETAAGGDGAEAPAKKDVKGSYVSIHSSGFRDFLLKPELLRAIVDCGFEHPSEVQHECIPQAILGMDVLCQAKSGMGKTAVFVLATLQQLEPVTGQVSVLVMCHTRELAFQISKEYERFSKYMPNVKVAVFFGGLSIKKDEEVLKKNCPHIVVGTPGRILALARNKSLNLKHIKHFILDECDKMLEQLDMRRDVQEIFRMTPHEKQVMMFSATLSKEIRPVCRKFMQDPMEIFVDDETKLTLHGLQQYYVKLKDNEKNRKLFDLLDVLEFNQVVIFVKSVQRCIALAQLLVEQNFPAIAIHRGMPQEERLSRYQQFKDFQRRILVATNLFGRGMDIERVNIAFNYDMPEDSDTYLHRVARAGRFGTKGLAITFVSDENDAKILNDVQDRFEVNISELPDEIDISSYIEQTR

when using the uniclust30_2018_08 database.

Current Behavior

hhblits crashes with the error:

...
- 13:20:30.255 INFO: Realigning 33501 HMM-HMM alignments using Maximum Accuracy algorithm

- 13:34:59.564 ERROR: Error in /tmp/eb-build/HHsuite/3.3.0/gompic-2020b/hh-suite-3.3.0/src/hhalignment.cpp:3539: MergeMasterSlave:

- 13:34:59.564 ERROR:   did not find 548 match states in sequence 1 of SRR5579859_7281350. Sequence:
 PGLGQNGAMPGIAWFKLTDPGGELPAVSSDTDLRILLPEGDEFGIQARRLADAGAQVRQVRYLLEDEAITGEGKRREVITWLSRPSQPGGGPYAKVTGPATTGARDAFELMWQDQALPIGQAAMRTRVPAVLAAFLPFSTLNPAQAEIVPEVLGHDQNLLVVAPTGAGKTVIGMAAGLKAVLEQKRKAAWLVPQRSLTDELDRELADWRGRGLRVERLSGE

Some other sequences also crash like this; I can provide them if useful.

Steps to Reproduce (for bugs)

hhblits -i crashing_seq.fasta -cpu 12 -oa3m /scratch/kosinski/output.a3m -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /scratch/AlphaFold_DBs/2.2.0/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /scratch/AlphaFold_DBs/2.2.0/uniclust30/uniclust30_2018_08/uniclust30_2018_08

ksteczk commented 2 years ago

Janek, I ran your command /opt/hhsuite/bin/hhblits -i query-kosinski.fst -cpu 100 -oa3m query-kosinski.a3m -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /db/hh/UniRef30_2020_06 -d /home/db/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt

just now on my computer with your sequence and it finished without any error...

I got 6298 lines in the a3m output. Maybe there's a problem with the computational resources you are using, or with the binaries (did you compile them for the machine?). BFD is quite big... I ran it on a single machine with 256 GB of RAM, which is not big by today's standards, with local storage. That's what I was able to check. Unfortunately, I didn't get any error.

jkosinski commented 2 years ago

Janek, I ran your command /opt/hhsuite/bin/hhblits -i query-kosinski.fst -cpu 100 -oa3m query-kosinski.a3m -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /db/hh/UniRef30_2020_06 -d /home/db/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt

just now on my computer with your sequence and it finished without any error...

I got 6298 lines in the a3m output. Maybe there's a problem with the computational resources you are using, or with the binaries (did you compile them for the machine?). BFD is quite big... I ran it on a single machine with 256 GB of RAM, which is not big by today's standards, with local storage. That's what I was able to check. Unfortunately, I didn't get any error.

Thanks, Kamil, for checking this. I can see that you checked with UniRef30_2020_06 while I used uniclust30_2018_08, like others in this thread, which is probably why it works for you. We want to use uniclust30_2018_08 because AlphaFold uses it. Can you check with uniclust30_2018_08?

BTW, it's not related to memory: longer sequences with bigger alignments run just fine on our setup.

@DS-unib also gets this error on the same sequence and database. I have four more sequences behaving like this (out of hundreds that run fine).

jkosinski commented 2 years ago

Hold on, or is UniRef30_2020_06 basically an updated version of uniclust30_2018_08?

ksteczk commented 2 years ago

Hold on, or is the UniRef30_2020_06 basically updated version of uniclust30_2018_08?

I believe yes: they are calling it uniclust, but the files are named UniRef30... Also, there's an even newer version from 2021: http://gwdu111.gwdg.de/~compbiol/uniclust/2021_03/

ksteczk commented 2 years ago

Oh, and indeed, with uniclust'18 it crashed with exactly the same error as yours. :/

jkosinski commented 2 years ago

Hold on, or is the UniRef30_2020_06 basically updated version of uniclust30_2018_08?

I believe yes - they are calling it uniclust but the files are named Uniref30... also, there's even newer version from 2021: http://gwdu111.gwdg.de/~compbiol/uniclust/2021_03/

I tested UniRef30_2020_06 on around 1,000 sequences and as Tom Goddard predicted above, it now crashes with the same error just on different sequences, like this one:

>Q9NZD8_SPG21_HUMAN
MGEIKVSPDYNWFRGTVPLKKIIVDDDDSKIWSLYDAGPRSIRCPLIFLPPVSGTADVFFRQILALTGWGYRVIALQYPVYWDHLEFCDGFRKLLDHLQLDKVHLFGASLGGFLAQKFAEYTHKSPRVHSLILCNSFSDTSIFNQTWTANSFWLMPAFMLKKIVLGNFSSGPVDPMMADAIDFMVDRLESLGQSELASRLTLNCQNSYVEPHKIRDIPVTIMDVFDQSALSTEAKEEMYKLYPNARRAHLKTGGNFPYLCRSAEVNLYVQIHLLQFHGTKYAAIDPSMVSAEELEVQKGSLGISQEEQ

Will check UniRef30_2021_03 tonight.

jkosinski commented 2 years ago

UniRef30_2021_03 gives similar errors just on different sequences.

martin-steinegger commented 2 years ago

The multi-database feature of HH-suite seems to be the problem. If you search against the Uniref30 and the BFD separately it works without crashing. Please do not use this feature.
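
Searching each database separately could be scripted roughly as follows. This is a minimal sketch in Python, not AlphaFold's actual code: the helper names and the output file naming are hypothetical, and only hhblits flags that appear in the command lines quoted in this thread are used.

```python
import subprocess


def build_hhblits_command(query, database, output_a3m, binary="hhblits", cpu=4):
    """Build one hhblits call against a single database prefix."""
    return [
        binary, "-i", query, "-cpu", str(cpu), "-oa3m", output_a3m,
        "-o", "/dev/null", "-n", "3", "-e", "0.001",
        "-maxseq", "1000000", "-realign_max", "100000",
        "-maxfilt", "100000", "-min_prefilter_hits", "1000",
        "-d", database,  # exactly one -d per call, instead of two
    ]


def search_separately(query, databases, run=subprocess.run):
    """Run hhblits once per database; `databases` maps a label to a DB prefix."""
    outputs = {}
    for label, prefix in databases.items():
        output_a3m = f"{label}.a3m"  # hypothetical output naming
        run(build_hhblits_command(query, prefix, output_a3m), check=True)
        outputs[label] = output_a3m
    return outputs
```

A call such as `search_separately("dltA.fa", {"bfd": "/path/to/bfd_prefix", "uniclust": "/path/to/uniclust_prefix"})` would produce one A3M file per database; the paths here are placeholders.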

jkosinski commented 2 years ago

The multi-database feature of HH-suite seems to be the problem. If you search against the Uniref30 and the BFD separately it works without crashing. Please do not use this feature.

Thanks! I hope DeepMind is reading this, as the command is from AlphaFold.

ksteczk commented 2 years ago

Another option would be to merge both databases into one (which shouldn't be difficult since it is ffindex based) and modify the af2 script to run it on the merged DB.
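
As a rough illustration of what such a merge involves at the index level, here is a minimal, untested-against-real-databases sketch in Python. It assumes the usual ffindex layout (an index of tab-separated name, offset and length lines pointing into one data file) and that the second data file is appended byte-for-byte to the first. The function names are hypothetical, and whether hhblits behaves well on a database merged this way is not confirmed anywhere in this thread.

```python
def merge_ffindex_entries(first_index, second_index, first_data_size):
    """Merge two lists of (name, offset, length) index entries.

    Offsets from the second index are shifted by the size of the first
    data file, because the second data file is assumed to be appended to it.
    """
    seen = {name for name, _, _ in first_index}
    merged = list(first_index)
    for name, offset, length in second_index:
        if name in seen:
            raise ValueError(f"duplicate entry name: {name}")
        merged.append((name, offset + first_data_size, length))
    # Keep the index sorted by name (assumption: lookups expect sorted keys).
    return sorted(merged)


def format_ffindex(entries):
    """Render entries as tab-separated index lines."""
    return "".join(f"{name}\t{offset}\t{length}\n" for name, offset, length in entries)
```

Each of the a3m, hhm and cs219 file pairs would need the same treatment, with entry names kept consistent across the three.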

Janek's question inspired me to check, for a few Pfam DUFs, whether adding BFD to profile building procedure brings new, informative mappings to proteins of known structures - it didn't - but maybe my choices were unfortunate. In the next recalculation of my all vs all mappings database I'll try to use fused databases for a more systematic comparison. Updating to uniref2021 itself boosted the scores a bit. But af2 probably makes use of more nuanced profile properties than simply similarity to known structures/domains so BFD might be beneficial especially in the cases of some orphan sequences, like viral ones...

amnag commented 2 years ago

The multi-database feature of HH-suite seems to be the problem. If you search against the Uniref30 and the BFD separately it works without crashing. Please do not use this feature.

Hi @martin-steinegger, how can I run AlphaFold while searching against only Uniref30 or only BFD? To the best of my knowledge, AlphaFold requires both the --uniclust30_database_path and --bfd_database_path options to run. Thanks.

grandrea commented 1 year ago

@jkosinski could you share how to edit the AlphaFold hh-suite call here to prevent this error?

jkosinski commented 1 year ago

Well, I haven't edited the code to enable this. I guess the easiest would be to run hhblits twice and combine the alignments into one file named bfd_uniclust_hits.a3m, but I don't know if that would be equivalent. Does anyone know?
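
Combining two such alignments could look roughly like this. It is a minimal sketch in Python, assuming both A3M files start with the same query record so that the second copy of the query can be dropped; the function names are hypothetical, and whether the result is equivalent to a single two-database search remains the open question raised above.

```python
def read_a3m(text):
    """Split A3M text into (header, sequence) pairs; '#' lines are skipped."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None and line.strip() and not line.startswith("#"):
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_a3m(first_text, second_text):
    """Append the hits of the second alignment to the first, keeping one query."""
    first, second = read_a3m(first_text), read_a3m(second_text)
    if not first or not second:
        raise ValueError("both alignments must contain at least the query")
    if first[0][1] != second[0][1]:
        raise ValueError("query rows differ; alignments are not comparable")
    merged = first + second[1:]  # drop the duplicate query record
    return "".join(f"{header}\n{sequence}\n" for header, sequence in merged)
```

The merged text can then be written to bfd_uniclust_hits.a3m.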

Don't tell anyone but what I did in the end was quite silly:

Out of 5,000 sequences that I have run, I got only one crashed on both with the above error.

But this is definitely not a nice solution ;-)

YaoYinYing commented 1 year ago

I downloaded the latest UniRef30_2022_02, and hhblits also reported this error.

jsko-arontier commented 1 year ago

I have been testing this issue and I think the cause is in UniRef30. The bfd+uniclust30_2018_08 combination works without any problems, while bfd+UniRef30_2022_02 always shows the same error.

mtiberti commented 2 months ago

Hi, we are also seeing the same issue when running a recent AlphaFold version with UniRef30_2021_03.