soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
529 stars 133 forks source link

HHblits error "sequences in merged A3M file do not all have the same number of columns," #337

Open ajbordner opened 1 year ago

ajbordner commented 1 year ago

Expected Behavior

HHblits command to generate MSAs in AlphaFold2 from BFD and UniRef30 databases should complete without errors.

Current Behavior

The hhblits command in AlphaFold2 often fails with the error message: "sequences in merged A3M file do not all have the same number of columns," and the program fails. This happens for many sequences (not just for the example provided) and prevents the latest version of AlphaFold2 from running using the full BFD database.

The error on occurs if both databases (BFD and UniRef30) are searched.

Steps to Reproduce (for bugs)

  1. Create the following FASTA file named test.fasta:

    chain_A LEVQVPEDPVVALVGTDATLCCSFSPEPGFSLAQLNLIWQLTDTKQLVHSFAEGQDQGSAYANRTALFPDLLAQGNASLRLQRVRVADEGSFTCFVSIRDFGSAAVSLQVAAPYSKPSMTLEPNKDLRPGDTVTITCSSYQGYPEAEVFWQDGQGVPLTGNVTTSQMANEQGLFDVHSILRVVLGANGTYSCLVRNPVLQQDAHSSVTITPQRSPTGAVEVQVPEDPVVALVGTDATLRCSFSPEPGFSLAQLNLIWQLTDTKQLVHSFTEGRDQGSAYANRTALFPDLLAQGNASLRLQRVRVADEGSFTCFVSIRDFGSAAVSLQVAAPYSKPSMTLEPNKDLRPGDTVTITCSSYRGYPEAEVFWQDGQGVPLTGNVTTSQMANEQGLFDVHSVLRVVLGANGTYSCLVRNPVLQQDAHGSVTITGQPMTFPPEA

  2. Download the latest BFD and UniRef30 databases for AlphaFold2 v2.3.0 to .

  3. Run this hhblits command: hhblits -i test.fasta -cpu 4 -oa3m output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /uniref30/UniRef30_2021_03

HH-suite Output (for bugs)

hhblits_test_AF_v2.3.0.log

Context

Refer to original AlphaFold2 issue at https://github.com/deepmind/alphafold/issues/661

Environment

jstrobaek commented 1 year ago

I've been running into this issue lately, and it looks like it's database-related (as far as I understand it). Or at least the trigger is the inclusion of specific records from the database.

Out of curiosity, I ran your sequence (@ajbordner) through hhblits with the same settings, then quickly compared the output to my own. Both our sequences get (3753 unique) warnings for the same database records (warning_ids.txt) related to maximum number of residues 32763 exceeded in sequence [ID].

Here's a pairwise comparison of our sequences:

ajbordner_seq      1 LEVQVPED-PVVALVGTDATLCCSFSPEPGFSLAQLNLIW-QLTDTKQLV     48
                      |||:.:. |.:...|....:.|..|   |:.....|:.| :.:..|.|.
jstrobaek_seq      1 -EVQLQQSGPELVKPGASVRMPCKAS---GYIFTDYNMDWVKRSHGKSLE     46

ajbordner_seq     49 HSFAEGQDQGSAYANRTALFPDLLAQGNASLRLQRVRVADEGSFTCFVSI     98
                     .......:.|....|:.       .:|.|:|      ..|:.|.|.::.:
jstrobaek_seq     47 WIGDINPNNGGTIYNQK-------FKGKATL------TVDKSSSTAYMEL     83

ajbordner_seq     99 R-------------------DFGSAAVSLQVAAPYSKPSMT--LEPNKDL    127
                     |                   |:.....|:.|::..:.|...  |.|....
jstrobaek_seq     84 RSLTSEDTAVYYCARGRIAMDYWGQGTSVTVSSAKTTPPSVYPLAPGSAA    133

ajbordner_seq    128 RPGDTVTITCSSYQGYPE-AEVFWQDGQGVPLTGNVTTSQMANEQGLFDV    176
                     :....||:.|.....:|| ..|.|..|.   |:..|.|.....:..|:.:
jstrobaek_seq    134 QTNSMVTLGCLVKGYFPEPVTVTWNSGS---LSSGVHTFPAVLQSDLYTL    180

ajbordner_seq    177 HSILRVVLGA--NGTYSCLVRNPVLQQDAHSSVTITPQRSPTGAVEVQVP    224
                     .|.:.|....  :.|.:|.|.:|.      ||..:..:..|         
jstrobaek_seq    181 SSSVTVPSSTWPSETVTCNVAHPA------SSTKVDKKIVP---------    215

ajbordner_seq    225 EDPVVALVGTDATLRCSFSPEPGFSLAQLNLIWQLTDTKQLVHSFTEGRD    274

jstrobaek_seq    216 --------------------------------------------------    215

ajbordner_seq    275 QGSAYANRTALFPDLLAQGNASLRLQRVRVADEGSFTCFVSIRDFGSAAV    324

jstrobaek_seq    216 --------------------------------------------------    215

ajbordner_seq    325 SLQVAAPYSKPSMTLEPNKDLRPGDTVTITCSSYRGYPEAEVFWQDGQGV    374

jstrobaek_seq    216 --------------------------------------------------    215

ajbordner_seq    375 PLTGNVTTSQMANEQGLFDVHSVLRVVLGANGTYSCLVRNPVLQQDAHGS    424

jstrobaek_seq    216 --------------------------------------------------    215

ajbordner_seq    425 VTITGQPMTFPPEA    438

jstrobaek_seq    216 --------------    215

Also, the error reported for both these runs are the same:

# ajbordner_seq error
ERROR: TITIN (Fragment) n=1 Tax=Poeciliopsis prolifica TaxID=188132 RepID=A0A0S7EPY7_9TELE
ERROR: Error in /tmp/hh-suite/src/hhalignment.cpp:1244: Compress:

ERROR:   sequences in merged A3M file do not all have the same number of columns, 
ERROR:   e.g. first sequence and sequence UniRef100_A0A0S7EPY7.

# jstrobaek_seq error
ERROR: TITIN (Fragment) n=1 Tax=Poeciliopsis prolifica TaxID=188132 RepID=A0A0S7EPY7_9TELE
ERROR: Error in /tmp/hh-suite/src/hhalignment.cpp:1244: Compress:

ERROR:   sequences in merged A3M file do not all have the same number of columns, 
ERROR:   e.g. first sequence and sequence UniRef100_A0A0S7EPY7.

Haven't looked any further into the records referenced in the warnings (not sure how, or where to begin ATM), but I'm considering removing them from the database to see if that resolves the issue (I'll give an update if I try that out). Potentially, I'll try to download the databases from scratch, since that seems to have helped for some (see deepmind/alphafold/issues/#97), although my checksums are fine so I doubt it would make a difference.

I'm guessing this is the root cause of some other reported issues, including deepmind/alphafold/issues/#661 and deepmind/alphafold/issues/#238.

kpetrov commented 1 year ago

Hi!

did you have a possibility to have a look into this problem? we are hitting this issue as well

thanks! k.

linuxfold commented 1 year ago

Also running into a similar issue on fresh install.

run_docker.py:258] - 06:41:34.166 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!

and

run_docker.py:258] - 06:41:39.569 WARNING: maximum number of residues 32763 exceeded in sequence UniRef100_UPI0005F4645A titin n=1 Tax=Colobus angolensis palliatus TaxID=336983 RepID=UPI0005F4645A

tcoates5 commented 9 months ago

Pretty sure I am seeing the same issue on a few different sequences. I realized that the merged MSA is included in the error output, and with a rather disgusting chain of grep and awk commands wrangled out just the sequences. I had been hoping that just one or a few sequences would have different numbers of columns, but it looks like most lines have different numbers of columns. In case anyone is curious or wants to do the same with their output files (assuming everybody stores the output with something like nohup), this is what I eventually built into a bash script to look at the merged MSA buried in the error:

grep -e "141]" $1 | grep -v "WARNING" | grep -v " INFO" | awk '{$1=$2=$3=$4=$5=$6=$7=$8=""; print $0}' | grep -v "ERROR. 1" | grep -v "ERROR. 2" | grep -v "ERROR. 3" | grep -v "ERROR. 4" | grep -v "ERROR. 5" | grep -v "ERROR. 6" | grep -v "ERROR. 7" | grep -v "ERROR. 8" | grep -v "ERROR. 9" | awk '{$1=$1};1' | grep -v "UniRef" > test7.txt

I am sure someone else could make something much prettier to do the job, but this does appear to have worked.