Open ajbordner opened 1 year ago
I've been running into this issue lately, and it looks like it's database-related (as far as I understand it). Or at least the trigger is the inclusion of specific records from the database.
Out of curiosity, I ran your sequence (@ajbordner) through hhblits
with the same settings, then quickly compared the output to my own. Both our sequences get (3753 unique) warnings for the same database records (warning_ids.txt) related to maximum number of residues 32763 exceeded in sequence [ID]
.
Here's a pairwise comparison of our sequences:
ajbordner_seq 1 LEVQVPED-PVVALVGTDATLCCSFSPEPGFSLAQLNLIW-QLTDTKQLV 48
|||:.:. |.:...|....:.|..| |:.....|:.| :.:..|.|.
jstrobaek_seq 1 -EVQLQQSGPELVKPGASVRMPCKAS---GYIFTDYNMDWVKRSHGKSLE 46
ajbordner_seq 49 HSFAEGQDQGSAYANRTALFPDLLAQGNASLRLQRVRVADEGSFTCFVSI 98
.......:.|....|:. .:|.|:| ..|:.|.|.::.:
jstrobaek_seq 47 WIGDINPNNGGTIYNQK-------FKGKATL------TVDKSSSTAYMEL 83
ajbordner_seq 99 R-------------------DFGSAAVSLQVAAPYSKPSMT--LEPNKDL 127
| |:.....|:.|::..:.|... |.|....
jstrobaek_seq 84 RSLTSEDTAVYYCARGRIAMDYWGQGTSVTVSSAKTTPPSVYPLAPGSAA 133
ajbordner_seq 128 RPGDTVTITCSSYQGYPE-AEVFWQDGQGVPLTGNVTTSQMANEQGLFDV 176
:....||:.|.....:|| ..|.|..|. |:..|.|.....:..|:.:
jstrobaek_seq 134 QTNSMVTLGCLVKGYFPEPVTVTWNSGS---LSSGVHTFPAVLQSDLYTL 180
ajbordner_seq 177 HSILRVVLGA--NGTYSCLVRNPVLQQDAHSSVTITPQRSPTGAVEVQVP 224
.|.:.|.... :.|.:|.|.:|. ||..:..:..|
jstrobaek_seq 181 SSSVTVPSSTWPSETVTCNVAHPA------SSTKVDKKIVP--------- 215
ajbordner_seq 225 EDPVVALVGTDATLRCSFSPEPGFSLAQLNLIWQLTDTKQLVHSFTEGRD 274
jstrobaek_seq 216 -------------------------------------------------- 215
ajbordner_seq 275 QGSAYANRTALFPDLLAQGNASLRLQRVRVADEGSFTCFVSIRDFGSAAV 324
jstrobaek_seq 216 -------------------------------------------------- 215
ajbordner_seq 325 SLQVAAPYSKPSMTLEPNKDLRPGDTVTITCSSYRGYPEAEVFWQDGQGV 374
jstrobaek_seq 216 -------------------------------------------------- 215
ajbordner_seq 375 PLTGNVTTSQMANEQGLFDVHSVLRVVLGANGTYSCLVRNPVLQQDAHGS 424
jstrobaek_seq 216 -------------------------------------------------- 215
ajbordner_seq 425 VTITGQPMTFPPEA 438
jstrobaek_seq 216 -------------- 215
Also, the error reported for both these runs are the same:
# ajbordner_seq error
ERROR: TITIN (Fragment) n=1 Tax=Poeciliopsis prolifica TaxID=188132 RepID=A0A0S7EPY7_9TELE
ERROR: Error in /tmp/hh-suite/src/hhalignment.cpp:1244: Compress:
ERROR: sequences in merged A3M file do not all have the same number of columns,
ERROR: e.g. first sequence and sequence UniRef100_A0A0S7EPY7.
# jstrobaek_seq error
ERROR: TITIN (Fragment) n=1 Tax=Poeciliopsis prolifica TaxID=188132 RepID=A0A0S7EPY7_9TELE
ERROR: Error in /tmp/hh-suite/src/hhalignment.cpp:1244: Compress:
ERROR: sequences in merged A3M file do not all have the same number of columns,
ERROR: e.g. first sequence and sequence UniRef100_A0A0S7EPY7.
Haven't looked any further into the records referenced in the warnings (not sure how, or where to begin ATM), but I'm considering removing them from the database to see if that resolves the issue (I'll give an update if I try that out). Potentially, I'll try to download the databases from scratch, since that seems to have helped for some (see deepmind/alphafold/issues/#97), although my checksums are fine so I doubt it would make a difference.
I'm guessing this is the root cause of some other reported issues, including deepmind/alphafold/issues/#661 and deepmind/alphafold/issues/#238.
Hi!
did you have a possibility to have a look into this problem? we are hitting this issue as well
thanks! k.
Also running into a similar issue on fresh install.
run_docker.py:258] - 06:41:34.166 WARNING: Number of match columns too large. Only first 19999 match columns will be kept!
and
run_docker.py:258] - 06:41:39.569 WARNING: maximum number of residues 32763 exceeded in sequence UniRef100_UPI0005F4645A titin n=1 Tax=Colobus angolensis palliatus TaxID=336983 RepID=UPI0005F4645A
Pretty sure I am seeing the same issue on a few different sequences. I realized that the merged MSA is included in the error output, and with a rather disgusting chain of grep and awk commands wrangled out just the sequences. I had been hoping that just one or a few sequences would have different numbers of columns, but it looks like most lines have different numbers of columns. In case anyone is curious or wants to do the same with their output files (assuming everybody stores the output with something like nohup), this is what I eventually built into a bash script to look at the merged MSA buried in the error:
grep -e "141]" $1 | grep -v "WARNING" | grep -v " INFO" | awk '{$1=$2=$3=$4=$5=$6=$7=$8=""; print $0}' | grep -v "ERROR. 1" | grep -v "ERROR. 2" | grep -v "ERROR. 3" | grep -v "ERROR. 4" | grep -v "ERROR. 5" | grep -v "ERROR. 6" | grep -v "ERROR. 7" | grep -v "ERROR. 8" | grep -v "ERROR. 9" | awk '{$1=$1};1' | grep -v "UniRef" > test7.txt
I am sure someone else could make something much prettier to do the job, but this does appear to have worked.
Expected Behavior
HHblits command to generate MSAs in AlphaFold2 from BFD and UniRef30 databases should complete without errors.
Current Behavior
The hhblits command in AlphaFold2 often fails with the error message: "sequences in merged A3M file do not all have the same number of columns," and the program fails. This happens for many sequences (not just for the example provided) and prevents the latest version of AlphaFold2 from running using the full BFD database.
The error on occurs if both databases (BFD and UniRef30) are searched.
Steps to Reproduce (for bugs)
Create the following FASTA file named test.fasta:
Download the latest BFD and UniRef30 databases for AlphaFold2 v2.3.0 to.
Run this hhblits command: hhblits -i test.fasta -cpu 4 -oa3m output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /uniref30/UniRef30_2021_03
HH-suite Output (for bugs)
hhblits_test_AF_v2.3.0.log
Context
Refer to original AlphaFold2 issue at https://github.com/deepmind/alphafold/issues/661
Environment