soedinglab / hh-suite

Remote protein homology detection suite.
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3019-7
GNU General Public License v3.0
515 stars 128 forks source link

hhblits crashes if a sequence id is longer than 31 characters #263

Open konstin opened 3 years ago

konstin commented 3 years ago

I tried to search a fasta file containing ids more than 31 characters using hhblits, which resulted in an opaque error message when hhblits reached that sequence.

Expected Behavior

hhblits works with sequence ids longer than 31 characters

Current Behavior

hhblits should either work with sequence ids longer than 31 characters, or inform the user (ideally in the beginning) the long ids are not supported

Steps to Reproduce (for bugs)

Minimized example cath_5.fasta:

>cath-4_2_0-1a15A00-1-67
KPVSLSYRCPCRFFESHVARANVKHLKILNTPACALQIVARLKNNNRQVCIDPKLKWIQEYLEKALN
>cath-4_2_0-1a1rA02-121-205
TPCTCGSSDLYLVTRHADVIPVRRRGDSRGSLLSPRPISYLKGSSGGPLLCPTGHAVGLFRAAVCTRGVAKAVDFIPVENLETTM
>cath-4_2_0-1a1vA02-325-430_452-483
PPGSVTVPHPNIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMTGFTGDFDSVIDCNTPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPS
>cath-4_2_0-1a1vA03-432-451_484-624
VTQTVDFSLDPTFTIETTTLGMFDSSVLCECYDAGXAWYELTPAETTVRLRAYMNTPGLPVCQDHLEFWEGVFTGLTHIDAHFLSQTKQSGENFPYLVAYQATVCARAQAPPPSWDQMWKCLIRLKPTLHGPTPLLYRLGAVQNEVTLTHPITKYIMTCMS
>cath-4_2_0-1a1zA00-1-83
MDPFLVLLHSVSSSLSSSELTELKGLCLGRVGKRKLERVQSGLDLFSMLLEQNDLEPGHTELLRELLASLRRHDLLRRVDDFELEHHHHHH

Try to build a custom database:

hhsuite/bin/ffindex_from_fasta -s cath_5_fas.ff{data,index} cath_5.fasta
hhsuite/bin/hhblits_omp -i cath_5_fas -oa3m query.a3m -d /path/to/UniRef30_2020_06 -cpu 5

This gives the following output:

- 14:58:10.139 INFO: Searching 25985124 column state sequences.
- 14:58:10.240 INFO: Thread 3   cath-4_2_0-1a1rA02-121-205
- 14:58:10.240 INFO: Thread 1   h-4_2_0-1a15A00-1-67
- 14:58:10.242 INFO: Thread 0   cath-4_2_0-1a1vA02-325-430_452-4|
- 14:58:10.243 INFO: Thread 2   cath-4_2_0-1a1vA03-432-451_484-6A
- 14:58:10.247 INFO: Thread 4   cath-4_2_0-1a1zA00-1-83
- 14:58:10.325 ERROR: Error in /big/martin/hh-suite/src/hhfunc.cpp:83: ReadQueryFile:
- 14:58:10.325 ERROR:   unrecognized input file format in 'cath-4_2_0-1a1vA02-325-430_452-4|'
- 14:58:10.325 ERROR:   line = PPGSVTVPHPNIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMTGFTGDFDSVIDCNTPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPS

Looking into cath_5_fas.ffindex (notice the non-ascii characters):

cath-4_2_0-1a15A00-1-67 0   94
cath-4_2_0-1a1rA02-121-205  94  115
cath-4_2_0-1a1vA02-325-430_452-4¹‚‘o•  209 176
cath-4_2_0-1a1vA03-432-451_484-6¹‚‘o•  385 199
cath-4_2_0-1a1zA00-1-83 584 118

The non-ASCII character get rendered differently depending on the editor:

grafik

To check let's use only ids with length <32:

th-4_2_0-1a15A00-1-67
KPVSLSYRCPCRFFESHVARANVKHLKILNTPACALQIVARLKNNNRQVCIDPKLKWIQEYLEKALN
>cath-4_2_0-1a1rA02-121-205
TPCTCGSSDLYLVTRHADVIPVRRRGDSRGSLLSPRPISYLKGSSGGPLLCPTGHAVGLFRAAVCTRGVAKAVDFIPVENLETTM
>cath-4_2_0-1a1vA02-325-430_452-
PPGSVTVPHPNIEEVALSTTGEIPFYGKAIPLEVIKGGRHLIFCHSKKKCDELAAKLVALGINAVAYYRGLDVSVIPTSGDVVVVATDALMTGFTGDFDSVIDCNTPQDAVSRTQRRGRTGRGKPGIYRFVAPGERPS
>cath-4_2_0-1a1vA03-432-451_484-
VTQTVDFSLDPTFTIETTTLGMFDSSVLCECYDAGXAWYELTPAETTVRLRAYMNTPGLPVCQDHLEFWEGVFTGLTHIDAHFLSQTKQSGENFPYLVAYQATVCARAQAPPPSWDQMWKCLIRLKPTLHGPTPLLYRLGAVQNEVTLTHPITKYIMTCMS
>cath-4_2_0-1a1zA00-1-83
MDPFLVLLHSVSSSLSSSELTELKGLCLGRVGKRKLERVQSGLDLFSMLLEQNDLEPGHTELLRELLASLRRHDLLRRVDDFELEHHHHHH

This file now passes (with v3.1.0; with v3.3.0 it crashes due to #260)

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

milot-mirdita commented 3 years ago

In ffindex.h you can increase the following define.

#define FFINDEX_MAX_ENTRY_NAME_LENTH 32

We didn't want to change that as we are not 100% sure it won't break anything downstream. You can build a database with MMseqs2 (mmseqs createdb) and symlink the required file names for HH-suite as described in the other thread (https://github.com/soedinglab/hh-suite/issues/262#issuecomment-831439909).

apcamargo commented 3 years ago

I changed the source and did some quick tests. Found no problems so far (this doesn't mean that I encourage people to do this).

@milot-mirdita This is sort of related to this thread: I was trying to create a ffdata and ffindex files from A3M alignments using ffindex_build and I got the character limit error. Could I use mmseqs createdb here? As far as I know there's no way of creating a MSA DB from a set of A3M files using MMSeqs2 (convertmsa only allows Stockholm).

milot-mirdita commented 3 years ago

mmseqs tar2db might be one of the easiest ways to create a database like that.

Sorry, my createdb suggestion doesn't actually work, for the original problem you'd need something with mmseqs createseqfiledb.