soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Error: "invalid database read for database" for some, but not all, files. #35

Closed: gmteunisse closed this issue 7 years ago

gmteunisse commented 7 years ago

I'm trying to create clusters of a number of pre-clustered databases, all stored in multi-FASTA files. The createdb command seems to run fine on all files; however, the cluster command leads to an error on some, but not all, files:

Invalid database read for database data file=tmp_clusters/DB, database index=tmp_clusters/DB.index
getData: local id (27) >= db size (27)
Rescore diagonals.
Could not open data file tmp_clusters/tmp/linclust/pref!
Could not open data file tmp_clusters/tmp/linclust/pref_rescore1!
awk: can't open file tmp_clusters/tmp/linclust/pre_clust.index
 source line number 1

This error is followed by many other errors, all related to being unable to open data files (e.g. Could not open data file tmp_clusters/tmp/linclust/pref!).

The error seems to depend on the sequences in the file. For example, if I merge two of the FASTA files, one that runs without errors and one that leads to the above error, the error is reproduced. I have not been able to identify which feature of the sequences leads to the error. Before every run of mmseqs, I delete all temp files.

Regarding my environment:

I've attached two files, one that leads to errors in my build, and one that does not. Hope you can help.

DB_errors.txt DB_no_errors.txt

martin-steinegger commented 7 years ago

This error occurs because of the number of sequences in the set. If there are more threads than sequences, the kmermatcher module will fail. This should be fixed in the current version, a357368b3e336a6a42772fb085fbc141df50b8ac. Could you please check out the most recent version and rerun it?
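For anyone who cannot update right away, the condition described above (more threads than sequences) can be checked before running the cluster command. The following is only a minimal sketch and not part of MMseqs2: the helper names `count_fasta_records` and `safe_thread_count` are hypothetical, and it assumes that keeping the thread count at or below the number of sequences avoids the failure on affected builds.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines (those starting with '>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def safe_thread_count(num_sequences, requested_threads):
    """Return a thread count no larger than the number of sequences, and at least 1.

    Assumption: the failure is triggered only when threads > sequences,
    as described in the comment above.
    """
    return max(1, min(num_sequences, requested_threads))
```

With the 27-entry database from the log above, `safe_thread_count(27, 64)` gives 27. That value could then be passed to the cluster command through its `--threads` option, assuming the installed build accepts it. Updating to the fixed version remains the recommended route.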

gmteunisse commented 7 years ago

Updating seems to have fixed the problem. Thank you for your quick and effective response!