soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Cannot create index (Error: indexdb died) for Uniclust30 profile database generated using convertprofiledb #130

Closed salvoc81 closed 5 years ago

salvoc81 commented 6 years ago

Expected Behavior

Precomputed MMseqs2 index tables are generated using createindex.

Current Behavior

Fails after a few minutes of computation with the following error message: indexdb died

Steps to Reproduce (for bugs)

Download the Uniclust30 database (August 2018): wget --verbose http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz

Extract the downloaded archive.

Convert the hhm file (uniclust30_2018_08_hhm_db) to an MMseqs profile DB: mmseqs convertprofiledb uniclust30_2018_08_hhm_db profile_from_hmm --threads 30 -v 3

Generate the index: mmseqs createindex ./profile_from_hhm/profile_from_hmm ./tmp -k 5 -s 7 --threads 36 -v 3

MMseqs Output (for bugs)

> mmseqs createindex ./profile_from_hhm/profile_from_hmm ./tmp -k 5 -s 7 --threads 36 -v 3
Program call:
createindex ./profile_from_hhm/profile_from_hmm ./tmp -k 5 -s 7 --threads 36 -v 3

MMseqs Version:         6.f5a1c
Sub Matrix              blosum62.out
K-mer size              5
Alphabet size           21
Max. results per query  300
Max. sequence length    65535
Mask Residues           1
Spaced Kmer             1
Spaced k-mer pattern
Sensitivity             7
K-score                 0
Include Header          false
Split DB                0
Split Memory Limit      0
Threads                 36
Verbosity               3
Min codons in orf       30
Max codons in length    98202
Max orf gaps            2147483647
Contig start mode       2
Contig end mode         2
Orf start mode          0
Forward Frames          1,2,3
Reverse Frames          1,2,3
Translation Table       1
Use all table starts    false
Offset of numeric ids   0
Add Orf Stop            false
Remove Temporary Files  false

Tmp ./tmp folder does not exist or is not a directory.
Created dir ./tmp
Program call:
indexdb ./profile_from_hhm/profile_from_hmm ./profile_from_hhm/profile_from_hmm --sub-mat blosum62.out -k 5 --alph-size 21 --max-seqs 300 --max-seq-len 65535 --mask 1 --spaced-kmer-mode 1 -s 7 --k-score 2147483647 --include-headers 0 --split 0 --split-memory-limit 0 --threads 36 -v 3

MMseqs Version:         6.f5a1c
Sub Matrix              blosum62.out
K-mer size              5
Alphabet size           21
Max. results per query  300
Max. sequence length    65535
Mask Residues           1
Spaced Kmer             1
Spaced k-mer pattern
Sensitivity             7
K-score                 2147483647
Include Header          false
Split DB                0
Split Memory Limit      0
Threads                 36
Verbosity               3

Substitution matrices...
Use kmer size 5 and split 1 using Target split mode.
Needed memory (20178976034 byte) of total memory (486687909888 byte)
Index table: counting k-mers...

Context

I am trying to generate a profile DB from the file uniclust30_2018_08_hhm_db contained in the 2018_08 release of Uniclust30 (http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz). I am using convertprofiledb and then createindex...

NOTE: I have used the same procedure to generate the profile DB using the HHblits profiles for Pfam 31 downloaded from: http://wwwuser.gwdg.de/%7Ecompbiol/data/hhsuite/databases/hhsuite_dbs/pfamA_31.0.tgz

Your Environment

MMseqs Version: 6.f5a1c f5a1cdb
MMseqs was self-compiled
gcc (Homebrew gcc 5.5.0_4) 5.5.0
cmake 3.12.3

Server specifications:

less /proc/cpuinfo

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 63
model name  : Intel(R) Xeon(R) CPU E5-4627 v3 @ 2.60GHz
stepping    : 2
microcode   : 0x3a
cpu MHz     : 3001.882
cache size  : 25600 KB
physical id : 0
siblings    : 10
core id     : 0
cpu cores   : 10
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 15
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bogomips    : 5199.77
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual

milot-mirdita commented 6 years ago

Hi Salvatore,

The Uniclust profiles need a different search strategy. The default profile search only works for at most a few hundred thousand profiles; beyond that, the memory requirements explode. We are currently working on a different profile search strategy for large databases. I'll update you once it's ready.

Best regards, Milot
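
A rough way to see whether a profile database is in the range described above is to count its entries before calling createindex. The sketch below assumes the database's .index file holds one line per entry; count_db_entries is a hypothetical helper name, not an MMseqs2 command, and the 300000 limit in the usage comment is a placeholder rather than an official threshold.

```shell
# Count the entries of an MMseqs2 database by counting lines in its .index file.
# count_db_entries is a hypothetical helper, not an MMseqs2 command.
count_db_entries() {
    db="$1"
    if [ ! -f "${db}.index" ]; then
        echo "index file ${db}.index not found" >&2
        return 1
    fi
    # wc -l prints only the count when reading stdin; tr strips any padding.
    wc -l < "${db}.index" | tr -d ' '
}

# Example use before indexing (the 300000 limit is an assumed placeholder):
# n=$(count_db_entries ./profile_from_hhm/profile_from_hmm)
# [ "$n" -gt 300000 ] && echo "Warning: $n profiles may need too much memory"
```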

salvoc81 commented 6 years ago

Thanks a lot @milot-mirdita. By the way, do you think the HHblits PfamA profiles will be updated to version 32 of Pfam anytime soon? I might consider using those...

Thanks a lot,

Salvo

milot-mirdita commented 6 years ago

Thanks for letting me know that there was an update. I just started the job; due to the irregular releases of Pfam, it's not automated. If it doesn't run into any problems, we should have a new release up in a few days.

milot-mirdita commented 6 years ago

I just finished generating and uploaded the PfamA 32 db: http://gwdu111.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/pfamA_32.0.tar.gz

salvoc81 commented 6 years ago

Thanks a lot Milot!


gaboentropy commented 5 years ago

Do the hhblits Pfam profiles work with mmseqs?

Anyway, hhblits runs very very slowly compared to mmseqs, so, if the Pfam profiles for hhblits don't work with mmseqs, I'd suggest using the Pfam profiles generated for mmseqs instead. It works great in my hands.

milot-mirdita commented 5 years ago

The hhblits PFAM profiles work with MMseqs2. However, I compared them recently to the PFAM.full MSAs and they were about equal, with more effort needed to build the database. I would recommend sticking with the workflow described in the wiki.

HHblits will however be more sensitive than MMseqs2, due to its iterative profile-profile search capabilities.

gaboentropy commented 5 years ago

Thanks Milot, I'm using what's described in the wiki, only using the Pfam-A.fasta.gz because I get results much more consistent with those obtained using HMMER with the Pfam-A HMM database (mmseqs does it in a fraction of the time, of course, for which I'm eternally grateful to you). Sorry for going off the topic here.

milot-mirdita commented 5 years ago

Happy to hear :)

I just added a small remark regarding k-mer size for the profile searches to the wiki entry (if you have enough system memory use -k 6).

If I understood correctly, then the Pfam-A.full should be closer to our pfamA HHblits database, which represents three search iterations of the seed alignments against the Uniclust.

gaboentropy commented 5 years ago

Yep. The Pfam-A.full should be closer to your HHblits database. However, it contained fewer families (yes, I know, I am surprised too) than the seed alignments (Pfam-A.fasta) last time I checked.

Yes, I'm using -k 6.

I'd like to suggest again that your program report the memory and hard drive requirements to the user in gigabytes, even if it keeps them in bytes internally (please).
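
Until then, the byte counts from the log can be converted by hand. This is a minimal sketch using awk; bytes_to_gib is a hypothetical helper and not part of MMseqs2, and it assumes binary gigabytes (1 GiB = 1024^3 bytes).

```shell
# Convert a byte count to GiB with two decimals.
# bytes_to_gib is a hypothetical helper, not part of MMseqs2.
bytes_to_gib() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

bytes_to_gib 20178976034     # "Needed memory" from the indexdb log above
bytes_to_gib 486687909888    # "total memory" from the indexdb log above
```

For the log above this gives about 18.79 GiB needed out of about 453.26 GiB total.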

Best and thanks again.

gaboentropy commented 5 years ago

Sorry, what I meant was that the Pfam-A.seed was the one that concurs with the hmm database. The fasta one is not a multiple alignment. I was a bit mistaken because I've been working with the CDD database too (which has multiple alignments in fasta format).

Sorry for the confusion if any was caused by my comment.

milot-mirdita commented 5 years ago

@salvoc81 The PfamA HH-suite database had one broken entry that caused hhsearch to always fail and hhblits to fail occasionally. Please download it again.

martin-steinegger commented 5 years ago

Is this solved now?

salvoc81 commented 5 years ago

Hello @martin-steinegger and @milot-mirdita. Sorry I could not test this earlier... I tried today to convert the HMM (Pfam 32) to profiles, but I think some files are missing. The symlink to pfam_hhm.ffdata is missing, and pfam_hhm_db.index is missing.

I am not sure how to create the pfam_hmm_db.index file

salvoc81 commented 5 years ago

The following commands work with pfamA_31.0.tgz:

mmseqs convertprofiledb pfam_hhm_db pfam31_hhblits_profile --threads 36 -v 3 
mmseqs createindex ./pfam31_hhblits_profile ./tmp -k 6 -s 7 --threads 36 -v 3

For version 31.0 of the package everything works fine, and search is completed correctly.

When I open version 32 of the package (pfamA_32.0.tar.gz), it does not contain the following files: pfam_hhm_db -> pfam_hhm.ffdata (a symlink, which I can create myself) and pfam_hhm_db.index (a ~500K file, which I am not sure how to create).

When I run the following command (after creating the symlink):

mmseqs convertprofiledb pfam_hhm_db pfam32_hhblits_profile --threads 36 -v 3 

it fails with the following output:

convertprofiledb pfam_hhm_db pfam31_hhblits_profile --threads 36 -v 3

MMseqs Version:     d36dea228b039f652a7d3e1c79e3e8d40df83125
Substitution matrix blosum62.out
Profile type        0
Threads             36
Compressed          0
Verbosity           3

No datafile could be found for pfam_hhm_db!

I have generated the symlinks as you suggested, but the file pfam_hhm_db.index contained in version 31 is not a symlink...

milot-mirdita commented 5 years ago

Sorry for the confusion. The _db.index files were meant for compatibility with HHsuite 2.x. However, we dropped support for those.

The confusing part is that HHblits produces a data file with the suffix .ffdata and an index file with the suffix .ffindex, whereas MMseqs2 expects the same data file without a suffix and the index file with the suffix .index.

You can make the following two symlinks:

ln -s pfam_hhm.ffdata pfam_hhm
ln -s pfam_hhm.ffindex pfam_hhm.index

And then call MMseqs2:

convertprofiledb pfam_hhm ...

Alternatively, I now changed this behavior in c9ac77558aa06391ead4dd95b5cf89eea715f348 to look for .ffdata and .ffindex first in convertprofiledb
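
For MMseqs2 builds that predate that commit, the two symlinks above can be wrapped in a small helper that works for any HH-suite database prefix. This is only a sketch: link_hhsuite_db is a hypothetical name, not an MMseqs2 command, and it assumes the .ffdata and .ffindex files sit next to each other in the same directory.

```shell
# Create the suffix-less data symlink and the .index symlink that MMseqs2 expects.
# link_hhsuite_db is a hypothetical helper name, not an MMseqs2 command.
# Usage: link_hhsuite_db pfam_hhm
link_hhsuite_db() {
    prefix="$1"
    # Refuse to create dangling links if either HH-suite file is missing.
    for ext in ffdata ffindex; do
        if [ ! -e "${prefix}.${ext}" ]; then
            echo "missing ${prefix}.${ext}" >&2
            return 1
        fi
    done
    # Link targets are relative, so the directory can be moved as a whole.
    ln -sf "$(basename "${prefix}").ffdata" "${prefix}"
    ln -sf "$(basename "${prefix}").ffindex" "${prefix}.index"
}
```

After running it, convertprofiledb can be called with the prefix (for example pfam_hhm) as shown above.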