Closed: nick-youngblut closed this issue 5 years ago.
Parallelization across compute nodes should work without issue by using MPI/OpenMP hybrid parallelization: https://github.com/soedinglab/MMseqs2/wiki#how-to-run-mmseqs2-on-multiple-servers-using-mpi
(Set the RUNNER environment variable to the mpirun invocation.)
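For example, a minimal sketch of such an invocation (the process count and database names below are placeholders; see the wiki page above for the full multi-server setup):
# distribute the search over MPI ranks by setting RUNNER before the search call
export RUNNER="mpirun -np 4"
mmseqs search queryDB targetDB resultDB tmp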
Regarding the memory usage: MMseqs2 will use as much memory as the compute node has available. You can force it to split the target database into chunks to reduce the peak memory usage, at the cost of a slight increase in run time. Use either the --split or the --split-memory-limit parameter for that.
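For example (a sketch; the memory limit, split count, and database names are placeholders, and the exact search command depends on your workflow):
# cap peak memory: let MMseqs2 choose the number of target splits for ~100 GB
mmseqs search queryDB targetDB resultDB tmp --split-memory-limit 100G
# or request a fixed number of target splits explicitly
mmseqs search queryDB targetDB resultDB tmp --split 4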
Thanks for the help! Sadly, I'm probably not going to be able to use MPI. Instead, I'll have to partition the input into batches and run each batch as a separate cluster job. I'm now trying to figure out the best way to do this, given the input files required:
For mmseqs taxonomy, I'm pretty sure that I need the following input files:
Is there an easy way of splitting the db and all of these db-associated files into equal partitions? I guess that I could convert the db to a fasta, split that into equal partitions, and then convert each partition into an mmseqs db (with all associated files), but I'm guessing that there's an easier way.
@nick-youngblut there is a splitdb module in MMseqs2. So you could split your query database like this:
mmseqs splitdb inputDb inputDbSplitted --split 2
ln -s inputDb_h inputDbSplitted_0_2_h
ln -s inputDb_h.index inputDbSplitted_0_2_h.index
ln -s inputDb_h.dbtype inputDbSplitted_0_2_h.dbtype
ln -s inputDb_h inputDbSplitted_1_2_h
ln -s inputDb_h.index inputDbSplitted_1_2_h.index
ln -s inputDb_h.dbtype inputDbSplitted_1_2_h.dbtype
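Each split can then be processed as its own cluster job. A rough sketch (the target database, output names, and thread count here are placeholders):
# one taxonomy run per query split; each line would normally be a separate job
mmseqs taxonomy inputDbSplitted_0_2 targetDB taxResult_0 tmp_0 --threads 12
mmseqs taxonomy inputDbSplitted_1_2 targetDB taxResult_1 tmp_1 --threads 12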
Thanks! That worked for splitting the database; however, I ran into an issue when running mmseqs taxonomy on each split. The error is:
Error: Prefilter died
Error: First search died
Invalid database read for database data file={SPLIT#_OF_MY_SEQ_DATABASE}
Could you post the full log please?
Here's a log from one of the 10 batch jobs: mmseqs_taxonomy.log
I was using the following parameters: -e 1e-5 --max-seqs 200 --lca-ranks "kingdom:phylum:class:order:family:genus:species" --split 4 --threads 12
Any updates on this? I've tried a couple of things, but I still got the "Invalid database read for database data file" error. Do I have to somehow subset the _h and .index files in addition to the database file?
I tried using mmseqs createsubdb to create _h and _h.index files for each split (roughly as sketched below the error output), but that didn't help. I still got the error:
Invalid database read for database data file={DB FILE}
Error: Prefilter died
Error: First search died
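For reference, the createsubdb pattern looked roughly like this (a sketch; the key-list file and database names are placeholders, not my actual paths):
# split_0_ids: a hypothetical file listing the database keys belonging to split 0
mmseqs createsubdb split_0_ids inputDb_h inputDbSplitted_0_2_h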
Sorry for the late answer. Milot and I have some deadlines approaching soon.
I could not reproduce the error. The error indicates that MMseqs2 tries to access an out-of-range offset in the data file /tmp/global2/nyoungblut/LLMGAG_27929269397/clusters_rep-seqs_db_3_1. Could you please check the size of this file? Is there any entry in the second column of /tmp/global2/nyoungblut/LLMGAG_27929269397/clusters_rep-seqs_db_3_1.index that is greater than the data file size?
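A quick way to check (a sketch; stat -c %s assumes GNU coreutils, and the .index columns are key, offset, and entry length):
# data file size in bytes
stat -c %s /tmp/global2/nyoungblut/LLMGAG_27929269397/clusters_rep-seqs_db_3_1
# largest offset (second column) recorded in the index
awk '{ if ($2 > max) max = $2 } END { print max }' /tmp/global2/nyoungblut/LLMGAG_27929269397/clusters_rep-seqs_db_3_1.index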
It turns out that the error was caused by me symlinking the full db .index file along with the other files that I was symlinking. I re-ran mmseqs taxonomy on the splits without symlinking the .index file, and everything worked. Sorry to waste your time on this.
This is a question, not an issue, but I can't find the info on the wiki, and it would be good to know before taking the time to write all of the code.
I'm running mmseqs taxonomy (mmseqs version 7.4e23d) on ~1 million sequences with uniclust90_2018_08 as the database, and even with 24 threads the job takes ~48 hours (and almost 300 GB of memory). I'd like to speed this up if possible, so I was thinking of batching the reads for parallel runs on a compute cluster. I've previously run into software that has problems when multiple jobs use the same database at the same time. Do you know if this is the case for mmseqs taxonomy? I'm guessing that each batch job will still require hundreds of GB of memory (the DB size will still be the same), but hopefully I can at least increase the speed of the overall job.