soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

mmseqs taxonomy: works with batching? #168

Closed nick-youngblut closed 5 years ago

nick-youngblut commented 5 years ago

This is a question, not an issue, but I can't find the info on the wiki, and it would be good to know before taking the time to write all of the code.

I'm running mmseqs taxonomy (mmseqs version 7.4e23d) on ~1 million sequences with uniclust90_2018_08 as the database, and even with 24 threads, the job takes ~48 hours (and almost 300 GB of memory). I'd like to speed this up if possible, so I was thinking of batching the reads for parallel runs on a compute cluster. I've previously used software that runs into problems if multiple jobs are using the same database at the same time. Do you know if this is the case for mmseqs taxonomy?

I'm guessing that each batch job will still require hundreds of GB of memory (the DB size will still be the same), but hopefully I can at least increase the speed of the overall job.

milot-mirdita commented 5 years ago

Parallelization across compute nodes should work without issue by using MPI/OpenMP hybrid parallelization: https://github.com/soedinglab/MMseqs2/wiki#how-to-run-mmseqs2-on-multiple-servers-using-mpi

(Set the RUNNER environment variable to the mpirun invocation.)

Regarding the memory usage: MMseqs2 will use as much memory as the compute node has available. You can force it to split the target database into chunks to reduce the peak memory usage at the cost of a slight increase in run time. Use either the --split or the --split-memory-limit option for that.

nick-youngblut commented 5 years ago

Thanks for the help! Sadly, I'm probably not going to be able to use MPI. Instead, I'll have to partition the input into batches and run each batch as a separate cluster job. I'm now trying to figure out the best way to do this, given the input files required:

For mmseqs taxonomy, I'm pretty sure that I need the following input files:

Is there an easy way of splitting the db and all of these db-associated files into equal partitions? I guess that I could convert the db to a fasta, split that into equal partitions, and then convert each partition into an mmseqs db (with all associated files), but I'm guessing that there's an easier way.

martin-steinegger commented 5 years ago

@nick-youngblut there is a splitdb module in MMseqs2. So you could split your query database like this:

  mmseqs splitdb inputDb inputDbSplitted --split 2
  ln -s inputDb_h inputDbSplitted_0_2_h 
  ln -s inputDb_h.index inputDbSplitted_0_2_h.index
  ln -s inputDb_h.dbtype inputDbSplitted_0_2_h.dbtype
  ln -s inputDb_h inputDbSplitted_1_2_h 
  ln -s inputDb_h.index inputDbSplitted_1_2_h.index
  ln -s inputDb_h.dbtype inputDbSplitted_1_2_h.dbtype
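
For more than two splits, the symlink commands above can be generated in a loop. This is only a sketch: it assumes the splits follow the same PREFIX_<i>_<N> naming shown above, and the function name link_headers is made up for illustration. It links only the header files (_h, _h.index, _h.dbtype) and leaves each split's own .index alone.

```shell
# link_headers DB PREFIX N   (hypothetical helper, not part of MMseqs2)
# For each of the N splits written by `mmseqs splitdb DB PREFIX --split N`,
# symlink the header files of DB next to the split.
# Assumes split names follow the PREFIX_<i>_<N> pattern used above.
link_headers() {
    db=$1; prefix=$2; n=$3
    i=0
    while [ "$i" -lt "$n" ]; do
        # "" is the header data file itself, then its index and dbtype
        for ext in "" .index .dbtype; do
            ln -s "${db}_h${ext}" "${prefix}_${i}_${n}_h${ext}"
        done
        i=$((i + 1))
    done
}

# Example, equivalent to the six ln commands above:
#   mmseqs splitdb inputDb inputDbSplitted --split 2
#   link_headers inputDb inputDbSplitted 2
```

As with the original commands, the link targets are relative, so this is meant to be run in the directory that holds the database files.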

nick-youngblut commented 5 years ago

Thanks! That worked for splitting the database; however, I ran into an issue when running mmseqs taxonomy on each split. The error is:

Error: Prefilter died
Error: First search died
Invalid database read for database data file={SPLIT#_OF_MY_SEQ_DATABASE}

milot-mirdita commented 5 years ago

Could you post the full log please?

nick-youngblut commented 5 years ago

Here's a log from one of the 10 batch jobs: mmseqs_taxonomy.log

I was using the following parameters: -e 1e-5 --max-seqs 200 --lca-ranks "kingdom:phylum:class:order:family:genus:species" --split 4 --threads 12

nick-youngblut commented 5 years ago

Any updates on this? I've tried a couple of things, but I still got the "Invalid database read for database data file" error. Do I have to somehow subset the _h and .index files in addition to the database file?

nick-youngblut commented 5 years ago

I tried using mmseqs createsubdb to create _h and _h.index files for each split, but that didn't help. I still got the error:

Invalid database read for database data file={DB FILE}
Error: Prefilter died
Error: First search died

martin-steinegger commented 5 years ago

Sorry for the late answer. Milot and I have some deadlines approaching soon.

I could not reproduce the error. The error indicates that MMseqs2 tries to access an out-of-range offset in the data file /tmp/global2/nyoungblut/LLMGAG_27929269397/clusters_rep-seqs_db_3_1. Could you please check the size of this file? Is there any entry in the second column of /tmp/global2/nyoungblut/LLMGAG_27929269397/clusters_rep-seqs_db_3_1.index that is greater than the data file size?
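
The check described above could be scripted roughly as follows. This is a sketch under one assumption: that each .index line is tab-separated as key, offset, length. The function name check_index is hypothetical. It is slightly stricter than comparing the second column alone, since it flags any entry whose offset plus length runs past the end of the data file.

```shell
# check_index DATA INDEX   (hypothetical helper, not part of MMseqs2)
# Prints every index entry that points past the end of DATA and
# returns non-zero if at least one such entry is found.
# Assumes index lines of the form: key <TAB> offset <TAB> length.
check_index() {
    size=$(wc -c < "$1")          # size of the data file in bytes
    awk -v size="$size" '
        $2 + $3 > size + 0 { print; bad = 1 }
        END { exit bad }
    ' "$2"
}

# Example (file names are placeholders):
#   check_index myDb_3_1 myDb_3_1.index || echo "index does not match data file"
```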

nick-youngblut commented 5 years ago

It turns out that the error was caused by me symlinking the full db .index file along with all of the files that I was symlinking. I re-ran mmseqs taxonomy on the splits without symlinking the .index file, and everything worked. Sorry to waste your time on this.
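
For anyone who hits the same error, a quick way to spot this mistake is to list any split whose sequence .index is a symlink rather than a file written by splitdb. A minimal sketch, assuming the PREFIX_<i>_<N> split naming used earlier in this thread; the function name find_linked_indexes is made up for illustration.

```shell
# find_linked_indexes PREFIX N   (hypothetical helper, not part of MMseqs2)
# Prints the sequence .index file of every split that is a symlink.
# The header index (_h.index) is deliberately not checked, since
# symlinking that one to the full database is intended.
find_linked_indexes() {
    prefix=$1; n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        idx="${prefix}_${i}_${n}.index"
        if [ -L "$idx" ]; then
            echo "$idx"
        fi
        i=$((i + 1))
    done
}

# Example: find_linked_indexes inputDbSplitted 2
```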