soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Run it with an already downloaded BLAST database #395

Closed pauldeboissier closed 3 years ago

pauldeboissier commented 3 years ago

Dear colleagues,

I'm Paul DE BOISSIER, a PhD student at the IBDM in Marseille, France. I'm developing a pipeline whose first step is an ortholog search. For that, I have already implemented a reciprocal best hit (RBH) search and access to OrthoDB. My problem is that OrthoDB is not quite complete and the RBH search takes a long time to run. I discussed this with my supervisor, Bianca Habermann, and we think that using MMseqs2 could greatly reduce our running time.

I have read the documentation carefully, but maybe I missed something. My pipeline runs with RefSeq as the main database, especially for the RBH search, with all the files in BLAST format (.pal, .pos, ...). I want to use this existing BLAST database with MMseqs2, but I can't find any option to create an MMseqs2 database from it. I did read about the "databases" command, which downloads a database, and I could use NR or UniProtKB, but I don't want to re-download a whole database because we have limited space on our servers, which are shared among the team.

Do you know how I can do this, please?

Best.

DE BOISSIER Paul
PhD Student - Computational Biology Group
IBDM – Institut de Biologie du Développement de Marseille
paul.de-boissier@univ-amu.fr

milot-mirdita commented 3 years ago

In principle, you can extract a FASTA file from an existing BLAST database (see https://github.com/soedinglab/MMseqs2/wiki#create-a-seqtaxdb-from-an-existing-blast-database). From that FASTA file you can build an MMseqs2 database. MMseqs2 still needs to build its own database, so you will still need enough storage space for it. We don't and can't support BLAST databases directly.
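
As a rough illustration of that two-step route, here is a minimal shell sketch. It assumes `blastdbcmd` (from BLAST+) and `mmseqs` are on the PATH; the function name `blastdb_to_mmseqs`, its arguments, and the output file names are hypothetical and chosen only for this example.

```shell
# Sketch: export a local BLAST database to FASTA, then build an MMseqs2 database.
# The function name, argument names, and output naming are hypothetical.
blastdb_to_mmseqs() {
    blast_db="$1"    # BLAST database name or path prefix (e.g. refseq_protein)
    out_prefix="$2"  # prefix for the FASTA file and the MMseqs2 database

    # Dump every sequence stored in the BLAST database as FASTA.
    blastdbcmd -db "$blast_db" -entry all > "${out_prefix}.fasta" || return 1

    # Convert the FASTA file into MMseqs2's own database format.
    mmseqs createdb "${out_prefix}.fasta" "${out_prefix}_mmseqs" || return 1
}

# Example call (paths are placeholders):
# blastdb_to_mmseqs refseq_protein /data/refseq
```

Once `createdb` has finished, the intermediate FASTA file should no longer be needed by MMseqs2, so deleting it is one way to recover some space.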

etowahadams commented 3 years ago

You'll have to create a FASTA file from your existing BLAST database (using the instructions linked in the previous comment) no matter what. For the NR database, the FASTA file for the entire database is about 160 GB. If you index the NR database (mmseqs createindex), the index takes up an additional 950 GB in my experience.
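
Given sizes like these, it may be worth checking free disk space before building an index. The sketch below is one way to do that with POSIX `df`. The helper name `has_free_gb` and the paths in the usage comment are hypothetical, and the 950 GB threshold is simply the NR figure quoted above, so other databases will need a different number.

```shell
# Sketch: succeed only if the filesystem holding a directory has enough free space.
# The helper name and its arguments are hypothetical.
has_free_gb() {
    dir="$1"       # directory on the target filesystem
    need_gb="$2"   # required free space in GB (1 GB = 1024 * 1024 KB here)

    # df -Pk prints one data line per filesystem; column 4 is available KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example (paths are placeholders; 950 is the NR index size reported above):
# if has_free_gb /data 950; then
#     mmseqs createindex /data/nr_mmseqs /data/tmp
# else
#     echo "Not enough free space for the index" >&2
# fi
```

As far as I know, a precomputed index is optional and mainly speeds up repeated searches, so skipping `createindex` is an option when space is tight.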