soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Trouble creating db based on nr #701

Closed liamfriar closed 1 year ago

liamfriar commented 1 year ago

Expected Behavior

Starting from nr which was already downloaded, generate all of the required files to run mmseqs taxonomy to assign taxonomy to metagenome assembled genomes.

Current Behavior

nr.fnaDB_mapping is empty (ls -thor reveals it is 0 bytes and mmseqs easy-taxonomy displays "nr.fnaDB_mapping is empty. Rerun createtaxdb to recreate taxonomy mapping.")

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

Download nr

wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
gunzip nr.gz

Make a blastdb and diamond db of nr

makeblastdb -in nr -dbtype prot
diamond makedb --in nr -d nr.dmnd

Prepare taxonomy database for mmseqs2

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir taxonomy && tar -xxvf taxdump.tar.gz -C taxonomy
rm taxdump.tar.gz
blastdbcmd -db nr -entry all > nr.fna
blastdbcmd -db nr -entry all -outfmt "%a %T" > nr.fna.taxidmapping
mmseqs createdb nr.fna nr.fnaDB && \
    mmseqs createtaxdb nr.fnaDB tmp --ncbi-tax-dump taxonomy/ --tax-mapping-file nr.fna.taxidmapping
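Before running createtaxdb, it may help to check whether the mapping file written by blastdbcmd contains usable taxids at all. The following is a minimal sketch, not part of MMseqs2 or BLAST: the function name summarize_taxid_mapping is hypothetical, and it assumes the file has two whitespace-separated columns (accession, taxid) as produced by -outfmt "%a %T". It counts how many lines carry a positive integer taxid.

```shell
# Hypothetical helper: summarise a two-column "accession taxid" mapping file.
# Assumes whitespace-separated columns, as written by blastdbcmd -outfmt "%a %T".
# Prints one line: total=<lines> valid=<lines with a positive integer taxid> invalid=<rest>
summarize_taxid_mapping() {
    awk '
        # A line is valid only if the second column is a positive integer
        NF >= 2 && $2 ~ /^[1-9][0-9]*$/ { valid++ }
        END { printf "total=%d valid=%d invalid=%d\n", NR, valid, NR - valid }
    ' "$1"
}
```

Calling summarize_taxid_mapping nr.fna.taxidmapping and seeing valid=0 would suggest that the BLAST database itself carries no taxonomy information, in which case an empty nr.fnaDB_mapping would not be surprising.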

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.

Context

I want to use MMseqs2 to assign taxonomy to contig-level metagenome-assembled genomes. I had previously downloaded nr as above, so hopefully the way I downloaded it works for MMseqs? I could alternatively re-download nr entirely with the mmseqs workflows, but would prefer not to do that if possible for consistency with analyses that are already done and used the version of nr that I have downloaded. I don't think versioning is actually a problem as I downloaded nr within the past couple months.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

I am using a conda environment created with conda install -c bioconda mmseqs2 on April 19, 2023. Version: 14.7e284 (from conda list in the activated environment).

liamfriar commented 1 year ago

I used mmseqs databases NR, which re-downloaded NR from NCBI and prepared all the required databases; these are now working fine. I'm not sure why the steps above didn't work, but the workflow is moving again.

spoonbender76 commented 1 year ago

Download nr

wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
gunzip nr.gz

Make a blastdb and diamond db of nr

makeblastdb -in nr -dbtype prot

This is probably because the BLAST database contains no taxid mapping information when you build it with makeblastdb without -taxid_map. See "Building a BLAST database with your (local) sequences".
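As a rough sketch of what building with -taxid_map could look like: makeblastdb accepts a text file of "sequence ID, taxid" pairs via -taxid_map, which in turn requires -parse_seqids. The example below assumes you have downloaded NCBI's prot.accession2taxid table (tab-separated columns accession, accession.version, taxid, gi, with a header line). The function name accession2taxid_to_map, the output file nr.taxid_map, and the REBUILD_BLASTDB switch are all hypothetical names chosen for illustration, and whether -parse_seqids copes with every header in your copy of nr is something to verify on your side.

```shell
# Hypothetical helper: convert prot.accession2taxid (tab-separated columns:
# accession, accession.version, taxid, gi; first line is a header) into
# "accession.version taxid" pairs, skipping entries without a positive taxid.
accession2taxid_to_map() {
    awk -F '\t' 'NR > 1 && $3 > 0 { print $2 " " $3 }' "$1"
}

# Rebuild only when explicitly requested (REBUILD_BLASTDB=1 is an assumed
# switch for this sketch) and only if makeblastdb is installed.
if [ "${REBUILD_BLASTDB:-0}" = "1" ] && command -v makeblastdb >/dev/null 2>&1; then
    gunzip -c prot.accession2taxid.gz > prot.accession2taxid
    accession2taxid_to_map prot.accession2taxid > nr.taxid_map
    # -taxid_map requires -parse_seqids
    makeblastdb -in nr -dbtype prot -parse_seqids -taxid_map nr.taxid_map
fi
```

After a rebuild like this, blastdbcmd -db nr -entry all -outfmt "%a %T" should report real taxids instead of placeholders, if the assumptions above hold for your files.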

Alternatively, you could download the preformatted nr (https://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz through https://ftp.ncbi.nlm.nih.gov/blast/db/nr.79.tar.gz; the volume numbers may change over time) and try again. That works for me.