Prefilter (second round) dies during taxonomic classification with UniRef90 DB #838

sean-workman commented 6 months ago

Expected Behavior

Taxonomic classification of contigs in my metagenomic assembly using the UniRef90 database.

Current Behavior

After a first round of prefilter, rescorediagonal is executed, some merge steps are executed, new tmp directories are created, and the program dies partway through the second round of prefilter.

Steps to Reproduce (for bugs)

Downloaded the UniRef90 database with wget: wget

Decompressed with gunzip, then ran createdb: mmseqs createdb uniref90.fasta uniref90

Augmented with taxonomic information (used --tax-db-mode 0 because createbintaxonomy kept crashing as well): mmseqs createtaxdb uniref90 tmp --tax-db-mode 0

Created database for my query sequences: mmseqs createdb KLEB_PO07_megahit.fasta KLEB_PO07_megahitDB

Ran mmseqs taxonomy on cluster with slurm script:

#!/usr/bin/env bash

#SBATCH --job-name=KLEB_PO07_mmseqs
#SBATCH --cpus-per-task=32
#SBATCH --mem=150G
#SBATCH --time=0-3:00
#SBATCH --output=KLEB_PO07_mmseqs.log
#SBATCH --error=KLEB_PO07_mmseqs.err

module load mmseqs2/15-6f452

mmseqs taxonomy KLEB_PO07_megahitDB $taxDB KLEB_PO07_megahit_result tmp

MMseqs Output (for bugs)

Full output can be found in this gist.

I also see this output in my error file: tmp/1193166584733320518/tmp_taxonomy/17149912652888480377/tmp_hsp1/10699950925961740214/ line 135: 8379 Bus error (core dumped) $RUNNER "$MMSEQS" prefilter "$INPUT" "$TARGET" "$TMP_PATH/pref_$STEP" $PREFILTER_PAR -s "$SENS"


I created metagenomic assemblies using megahit and metaSPAdes. I am trying to get MMseqs2 working to do taxonomic classification. I am running on Digital Research Alliance of Canada clusters.

Your Environment

I ran lscpu on a login node and got what is shown below, but the memory and CPUs that I had for the job were specified in the slurm job script shown above.

sean-workman commented 6 months ago

I tried re-running with 250 GB RAM requested and 32 threads specified. It is now telling me it would need 717 G??

Create directory tmp
taxonomy KLEB_PO07_megahitDB /home/sdwork/scratch/metagenomics/uniref_db/uniref90 KLEB_PO07_megahit_result tmp --threads 32

sean-workman commented 6 months ago

Trying with the easy-taxonomy workflow got me further, but after two rounds of prefiltering I ended up getting:

Error: Lca died
Error: taxonomy died
Error: Search died

Full MMseqs2 output logfile is here

The gdb output says:

Core was generated by `mmseqs lca /home/sdwork/scratch/metagenomics/uniref_db/uniref90 ez_tmp/88137780'.
Program terminated with signal SIGBUS, Bus error.
#0  0x00002adfa621cbf4 in ?? () from /cvmfs/

sean-workman commented 5 months ago

I managed to solve my own problem and it ended up being something very silly.

When using the easy-taxonomy workflow and getting to:

Error: Lca died
Error: taxonomy died
Error: Search died

My error output showed that my DB_mapping was empty. It was empty because the awk command that populates it didn't find any matches between the DB.lookup and taxidmapping: the UniProt IDs in the DB.lookup were prefixed with UniRef90_. I guess if I had used the full databases workflow, that prefix might have been removed, but because I needed to do things manually (the compute nodes on my cluster have no internet connection), it wasn't.
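For anyone hitting the same thing, the mismatch can be reproduced with toy files. This is only a sketch: the file layouts (DB.lookup as internal ID, accession, file number; taxidmapping as accession, taxon ID), the sample values, and the awk join itself are assumptions for illustration, not the exact command from the createtaxdb script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-ins for the real files; contents are made up for illustration.
workdir="$(mktemp -d)"
# Assumed DB.lookup layout: internal id <TAB> accession <TAB> file number
printf '0\tUniRef90_P12345\t0\n1\tUniRef90_Q67890\t0\n' > "$workdir/DB.lookup"
# Assumed taxidmapping layout: accession <TAB> NCBI taxon id
printf 'P12345\t9986\nQ67890\t562\n' > "$workdir/taxidmapping"

# Join on the raw accession: "UniRef90_P12345" never equals "P12345",
# so nothing matches and the resulting mapping is empty.
awk -F'\t' 'NR == FNR { tax[$1] = $2; next }
    $2 in tax { print $1 "\t" tax[$2] }' \
    "$workdir/taxidmapping" "$workdir/DB.lookup" > "$workdir/DB_mapping.naive"

# Same join, but strip the UniRef90_ prefix before the lookup.
awk -F'\t' 'NR == FNR { tax[$1] = $2; next }
    { acc = $2; sub(/^UniRef90_/, "", acc) }
    acc in tax { print $1 "\t" tax[acc] }' \
    "$workdir/taxidmapping" "$workdir/DB.lookup" > "$workdir/DB_mapping"

# Count entries in each result so the difference is visible.
naive_lines="$(wc -l < "$workdir/DB_mapping.naive" | tr -d ' ')"
fixed_lines="$(wc -l < "$workdir/DB_mapping" | tr -d ' ')"
echo "naive: $naive_lines entries, prefix-stripped: $fixed_lines entries"
```

Checking that the _mapping file is non-empty before launching the taxonomy job would have caught this much earlier. Entries whose representative is a UniParc record (IDs starting with UniRef90_UPI) would likely still go unmatched after stripping the prefix.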

Things are working great now! Thanks for this software!

milot-mirdita commented 5 months ago

Sorry, didn't get around to looking at this. Glad it works now. The "intended" way to do this would have been to use the databases workflow to download and create the database.

It has its own handling of uniref (and uniprot) based headers, and should be generally slightly better, since it directly uses the information in the header, instead of going through the idmapping.

This is the code it executes to make the _mapping:

Afterwards, createtaxdb is called to set up the _taxonomy, which basically contains the NCBI taxdump.

sean-workman commented 5 months ago

No worries! Always a good exercise to figure things out myself. I'm sure you're very busy, and this was a problem of my own making from not using the intended workflow. I did try to use the databases workflow initially, but unfortunately the login nodes on the cluster I am using, which are the ones with an internet connection, don't have the resources to deal with the size of the databases I wanted to use.

In the future I'll look to find a better workaround. With metabuli I just downloaded the pre-built database. I don't know if the resources for this are available but perhaps it would be worthwhile to do a similar thing here? Either way, thanks again for providing this excellent resource and good luck with CASP16! :)

milot-mirdita commented 5 months ago

We have also moved to prebuilt dbs for foldseek. I don't think we would be able to keep up with the two-month release cycle of uniref/uniprot though, so probably no prebuilt databases for MMseqs2.

Thanks a lot!