soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

mmseqs easy-cluster stuck at prefilter stage for multiple days #558

Open aovergard opened 2 years ago

aovergard commented 2 years ago

Expected Behavior

Unsure

Current Behavior

Clustering of a large fasta file has been stuck at the prefilter stage for multiple days (>5 days).

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output. See attached: feb_viral-U-RVDBvCurrent_k11.txt

Context

I have been attempting to use mmseqs to cluster a large fasta file (~75 GB) containing sequences that range in length from a few hundred to several thousand bp. I have run mmseqs on smaller datasets of similar composition, and those completed successfully within a few days. However, this much larger dataset is stuck in prefiltering. Memory usage on my system is at ~50%, but all 32 CPUs are maxed out.

This is the command I invoked:

/usr/bin/time mmseqs easy-cluster /work1/DB_Build/files/RVDB_DBs/feb_viral-U-RVDBvCurrent.fasta feb_viral-U-RVDBvCurrent tmp_feb --min-seq-id 0.98 -k 11 > feb_viral-U-RVDBvCurrent_k11.txt

Can you suggest some changes I should make to my command in order to complete this job? Would you suggest a different clustering approach (e.g. linclust or easy-linclust)? Also, I think -k 11 is likely too stringent given the large variation in sequence lengths. Would you suggest increasing the -k flag (to 35, for example) or using the --kmer-per-seq-scale option?

Thank you for any help you can provide - I appreciate it!

Your Environment

Include as many relevant details about the environment you experienced the bug in.

mmseqs version: 13.45111
OS: CentOS Linux 7
CPU: 32
Mem: 200G

feb_viral-U-RVDBvCurrent_k11.txt

milot-mirdita commented 2 years ago

This was also a while ago. For your use case, I would only call easy-linclust. At a sequence identity threshold of 98%, you won't benefit from the deeper clustering that easy-cluster performs, and easy-linclust should run pretty quickly. I would leave the k-mer size at the default value.
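A minimal sketch of what that suggestion could look like on the command line, reusing the input path, output prefix, and --min-seq-id value from the original command. The tmp folder name and log file name are hypothetical, chosen only so the run starts from an empty tmp folder, and -k is left out so that the default k-mer size applies. The script prints the command and only runs it when mmseqs and the input file are actually available.

```shell
#!/bin/sh
# Input and output prefix are copied from the original easy-cluster command.
INPUT=/work1/DB_Build/files/RVDB_DBs/feb_viral-U-RVDBvCurrent.fasta
PREFIX=feb_viral-U-RVDBvCurrent
# Hypothetical names: a fresh tmp folder and a new log file for this run.
TMP=tmp_feb_linclust
LOG=feb_viral-U-RVDBvCurrent_linclust.txt

# Same identity threshold as before; -k is omitted so the default k-mer size is used.
CMD="mmseqs easy-linclust $INPUT $PREFIX $TMP --min-seq-id 0.98"
echo "$CMD"

# Run only when mmseqs is installed and the input file is present.
if command -v mmseqs >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    $CMD > "$LOG"
fi
```

If the run completes, the cluster assignments should be written to feb_viral-U-RVDBvCurrent_cluster.tsv and the representative sequences to feb_viral-U-RVDBvCurrent_rep_seq.fasta, following the usual easy-workflow output naming.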