soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Seeking input on clustering multiple samples #608

Open · susheelbhanu opened 2 years ago

susheelbhanu commented 2 years ago

Hi,

I have ~150 samples of protein FASTA files from which I'd like to generate a non-redundant set of proteins. I tried concatenating the files and running the following:

mmseqs easy-linclust --cov-mode 0 -c 0.8 --min-seq-id 0.3 all_nomis_proteins.faa mmseqs2_output tmp

However, the concatenated file has ~400 million genes, making this computationally infeasible. What may be the best approach here?

  1. Cluster smaller sets of samples and then re-cluster?
  2. For point 1 above, can one provide multiple inputs, or should the smaller sets be concatenated?
  3. For re-clustering after an initial round, is it possible to output the non-redundant FASTA file?
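
For reference, the MMseqs2 easy workflows accept several FASTA files as positional inputs, and easy-linclust writes the cluster representatives to `<prefix>_rep_seq.fasta`. A hedged sketch of points 2 and 3 (all file and prefix names here are placeholders):

```shell
# Batch clustering sketch: the easy workflows take multiple FASTA files,
# so smaller sample sets can be clustered without manual concatenation.
# sampleA.faa, sampleB.faa, and the output prefixes are placeholder names.
mmseqs easy-linclust sampleA.faa sampleB.faa batch1_out tmp \
    --cov-mode 0 -c 0.8 --min-seq-id 0.3

# easy-linclust writes the non-redundant representatives to
# batch1_out_rep_seq.fasta; those files can feed a second clustering round.
mmseqs easy-linclust batch1_out_rep_seq.fasta batch2_out_rep_seq.fasta \
    final_out tmp --cov-mode 0 -c 0.8 --min-seq-id 0.3
```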

Thank you for your input! -Susheel

milot-mirdita commented 2 years ago

400 million sequences is not a big problem for linclust; it should run on any reasonably sized server in a day or so. We have previously clustered billions of sequences with linclust.

However, it won't reach 30% sequence identity; for that you will need the normal clustering workflow, which runs linclust first and then uses the MMseqs2 search algorithms to cluster further. That might run for a few days to weeks.
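
Assuming the same parameters as in the original command, the normal (cascaded) workflow would be invoked roughly like this; the output prefix is a placeholder:

```shell
# easy-cluster runs the cascaded workflow: linclust first, then further
# rounds of MMseqs2 searches to merge clusters down to the requested
# identity threshold. cluster_out is a placeholder output prefix.
mmseqs easy-cluster all_nomis_proteins.faa cluster_out tmp \
    --cov-mode 0 -c 0.8 --min-seq-id 0.3
```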

susheelbhanu commented 2 years ago

@milot-mirdita I'm running into segmentation faults when running on 2 nodes with 128 CPUs each (448 GB total memory). Further review revealed that I'm running out of memory.

Any recommendations on what kind of reasonably sized server you are referring to? And by 'normal clustering', do you mean mmseqs cluster?

I'm trying to reproduce the cascaded-clustering approach described here: https://elifesciences.org/articles/67667#bib118. Is that the workflow you mean?

Thank you!

milot-mirdita commented 2 years ago

Can you try to run it on a single node (without MPI, etc.)? Issues in MPI support might have gone unnoticed since we switched to 128-core machines.

Yes, I mean mmseqs (easy-)cluster, i.e. the normal clustering workflow. That should also eventually finish successfully on a single one of these compute nodes.

Can you please post the full log output? Maybe something else went wrong.
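
If memory turns out to be the limiting factor on a single node, the `--split-memory-limit` option can cap the prefilter's memory use by processing the database in splits. A sketch, with an illustrative limit (the value should be set below the node's available RAM):

```shell
# --split-memory-limit caps prefilter memory by splitting the database;
# 200G is illustrative for a 224 GB node. cluster_out is a placeholder
# output prefix; --threads matches the 128-core node described above.
mmseqs easy-cluster all_nomis_proteins.faa cluster_out tmp \
    --cov-mode 0 -c 0.8 --min-seq-id 0.3 \
    --split-memory-limit 200G --threads 128
```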

susheelbhanu commented 2 years ago

Okay, will give this a go and report back. Thank you!!

susheelbhanu commented 2 years ago

@milot-mirdita Below is the log file from one of the runs. It looks like it's running out of memory before the job dies.

(screenshot of the out-of-memory log output attached)

chunk00_clustering_stdout.log

And here is the job efficiency report from SLURM

Job ID: 359779
Cluster: aion
User/Group: sbusi/clusterusers
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 128
CPU Utilized: 10:13:07
CPU Efficiency: 1.00% of 42-16:44:48 core-walltime
Job Wall-clock time: 08:00:21
Memory Utilized: 206.26 GB
Memory Efficiency: 92.08% of 224.00 GB

Do you think merely providing more cores will do the trick, or is there something else that I'm missing?

Thank you!

UPDATE: Tried the run with more cores across 6 nodes. I didn't really expect it to work given your last comment, but it was worth a shot:

Job ID: 360184
Cluster: aion
User/Group: sbusi/clusterusers
State: OUT_OF_MEMORY (exit code 0)
Nodes: 6
Cores per node: 128
CPU Utilized: 09:26:34
CPU Efficiency: 0.23% of 172-18:08:00 core-walltime
Job Wall-clock time: 05:23:55
Memory Utilized: 1.21 TB (estimated maximum)
Memory Efficiency: 92.41% of 1.31 TB (1.75 GB/core)

susheelbhanu commented 2 years ago

Update: Managed to successfully run the clustering on a full 3 TB node with 112 threads. The SLURM efficiency output is below:

Job ID: 2976046
Cluster: iris
User/Group: sbusi/clusterusers
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 112
CPU Utilized: 73-15:37:58
CPU Efficiency: 21.99% of 334-22:48:00 core-walltime
Job Wall-clock time: 2-23:46:30
Memory Utilized: 197.78 GB
Memory Efficiency: 6.70% of 2.88 TB

clb21565 commented 2 years ago

I'm running into a similar issue, but with contigs. Samples with even only a handful of contigs larger than ~200,000 bp seem to crash mmseqs easy-cluster with an out-of-memory segmentation fault. I'm similarly seeing poor memory efficiency:

Job ID: 1002827
Cluster: tinkercliffs
User/Group: clb21565/clb21565
State: FAILED (exit code 1)
Nodes: 1
Cores per node: 128
CPU Utilized: 03:00:21
CPU Efficiency: 25.62% of 11:44:00 core-walltime
Job Wall-clock time: 00:05:30
Memory Utilized: 13.43 GB
Memory Efficiency: 5.59% of 240.00 GB