[Open] susheelbhanu opened this issue 2 years ago

Hi,
I have ~150 samples of protein FASTA files from which I'd like to generate a non-redundant set of proteins. I have tried concatenating the files and running the clustering on the result. However, the concatenated file contains 400 million genes, making it computationally infeasible. What may be the best approach here? Also, when re-clustering after an initial round, is it possible to output the non-redundant FASTA file?
Thank you for your input! -Susheel
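On the last question: as far as the documented easy workflows go (the maintainers don't address this point below, so treat this as a pointer rather than a confirmed answer), mmseqs easy-cluster already writes the representative sequences as a FASTA file next to the cluster table. The file names below use placeholder inputs and the result prefix "clu":

```
# With result prefix "clu", mmseqs easy-cluster writes (among others):
#   clu_rep_seq.fasta   one representative per cluster, i.e. the non-redundant set
#   clu_cluster.tsv     representative-to-member mapping
mmseqs easy-cluster input.fasta clu tmp
```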
400 million is not a big problem for linclust; it should run on any reasonably sized server in a day or so. We have clustered billions of sequences with linclust previously.
However, it won't reach 30% sequence identity; for that you will need the normal clustering workflow. That will run linclust first and then use the MMseqs2 search algorithms to cluster further. That might run for a few days or weeks.
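For concreteness, a minimal sketch of the two entry points; the thread doesn't record the exact parameters used, so the identity and coverage values here are illustrative:

```
# Linear-time clustering only; fast, but won't go down to 30% identity
mmseqs easy-linclust input.fasta clu_lin tmp --min-seq-id 0.5 -c 0.8

# Cascaded clustering: runs linclust first, then slower MMseqs2 search-based steps
mmseqs easy-cluster input.fasta clu tmp --min-seq-id 0.3 -c 0.8
```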
@milot-mirdita I'm running into segmentation fault issues when running with two nodes of 128 CPUs each (448 GB total memory). Further review revealed that I'm running out of memory.
Any recommendations on what kind of reasonably sized server you are referring to? And by 'normal clustering', do you mean mmseqs cluster?
I'm trying to reproduce the cascaded clustering approach described here: https://elifesciences.org/articles/67667#bib118. Which command would that be?
Thank you!
Can you try to run it on a single node (without MPI, etc.)? Issues in MPI support might have gone unnoticed since we switched to 128-core machines.
Yes, I mean mmseqs (easy-)cluster with normal clustering. That one should eventually finish successfully on a single one of these compute nodes.
Can you please post the full log output? Maybe something else went wrong.
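Assuming the multi-node run used the RUNNER mechanism documented for MMseqs2's MPI mode (the actual submission line isn't shown in this thread), the single-node test just means dropping that variable:

```
# Presumed multi-node invocation via MPI (illustrative)
RUNNER="mpirun -np 2" mmseqs easy-cluster input.fasta clu tmp --min-seq-id 0.3 -c 0.8

# Single-node test: no RUNNER, plain OpenMP threads on one machine
mmseqs easy-cluster input.fasta clu tmp --min-seq-id 0.3 -c 0.8 --threads 128
```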
Okay, will give this a go and report back. Thank you!!
@milot-mirdita Below is the log file from one of the runs. It looks like it's running out of memory before the job dies.
And here is the job efficiency report from SLURM
Job ID: 359779
Cluster: aion
User/Group: sbusi/clusterusers
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 128
CPU Utilized: 10:13:07
CPU Efficiency: 1.00% of 42-16:44:48 core-walltime
Job Wall-clock time: 08:00:21
Memory Utilized: 206.26 GB
Memory Efficiency: 92.08% of 224.00 GB
Do you think merely providing more cores will do the trick, or is there something else that I'm missing?
Thank you!
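One knob that seems relevant here, though it isn't raised explicitly in this exchange: MMseqs2 can cap the memory used during prefiltering with --split-memory-limit, splitting the database into chunks that fit the limit instead of exhausting RAM. A hedged sketch; the limit value is an assumption and should leave headroom below the SLURM allocation:

```
# Cap prefilter/index memory at ~180 GB on a 224 GB node (value illustrative)
mmseqs easy-cluster input.fasta clu tmp --min-seq-id 0.3 -c 0.8 --split-memory-limit 180G
```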
UPDATE: Tried the run with more cores, spread across 6 nodes. I didn't really expect it to work given your last comment, but it was worth a shot.
Job ID: 360184
Cluster: aion
User/Group: sbusi/clusterusers
State: OUT_OF_MEMORY (exit code 0)
Nodes: 6
Cores per node: 128
CPU Utilized: 09:26:34
CPU Efficiency: 0.23% of 172-18:08:00 core-walltime
Job Wall-clock time: 05:23:55
Memory Utilized: 1.21 TB (estimated maximum)
Memory Efficiency: 92.41% of 1.31 TB (1.75 GB/core)
Update: Managed to run the clustering successfully on a full 3 TB node with 112 threads. The SLURM efficiency output is below:
Job ID: 2976046
Cluster: iris
User/Group: sbusi/clusterusers
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 112
CPU Utilized: 73-15:37:58
CPU Efficiency: 21.99% of 334-22:48:00 core-walltime
Job Wall-clock time: 2-23:46:30
Memory Utilized: 197.78 GB
Memory Efficiency: 6.70% of 2.88 TB
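For reference, a submission script consistent with that successful run might look like the following; the actual script isn't posted in the thread, so the paths, thresholds, time limit, and memory request are assumptions:

```
#!/bin/bash -l
#SBATCH --job-name=mmseqs-cluster
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=112
#SBATCH --mem=0                 # request all memory on the big-memory node
#SBATCH --time=3-12:00:00       # the run above took ~3 days wall-clock

# Paths and identity/coverage thresholds are placeholders
mmseqs easy-cluster all_proteins.fasta clu tmp \
    --min-seq-id 0.3 -c 0.8 --threads ${SLURM_CPUS_PER_TASK}
```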
I'm running into a similar issue, but with contigs. Samples with even only a handful of contigs larger than ~200,000 bp seem to crash mmseqs easy-cluster with an out-of-memory segmentation fault. I'm similarly seeing poor memory efficiency:
Job ID: 1002827
Cluster: tinkercliffs
User/Group: clb21565/clb21565
State: FAILED (exit code 1)
Nodes: 1
Cores per node: 128
CPU Utilized: 03:00:21
CPU Efficiency: 25.62% of 11:44:00 core-walltime
Job Wall-clock time: 00:05:30
Memory Utilized: 13.43 GB
Memory Efficiency: 5.59% of 240.00 GB
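An untested suggestion for the long-contig case, not confirmed by the maintainers in this thread: MMseqs2 caps sequence length at 65535 by default via --max-seq-len, which contigs of ~200,000 bp exceed, so raising the cap past the longest contig and bounding index memory explicitly is the obvious first experiment:

```
# Raise the per-sequence length cap and bound index memory
# (both values are illustrative, not taken from this thread)
mmseqs easy-cluster contigs.fasta clu tmp \
    --max-seq-len 250000 --split-memory-limit 200G
```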