memory issues with mmseq2 cluster

szimmerman92 commented 5 years ago

HI, I am getting some memory errors when running the cluster module. The amount of memory I have on my cluster is 100GB and the number of threads is 8. When I run the cluster command as below

mmseqs cluster -c 0.9 --min-seq-id 1.0 ${output_folder}/all_samples_oct31_db ${output_folder}/all_seqs_clu_100 ${temp_folder} --remove-tmp-files --threads 8

I get errors "Can not allocate memory"

kmermatcher identifiers_sagata_ebi_Orfleton/linclust_out/all_samples_oct31_db identifiers_sagata_ebi_Orfleton/temp_linclust/2197966930512906334/linclust/5695683676022713592/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out - -alph-size 13 --min-seq-id 1 --kmer-per-seq 21 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 5 --split-memory-limit 0 --include-only-extendab le 0 --skip-n-repeat-kmer 0 --threads 8 --compressed 0 -v 3

Database size: 1032373897 type: Aminoacid

Estimated memory consumption 330652 MB Process file into 3 parts Generate k-mers list for 1 split Can not allocate memory Error: kmermatcher died Error: linclust died

Then when I try and maximize the amount of memory that can be used by adding the --split-memory-limit command like so

mmseqs cluster -c 0.9 --min-seq-id 1.0 ${output_folder}/all_samples_oct31_db ${output_folder}/all_seqs_clu_100 ${temp_folder} --remove-tmp-files --threads 8 --split-memory-limit 100000

I get the error

identifiers_sagata_ebi_Orfleton/temp_linclust/352147678829955415/linclust/9188580091420820903/linclust.sh: line 26: 24508 Killed $RUNNER "$MMSEQS" kmermatcher "$INPUT" "${TMP_PATH}/pref" ${KMERMATCHER_PAR}

Do you know what is wrong? Thank you very much.

Best, Sam

milot-mirdita commented 5 years ago

How many entries are there in the input database (wc -l ${output_folder}/all_samples_oct31_db.index)?

Try setting the memory-limit to about 2/3 of the available RAM (--split-memory-limit 70G).

Also unrelated, but are you sure about --min-seq-id 1.0? It will basically not be able to cluster anything except 100% identical substrings. If this is what you want you might want to also add --cov-mode 1 (See https://github.com/soedinglab/MMseqs2/wiki#how-to-set-the-right-alignment-coverage-to-cluster).

szimmerman92 commented 5 years ago

Thank you for answering so quickly! I have 1,032,373,897 entries in the input database. And setting the memory to 70G worked perfectly. It's now running. I will also look into setting cod-mode to 1. I am running the clustering with several iterations (100%, 95%, 70% .. etc). So doing it with 100% identity will not be my final result. Thank you for the very helpful advice.

One thing I noticed is that, without using the split-memory-limit option the database size was 330.652 MB so the program was splitting the database into 3 files, where it probably should have been splitting it into 4 so no file would be greater than 100G. Could this be a simple error of rounding down instead of rounding up?

Thanks again.

milot-mirdita commented 5 years ago

It's a bit more complicated. We need to keep 16 byte per index entry in memory. Normally this is quite small. However with 1 billion entries this becomes quite considerable and our normal assumptions about how much RAM the core algorithms can use don't work out well anymore.

We will have to introduce something that tracks how much RAM the housekeeping data will need and adjust --split-memory-limit accordingly. We just didn't get around to that yet.

You might also want to look at the cascaded clustering stuff in the wiki: https://github.com/soedinglab/MMseqs2/wiki#cascaded-clustering https://github.com/soedinglab/MMseqs2/wiki#how-to-manually-cascade-cluster

soedinglab / MMseqs2

memory issues with mmseq2 cluster #238