soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

RAM Consumption #808

Open pbelmann opened 8 months ago

pbelmann commented 8 months ago

Hi, I am using mmseqs search and I want to estimate the peak RAM consumption for my UniRef90 database. I found a formula for the prefiltering step in your wiki and noticed that either your example or your formula is incorrect:

M = (7 * N * L + 8 * a^k) byte
where
   N = Number of sequences
   L = Average sequence length
   a = alphabet size
   k = k-mer size

You wrote that a UniProtKB database with 55 million sequences and an average sequence length of 350 requires about 71 GB of RAM. However, when I calculate just the first term of your formula (7NL), I get 134.75 GB. Can you tell me whether the formula or the example is wrong?
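For reference, here is that arithmetic spelled out (a minimal Python sketch; a = 21 and k = 7 are my assumptions, since the wiki example does not state them):

```python
# Prefilter memory estimate from the wiki: M = (7*N*L + 8*a^k) bytes.
N = 55_000_000          # sequences in the UniProtKB example
L = 350                 # average sequence length
a = 21                  # amino acid alphabet size (assumed)
k = 7                   # k-mer size (assumed)

index_part = 7 * N * L  # per-residue index entries
table_part = 8 * a**k   # k-mer lookup table

print(index_part / 1e9)                 # 134.75 GB -- the first term alone
print(table_part / 1e9)                 # ~14.41 GB
print((index_part + table_part) / 1e9)  # ~149.16 GB total, well above 71 GB
```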

milot-mirdita commented 5 months ago

I think we computed the wrong thing at some point and never updated the number. ~130 GB sounds about right.

The function that computes a more accurate memory estimate can be found in the code: https://github.com/soedinglab/MMseqs2/blob/d4841a8efad066e9758b6626cc64c5ef5ee53055/src/prefiltering/Prefiltering.cpp#L1069

You will still find the same two parts as listed above. However, on modern machines the largest chunk of memory is now the per-thread memory. A dual-socket 64-core CPU machine with hyper-threading will try to use about 500 GB of total RAM in per-thread memory. Thus, it's usually a good idea not to use hyper-threading with MMseqs2, as it brings only minor speed benefits for a large increase in memory use.
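To make the thread term concrete, here is a rough back-of-envelope in Python (a sketch only, not the actual computation from Prefiltering.cpp; the ~2 GB per thread is inferred from the 500 GB figure above divided by the 256 hyper-threads of a dual-socket 64-core machine):

```python
# Rough prefilter RAM estimate including a per-thread term (sketch only;
# the real estimate is computed in src/prefiltering/Prefiltering.cpp).
def prefilter_ram_gb(n_seqs, avg_len, alphabet=21, kmer=7,
                     threads=256, per_thread_gb=2.0):
    index_gb = 7 * n_seqs * avg_len / 1e9  # sequence index
    table_gb = 8 * alphabet**kmer / 1e9    # k-mer lookup table
    thread_gb = threads * per_thread_gb    # thread-local buffers (inferred)
    return index_gb + table_gb + thread_gb

# UniProtKB example: hyper-threading on (256 threads) vs. off (128 threads)
print(prefilter_ram_gb(55_000_000, 350, threads=256))  # ~661 GB
print(prefilter_ram_gb(55_000_000, 350, threads=128))  # ~405 GB
```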

uros-sipetic commented 5 months ago

Hi, may I ask a similar question? I am using easy-linclust, i.e. my command line is:

`mmseqs easy-linclust input.fasta clusterResult tmp --min-seq-id 0.95 -c 0.95`

Do I use the same formula to calculate RAM usage? My FASTA file is 1 TB in size, has 2B sequences, the average sequence length is 650 base pairs, and the alphabet size is 4. I tried running this on an instance with 200 GB of RAM, and it failed after more than 4 days. I assume that is where a step in the pipeline comes that requires more RAM, but it is expensive for me to keep trying different setups (both money-wise and time-wise). Can I use (or should I use) the --split-memory-limit option here as well? These were my instance resources for the run described above:

[screenshot of instance resources]
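For context, applying the search-prefilter formula from above to these numbers gives the following (a rough sketch only; easy-linclust does not run the same prefilter, so its memory behavior differs, and k = 15 is just an assumed nucleotide k-mer size):

```python
# Search-prefilter formula applied to my data (ballpark only; not an
# exact estimate for the linclust pipeline).
N = 2_000_000_000   # 2B sequences
L = 650             # average length in base pairs
a = 4               # nucleotide alphabet size
k = 15              # assumed nucleotide k-mer size

print(7 * N * L / 1e12)  # ~9.1 TB for the 7*N*L term alone
print(8 * a**k / 1e9)    # ~8.6 GB for the k-mer table
```

Even the first term alone is far above 200 GB, which may explain the failure. If --split-memory-limit applies to the linclust k-mer matching stage the way it does to search prefiltering, something like `--split-memory-limit 150G` would leave headroom for per-thread buffers.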