Prefilter doesn't use all the available memory

jpjarnoux commented 3 years ago

Expected Behavior

I'm giving MMseqs2 360G of memory and 36 threads. I expected it to use all of this memory to speed up prefiltering.

Current Behavior

During the prefiltering step, MMseqs2 used only 10% of the memory. I think it is using 80% of the estimated memory consumption, because it computes 48G for the estimated memory consumption, but I'm not sure.

Context

I'm currently trying to create protein families from a large database: I have 50 million proteins to cluster. I wrote a workflow based on Uniclust. I would like to speed up the workflow, particularly the prefilter step, without losing sensitivity. I tried to use --split-memory-limit to increase the memory usage, but it's not working. Is there another solution?

Here is the command line that I would like to speed up: mmseqs search $INPUT $INPUT "$ALIGN_DIR/align" $TMP_DIR --max-seqs 1000 -c 0.8 --comp-bias-corr 1 -s 7 --alignment-mode 3 --min-seq-id 0.3 --threads 36

Your Environment

Architecture:          x86_64
CPU(s):                36
Thread(s) per core:    1
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping:              4

Thanks for your work

milot-mirdita commented 3 years ago

The database is not large enough to use 300GB of RAM (see https://github.com/soedinglab/MMseqs2/wiki#memory-consumption), so it would be expected to use far less. However, if MMseqs2 were using only 30GB of 300GB, that would be weird. Could you post the full log?
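For a sense of scale, here is a rough sketch of such an estimate. It assumes the index-table formula in the linked wiki section is M = 7·N·L + 8·a^k bytes (N sequences, average length L, alphabet size a, k-mer size k). The formula as transcribed here, the average length of 300 residues, and the function name are assumptions for illustration; the sequence counts, the alphabet size of 21 and the k-mer size of 6 are taken from this thread.

```python
def estimate_index_bytes(num_seqs: int, avg_len: float, alphabet: int = 21, k: int = 6) -> float:
    """Index-table size estimate, assuming M = 7*N*L + 8*a^k bytes.

    The formula is an assumption transcribed from the MMseqs2 wiki;
    check the linked page before relying on it.
    """
    return 7 * num_seqs * avg_len + 8 * alphabet ** k


if __name__ == "__main__":
    # 50 million proteins is the full input; 12,187,255 is the count in the later log.
    # avg_len=300 is a hypothetical average length, not a value from this issue.
    for n in (50_000_000, 12_187_255):
        gib = estimate_index_bytes(n, avg_len=300) / 2**30
        print(f"{n:>11,d} sequences -> about {gib:.0f} GiB")
```

Whatever the exact constants, the first term grows linearly with the number of sequences, so a smaller database needs proportionally less memory.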

jpjarnoux commented 3 years ago

The log is very long, so I prefer to give it to you as a file. The search step starts at line 723 or a little after, but I'm including everything.

prefilter /env/cns/bigtmp2/PANFAM/PipelineProteome//CLUST/PANFAM80/panfam_subDB /env/cns/bigtmp2/PANFAM/PipelineProteome//CLUST/PANFAM80/panfam_subDB /env/cns/bigtmp2/PANFAM/PipelineProteome//ALIGN/635041581728617992/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 36 --compressed 0 -v 3 -s 7.0 

Query database size: 12187255 type: Aminoacid
Estimated memory consumption: 42G
Target database size: 12187255 type: Aminoacid
Index table k-mer threshold: 100 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 12.19M 26s 362ms
Index table: Masked residues: 43826477
Index table: fill
[=================================================================] 12.19M 38s 306ms
Index statistics
Entries:          3083105370
DB size:          18129 MB
Avg k-mer size:   48.173521
Top 10 k-mers
    GPGGTL  40332
    GQQVAR  22194
    GEGGVV  20313
    NAIAAG  18525
    YTGTPK  18522
    ALAIAR  16978
    GFVAVR  15587
    GPGGTT  14728
    GEGGTL  13758
    LAMHRT  13125
Time for index table init: 0h 1m 7s 827ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 100
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 12187255
Target db start 1 to 12187255
[======

pipeline.log

Thanks

milot-mirdita commented 3 years ago

Everything seems to be going alright. As this is a later clustering step, fewer sequences (~12M) remain to be clustered. The memory requirement goes down with the number of sequences remaining.
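The numbers in the log excerpt can be checked against the memory given to the job. The sketch below parses two lines copied from the log and compares the estimate with the 360G from the issue description; the function name and dictionary keys are made up for illustration.

```python
import re

# Lines copied from the prefilter log posted above.
LOG_EXCERPT = """\
Query database size: 12187255 type: Aminoacid
Estimated memory consumption: 42G
Target database size: 12187255 type: Aminoacid
"""


def parse_prefilter_log(text: str) -> dict:
    """Extract the target sequence count and the memory estimate (in G)."""
    size = re.search(r"Target database size: (\d+)", text)
    mem = re.search(r"Estimated memory consumption: (\d+)G", text)
    return {"target_seqs": int(size.group(1)), "estimated_g": int(mem.group(1))}


AVAILABLE_G = 360  # memory given to the job, from the issue description

info = parse_prefilter_log(LOG_EXCERPT)
fraction = info["estimated_g"] / AVAILABLE_G
print(f"{info['target_seqs']:,} target sequences, "
      f"estimate {info['estimated_g']}G = {fraction:.1%} of {AVAILABLE_G}G")
```

The estimate comes to roughly 12% of the available memory, which is in line with the "only 10%" observed during prefiltering.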

jpjarnoux commented 3 years ago

Okay, I thought so too, but I wanted to be sure there was no problem.

So there is nothing I can do to save time at this step, except changing the sensitivity or --max-seqs?

milot-mirdita commented 3 years ago

What parameters did you use (the log doesn't show the call to mmseqs (easy-)cluster)? What MMseqs2 version/commit is this (please compile from a git checkout if you compile from source, not by downloading the tar.gz/zip)?

It seems like you are using single-step clustering, which should be much slower than cascaded clustering.

jpjarnoux commented 3 years ago

At this step I just want to run search and use the result to cluster at 30% and 50% identity (after filtering), like Uniclust.

I'm currently using MMSeqs2 version 12.git113e321

When I use cluster, I don't pass the argument --single-step-clustering, so I think I'm doing cascaded clustering.

milot-mirdita commented 3 years ago

I am not sure what exactly you are running currently. Could you make a list of all MMseqs2 commands you are running or link to the script you are running?

Using the Uniclust pipeline doesn't really make sense anymore, since it's extremely slow. You should use multiple separate mmseqs cluster calls with --cluster-reassign at different --min-seq-id levels.
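One possible reading of that suggestion, sketched as a small helper that only builds the command lines and does not run anything. The database names (`proteinDB`, `clu`, `tmp`), the output naming scheme, and running every level on the same input database are assumptions; the identity levels and `-c 0.8` are the ones mentioned in this thread. Check `mmseqs cluster -h` for the exact form of `--cluster-reassign` in your version.

```python
import shlex


def cluster_commands(seq_db, out_prefix, tmp_dir, min_seq_ids, coverage=0.8):
    """Build one `mmseqs cluster` argv per identity level (nothing is executed)."""
    commands = []
    for min_id in min_seq_ids:
        # Hypothetical naming scheme: clu_id80, clu_id50, clu_id30, ...
        cluster_db = f"{out_prefix}_id{round(min_id * 100)}"
        commands.append([
            "mmseqs", "cluster", seq_db, cluster_db, tmp_dir,
            "--min-seq-id", str(min_id),
            "-c", str(coverage),
            "--cluster-reassign",
        ])
    return commands


for argv in cluster_commands("proteinDB", "clu", "tmp", [0.8, 0.5, 0.3]):
    print(shlex.join(argv))
```

Building the argument lists separately makes it easy to review the calls before handing them to `subprocess.run` or pasting them into a job script.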

jpjarnoux commented 3 years ago

Sorry for the delay, I had to work on another project.

The first step is to remove the fragments:

prefilter at --max-seqs 4000 --min-ungapped-score 100 --comp-bias-corr 0 -s 1
rescorediagonal --min-seq-id 0.9 -c 0.95 --cov-mode 1
cluster --cluster-mode 2
createsubdb
clusthash --min-seq-id 0.9
cluster --cluster-mode 2
createsubdb
createsubdb
filterdb
align -c 0.9 --alignment-mode 2 --min-seq-id 0.9 --comp-bias-corr 0
cluster --cluster-mode 2
createsubdb

The second step is to create families at 80% coverage and 80% identity:

mmseqs cluster --max-seqs 300 -c 0.8 --comp-bias-corr 1 -s 4 --kmer-per-seq 80 --alignment-mode 2 --min-seq-id 0.8 -e 0.001 --max-seq-len 32768 --max-rejected 2147483647 --cluster-mode 0
mmseqs createsubdb

The next step is to align at 30% identity and 80% coverage:

mmseqs search --max-seqs 1000 -c 0.8 --comp-bias-corr 1 -s 7 --alignment-mode 3 --min-seq-id 0.3
mmseqs createsubdb

Finally, I cluster at 50% identity and at 30% identity:

mmseqs filterdb --filter-column 3 --filter-regex '(0\.[5-9][0-9]{2}|1\.000)'
mmseqs clust --cluster-mode 0 (at 50%)
mmseqs clust --cluster-mode 0 (at 30%)
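The regular expression in the filterdb call above can be checked in isolation. The sketch below tests a few identity values against the same pattern, assuming the filtered column holds identities printed with three decimals (as the pattern implies) and that the whole field must match; both are assumptions about the data, not confirmed behaviour of filterdb.

```python
import re

# Pattern copied from the filterdb call; the raw string keeps the backslashes intact.
IDENTITY_AT_LEAST_50 = re.compile(r"(0\.[5-9][0-9]{2}|1\.000)")


def keeps(identity_field: str) -> bool:
    """True if the whole field matches (full-match semantics are an assumption)."""
    return IDENTITY_AT_LEAST_50.fullmatch(identity_field) is not None


for value in ["0.300", "0.499", "0.500", "0.873", "1.000"]:
    print(value, "kept" if keeps(value) else "dropped")
```

Note that the pattern depends on the three-decimal formatting: a value written as `0.5` or `1` would not match.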

I hope this is clear and that you can help me improve it.

Thanks
