Open jpjarnoux opened 3 years ago
The database is not large enough to use 300GB of RAM (see https://github.com/soedinglab/MMseqs2/wiki#memory-consumption), so it is expected to use far less. However, if MMseqs2 were using only 30GB of 300GB, that would be odd. Could you post the full log?
The log is very long, so I would rather attach all of it as a file. The search step starts at around line 723, but I am including everything.
prefilter /env/cns/bigtmp2/PANFAM/PipelineProteome//CLUST/PANFAM80/panfam_subDB /env/cns/bigtmp2/PANFAM/PipelineProteome//CLUST/PANFAM80/panfam_subDB /env/cns/bigtmp2/PANFAM/PipelineProteome//ALIGN/635041581728617992/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 1000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 36 --compressed 0 -v 3 -s 7.0
Query database size: 12187255 type: Aminoacid
Estimated memory consumption: 42G
Target database size: 12187255 type: Aminoacid
Index table k-mer threshold: 100 at k-mer size 6
Index table: counting k-mers
[=================================================================] 12.19M 26s 362ms
Index table: Masked residues: 43826477
Index table: fill
[=================================================================] 12.19M 38s 306ms
Index statistics
Entries: 3083105370
DB size: 18129 MB
Avg k-mer size: 48.173521
Top 10 k-mers
GPGGTL 40332
GQQVAR 22194
GEGGVV 20313
NAIAAG 18525
YTGTPK 18522
ALAIAR 16978
GFVAVR 15587
GPGGTT 14728
GEGGTL 13758
LAMHRT 13125
Time for index table init: 0h 1m 7s 827ms
Process prefiltering step 1 of 1
k-mer similarity threshold: 100
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 12187255
Target db start 1 to 12187255
[======
pipeline.log
Thanks
Everything seems to be going alright. As this is a later clustering step, fewer sequences (~12M) remain to be clustered. The memory requirement goes down with the number of sequences remaining.
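To make the point about memory shrinking with the number of sequences concrete, here is a rough sketch of the rule of thumb from the memory-consumption section of the wiki linked above, as I recall it: index size M = (7 × N × L) bytes + (8 × a^k) bytes, with N sequences of average length L, alphabet size a and k-mer size k. The function name and the average length of 300 residues are hypothetical; a = 21 and k = 6 are taken from the log. It does not include per-thread buffers, so it will not reproduce the "Estimated memory consumption" line exactly.

```python
def index_memory_bytes(num_sequences, avg_length, alphabet_size=21, kmer_size=6):
    """Rule-of-thumb prefilter index size in bytes (constants assumed from the wiki)."""
    residue_part = 7 * num_sequences * avg_length   # grows linearly with residues
    table_part = 8 * alphabet_size ** kmer_size     # fixed cost of the k-mer table
    return residue_part + table_part


# Hypothetical average length; only the sequence counts come from the thread.
AVG_LEN = 300
for label, n in [("full set (50 million)", 50_000_000),
                 ("later step (12187255)", 12_187_255)]:
    gib = index_memory_bytes(n, AVG_LEN) / 1024 ** 3
    print(f"{label}: about {gib:.1f} GiB for the index")
```

Under these assumptions the residue term dominates, which is why a later step with fewer remaining sequences needs much less memory than the first one.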
Okay, I thought so too, but I wanted to be sure there was no problem.
So there is nothing I can do to save time at this step, other than changing the sensitivity or --max-seqs?
What parameters did you use (the log doesn't show the call to mmseqs (easy-)cluster)? What MMseqs2 version/commit is this (please compile from a git checkout if you compile from source, not by downloading the tar.gz/zip)?
It seems like you are using the single-step clustering, which should be much slower than the cascaded clustering.
At this step I just want to run search and use the result to cluster at 50% identity (after filtering) and at 30% identity, like Uniclust.
I'm currently using MMSeqs2 version 12.git113e321
When I use cluster I don't use the argument --single-step-clustering, so I think I'm doing a cascaded clustering.
I am not sure what exactly you are running currently. Could you make a list of all MMseqs2 commands you are running or link to the script you are running?
Using the Uniclust pipeline doesn't really make sense anymore, since it's extremely slow.
You should use multiple separate mmseqs cluster calls with --cluster-reassign at different --min-seq-id levels.
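A minimal sketch of what that suggestion could look like, written as a small Python helper that only builds the command lines and does not run them. The identity levels 0.8, 0.5 and 0.3 and the coverage of 0.8 come from this thread; the helper name, the database names, reusing the same coverage at every level, and feeding each level's representatives (via createsubdb) into the next call are assumptions, not a confirmed recipe.

```python
import shlex


def cascade_commands(db, tmp_dir, min_seq_ids=(0.8, 0.5, 0.3), coverage=0.8):
    """Build one `mmseqs cluster` call per identity level (hypothetical layout)."""
    commands = []
    current_db = db
    for ident in min_seq_ids:
        tag = str(int(round(ident * 100)))
        clu = f"{db}_clu{tag}"   # clustering result at this level
        rep = f"{db}_rep{tag}"   # representatives passed to the next level
        commands.append(["mmseqs", "cluster", current_db, clu, tmp_dir,
                         "--min-seq-id", str(ident), "-c", str(coverage),
                         "--cluster-reassign"])
        # Assumption: extract this level's representative sequences as a sub-database.
        commands.append(["mmseqs", "createsubdb", clu, current_db, rep])
        current_db = rep
    return commands


for argv in cascade_commands("panfam_DB", "tmp"):
    print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before running anything on a 50-million-sequence database.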
Sorry for the delay, I had to work on another project.
The first step is to remove the fragments:
prefilter at --max-seqs 4000 --min-ungapped-score 100 --comp-bias-corr 0 -s 1
rescorediagonal --min-seq-id 0.9 -c 0.95 --cov-mode 1
cluster --cluster-mode 2
createsubdb
clusthash --min-seq-id 0.9
cluster --cluster-mode 2
createsubdb
createsubdb
filterdb
align -c 0.9 --alignment-mode 2 --min-seq-id 0.9 --comp-bias-corr 0
cluster --cluster-mode 2
createsubdb
The second step is to create families at 80% coverage and 80% identity:
mmseqs cluster --max-seqs 300 -c 0.8 --comp-bias-corr 1 -s 4 --kmer-per-seq 80 --alignment-mode 2 --min-seq-id 0.8 -e 0.001 --max-seq-len 32768 --max-rejected 2147483647 --cluster-mode 0
mmseqs createsubdb
The next step is to align at 30% identity and 80% coverage:
mmseqs search --max-seqs 1000 -c 0.8 --comp-bias-corr 1 -s 7 --alignment-mode 3 --min-seq-id 0.3
mmseqs createsubdb
To finish, I cluster at 50% identity and at 30% identity:
mmseqs filterdb --filter-column 3 --filter-regex '(0\.[5-9][0-9]{2}|1\.000)'
mmseqs clust --cluster-mode 0 (at 50%)
mmseqs clust --cluster-mode 0 (at 30%)
I hope this is clear and that you can help me improve it.
Thanks
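For readers checking the filterdb step above, here is a small sketch of which identity values the regex keeps. It assumes that column 3 of the alignment result holds the sequence identity written with exactly three decimals, and it uses Python's re.fullmatch as a stand-in for filterdb's own matching, which may differ in detail. The function name is hypothetical.

```python
import re

# Same pattern as passed to --filter-regex in the pipeline above.
SEQ_ID_PATTERN = re.compile(r"(0\.[5-9][0-9]{2}|1\.000)")


def keeps(seq_id_field):
    """Return True if the identity field would pass the >= 50% filter."""
    return SEQ_ID_PATTERN.fullmatch(seq_id_field) is not None


for value in ["0.499", "0.500", "0.873", "1.000", "0.5", "1.00"]:
    print(value, "kept" if keeps(value) else "dropped")
```

Note that the pattern depends on the three-decimal formatting: a value written as 0.5 or 1.00 would be dropped even though it is at least 50% identity.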
Expected Behavior
I'm giving MMseqs2 360G of memory and 36 threads. I expected it to use all of this memory to speed up the prefiltering.
Current Behavior
During the prefiltering step, MMseqs2 used only 10% of the memory. I think it is using 80% of the estimated memory consumption, because it reports 48G for the Estimated memory consumption, but I'm not sure.
Context
I'm currently trying to create protein families from a large database: I have 50 million proteins to cluster. I wrote a workflow based on Uniclust. I would like to speed up the workflow, particularly the prefilter step, without losing sensitivity. I tried using --split-memory-limit to increase the memory, but it is not working. Is there another solution?
Here is the command line I would like to speed up:
mmseqs search $INPUT $INPUT "$ALIGN_DIR/align" $TMP_DIR --max-seqs 1000 -c 0.8 --comp-bias-corr 1 -s 7 --alignment-mode 3 --min-seq-id 0.3 --threads 36
Your Environment
Thanks for your work