sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License
mmseq database prep #70

Open listofdina opened 3 years ago

listofdina commented 3 years ago

Hi there, I'm trying to set up the MMseqs2 search locally on my machine. When I run mmseqs tsv2exprofiledb uniref30_2103 uniref30_2103_db, I get:

Invalid Command: tsv2exprofiledb
Did you mean "mmseqs convertprofiledb"?

I freshly installed MMseqs2 using conda. My current version is 13.45111. Is there a workaround for this issue?

And in any case, thank you very much for all the effort you put into this project!

konstin commented 3 years ago

ColabFold currently requires the latest git version of MMseqs2, which is not part of any release yet. For the time being, you therefore need to compile MMseqs2 from source (https://github.com/soedinglab/MMseqs2/wiki#compile-from-source-under-linux).
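
A minimal sketch of the compile-from-source route, assuming git, cmake, and a C++ compiler are already installed. The steps follow the general pattern in the MMseqs2 build instructions; the function name and the default clone directory are illustrative only, and the linked wiki page remains the authoritative reference.

```shell
# Sketch only: build_mmseqs_from_source is a hypothetical helper, not part of MMseqs2 or ColabFold.
# The body runs in a subshell ( ... ) so the caller's working directory is not changed.
build_mmseqs_from_source() (
    # $1: directory to clone into (assumed default: ./MMseqs2)
    target_dir="${1:-MMseqs2}"
    git clone https://github.com/soedinglab/MMseqs2.git "$target_dir" &&
    mkdir -p "$target_dir/build" &&
    cd "$target_dir/build" &&
    cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. .. &&
    make &&
    make install &&
    echo "Add $(pwd)/bin to PATH to use this build"
)

# Usage (not run here, since it needs network access and a toolchain):
# build_mmseqs_from_source "$HOME/src/MMseqs2"
```

After putting the freshly built bin directory first on PATH, the tsv2exprofiledb command from the original post should be found, provided the checked-out commit includes it.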

listofdina commented 3 years ago

Hi guys, really sorry to bug you. I downloaded the databases and converted them as per the instructions. When I run the command line mentioned there, I get the output below, and the process doesn't move for more than an hour. Is it a memory issue? Should anything be done differently with the databases? Again, thanks for all your help.

createdb ../kemp_elim/round5/alphaF/folding/154_UM_30_E162W216_bb42130-1i4n_prossed_theozime_1/154_UM_30_E162W216_bb421301i4n_prossed_theozime_1_6e29d.fasta result//qdb

Converting sequences

Time for merging to qdb_h: 0h 0m 0s 22ms
Time for merging to qdb: 0h 0m 0s 22ms
Database type: Aminoacid
Time for processing: 0h 0m 0s 163ms
Create directory result//tmp
search result//qdb /shareDB/ColabFold//uniref30_2103_db result//res result//tmp --num-iterations 3 --db-load-mode 2 -a -s 8 -e 0.1 --max-seqs 10000

prefilter result//qdb /shareDB/ColabFold//uniref30_2103_db.idx result//tmp/17146769197291514678/pref_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 8 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 10000 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 2 --pca 1 --pcb 1.5 --threads 56 --compressed 0 -v 3

Index version: 16
Generated by:  9fded60acb370d91db95ea4efbe43d5151163c8a
ScoreMatrix:  VTML80.out
Query database size: 1 type: Aminoacid
split was set to 1 but at least 2 are required. Please run with default paramerters
Estimated memory consumption: 128G
Process needs more than 113G main memory.
Increase the size of --split or set it to 0 to automatically optimize target database split.
Computed index is too large. Avoid using the index.
Target database size: 29291635 type: Aminoacid
Process prefiltering step 1 of 1

k-mer similarity threshold: 96
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1
Target db start 1 to 29291635
[=================================================================] 100.00% 1 eta -

milot-mirdita commented 3 years ago

I extended the script slightly to give more control over memory here: https://gist.github.com/milot-mirdita/67509c248746c4c774128fc84ab91b6f

I'd recommend using the two new parameters to run without a precomputed index:

INDEX=${11:-1}
DB_LOAD_MODE="${12:-2}"

Pass 0 for both of these. Also make sure you are using the latest MMseqs2 commit, since disabling the index is a new feature.
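
For readers unfamiliar with the syntax, here is a small sketch of how positional parameters with defaults such as ${11:-1} behave. The function show_settings and the placeholder arguments a through j are hypothetical; they only stand in for whatever the first ten arguments of the real script are.

```shell
# Illustrative only: show_settings is a hypothetical function, not part of the gist.
show_settings() {
    # ${11:-1} expands to the 11th positional argument, or to 1 if it is unset or empty.
    index="${11:-1}"
    # ${12:-2} expands to the 12th positional argument, or to 2 if it is unset or empty.
    db_load_mode="${12:-2}"
    echo "INDEX=$index DB_LOAD_MODE=$db_load_mode"
}

# Ten arguments only: both settings fall back to their defaults.
show_settings a b c d e f g h i j
# prints: INDEX=1 DB_LOAD_MODE=2

# Twelve arguments with 0 in positions 11 and 12: both defaults are overridden.
show_settings a b c d e f g h i j 0 0
# prints: INDEX=0 DB_LOAD_MODE=0
```

So to turn the precomputed index off, all earlier positional arguments have to be supplied explicitly, followed by 0 and 0.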

If you run with a sufficiently large query FASTA file, the time spent indexing will barely affect the total runtime.
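
One way to get a large query file is to merge many single-sequence FASTA files into one before searching. The sketch below assumes the search script accepts a multi-sequence FASTA as its query; merge_fastas and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: concatenate FASTA files into a single query file.
merge_fastas() {
    # $1: output file; remaining arguments: input FASTA files
    out="$1"
    shift
    # "awk 1" prints every input line followed by a newline, so a file that
    # lacks a trailing newline cannot fuse its last line with the next header.
    awk 1 "$@" > "$out"
}

# Hypothetical helper: count sequences by counting FASTA header lines.
count_sequences() {
    grep -c '^>' "$1"
}

# Usage (file names are placeholders):
# merge_fastas all_queries.fasta design_0001.fasta design_0002.fasta
# count_sequences all_queries.fasta
```

With one merged query file, any index-related setup cost is paid once per batch rather than once per sequence.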

Also make sure you are running an exclusive job, as this will need a lot of RAM and other processes might negatively affect the job.
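
A rough pre-flight check along these lines can be sketched as follows, assuming a Linux machine with /proc/meminfo. Both helper functions are hypothetical, and the figure of 128 is simply the estimate printed in the log earlier in this thread, which may not use exactly the same units.

```shell
# Hypothetical helper: read /proc/meminfo-style text on stdin and print
# MemAvailable in whole GiB (the kernel reports the value in kB).
mem_available_gib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }'
}

# Hypothetical helper: succeed only if available memory covers the requirement.
has_enough_memory() {
    # $1: available GiB, $2: required GiB
    [ "$1" -ge "$2" ]
}

# Usage on Linux (128 is taken from "Estimated memory consumption: 128G" in the log):
# avail=$(mem_available_gib < /proc/meminfo)
# if ! has_enough_memory "$avail" 128; then
#     echo "Not enough free RAM for the precomputed index; consider running without it"
# fi
```

This only reports what is free at launch time; it does not prevent other jobs from claiming memory later, which is why an exclusive job is still the safer option.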

listofdina commented 3 years ago

That works. Really great work. Thank you very much.

listofdina commented 3 years ago

Hey, I would love to understand the underlying cause of my problems running the original script. I'm working on a machine with 56 cores and 126 GB of memory. Whenever I run MMseqs2 with the suggested script, it takes about 1.5 h. How come the API can do this so quickly? Thanks for the response.

Edit: the 1.5 h is for a single sequence of 253 aa.

listofdina commented 3 years ago

Maybe just to give context: we're trying to fold around 100K structures, and a neat local solution is very much needed. Thanks again.