soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

How to avoid getting multiple split databases #644

Closed: tnn111 closed this issue 1 year ago

tnn111 commented 2 years ago

Hi,

I'm trying to use the taxonomy feature, and when I do, my output DB seems to be split into many smaller DBs. Is there any way to control this split? I'd like to just turn it off. I have 1 TB of memory, so I shouldn't have problems.

Other than that, this works great!

milot-mirdita commented 2 years ago

You can set the MMSEQS_FORCE_MERGE environment variable (e.g. export MMSEQS_FORCE_MERGE=1). The split databases are, however, an IO optimization and not related to memory. Merging after every module invocation can slow MMseqs2 down considerably.
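
A minimal sketch of how that suggestion might look in practice. The database names (queryDB, targetDB, taxonomyResult) and the tmp directory are placeholders assumed for illustration and do not come from this thread; the guard only keeps the snippet harmless on a machine without MMseqs2 or without those databases.

```shell
# Ask MMseqs2 to merge split output databases (variable named in the comment above).
export MMSEQS_FORCE_MERGE=1

# Placeholder names, assumed for illustration only.
QUERY_DB=queryDB
TARGET_DB=targetDB
RESULT_DB=taxonomyResult
TMP_DIR=tmp

# Run the taxonomy workflow only if mmseqs and the query database are present.
if command -v mmseqs >/dev/null 2>&1 && [ -e "$QUERY_DB" ]; then
  mmseqs taxonomy "$QUERY_DB" "$TARGET_DB" "$RESULT_DB" "$TMP_DIR"
else
  echo "mmseqs or $QUERY_DB not found; skipping taxonomy run" >&2
fi
```

Because the variable is exported, it applies to every module the workflow calls, which is where the slowdown mentioned above would come from.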

tnn111 commented 2 years ago

Is there a way of merging them after the run is done? It’s not a big deal; it’s just a little less cluttered.

I really appreciate the software. I’ve been using the taxonomy module extensively with impressive results. Thank you!

(Sent by email in reply to Milot Mirdita's message of Dec 3, 2022, 20:39.)

PabloOfEpidemiology commented 1 year ago

I'm having the same problem with the linclust command. I get many DB files, perhaps because the dataset I am clustering is huge (16 million SARS sequences). I wonder if there is a way to merge them post-alignment/linclust.

milot-mirdita commented 1 year ago

easy-linclust will merge the results into easily processable .tsv files. You should use the linclust workflow only if you want to process the MMseqs2 internal database formats with other MMseqs2 modules.
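
A rough sketch of the easy-linclust route, under assumed names: sequences.fasta, the clusterRes prefix, and the tmp directory are placeholders, not values from this thread. The guard simply skips the run when MMseqs2 or the input file is missing.

```shell
# Placeholder names, assumed for illustration only.
INPUT=sequences.fasta
PREFIX=clusterRes
TMP_DIR=tmp

if command -v mmseqs >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  # easy-linclust takes a FASTA/FASTQ file and writes flat result files
  # named after the prefix, including ${PREFIX}_cluster.tsv.
  mmseqs easy-linclust "$INPUT" "$PREFIX" "$TMP_DIR"
else
  echo "mmseqs or $INPUT not found; skipping easy-linclust" >&2
fi
```

If the linclust workflow has already been run, the createtsv module (for example, mmseqs createtsv DB DB DB_clu DB_clu.tsv, with DB and DB_clu again standing in for your own database names) should produce a comparable single .tsv from the internal result database without re-clustering.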