soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Very large createindex outputs. #687

dm-kuba closed this issue 1 year ago

dm-kuba commented 1 year ago

Hi,

When I run createdb followed by createindex on a FASTA DB file, the end result (all the generated output files together) is roughly 10x bigger than the input FASTA file. Most of that is the .idx files generated by createindex.

The only way I got mmseqs to run fast was by using db_load_mode=2, which keeps the entire target DB in memory at once.

Running mmseqs search efficiently against a large DB thus presents really large memory requirements. Is there any way around it (either currently, or planned)? E.g. searching against a compressed version of the DB?

Thank you! Kuba
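For readers following along, the steps described in the post above could be sketched roughly as follows. This is a minimal sketch, not the poster's actual setup: the file and database names (`targets.fasta`, `targetDB`, `queryDB`, `resultDB`, `tmp`) and the helper functions `build_commands` and `run` are hypothetical, and the query database is assumed to have been created separately with `createdb`. By default the script only prints the command lines and does not execute anything.

```python
import shlex
import shutil
import subprocess


def build_commands(fasta, target_db, query_db, result_db, tmp_dir, load_mode=2):
    """Return mmseqs command lines for indexing a target DB and searching it.

    All path arguments are placeholders chosen for illustration.
    """
    return [
        # Convert the FASTA file into an MMseqs2 sequence database.
        ["mmseqs", "createdb", fasta, target_db],
        # Precompute the index; this is the step that writes the large .idx files.
        ["mmseqs", "createindex", target_db, tmp_dir],
        # Search with the database load mode mentioned in the post above.
        ["mmseqs", "search", query_db, target_db, result_db, tmp_dir,
         "--db-load-mode", str(load_mode)],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only if dry_run is False and mmseqs is on PATH."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run and shutil.which("mmseqs"):
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(build_commands("targets.fasta", "targetDB", "queryDB", "resultDB", "tmp"))
```

Comparing the size of the input FASTA file with the files written next to `targetDB` after the `createindex` step is one way to reproduce the size ratio reported above.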

martin-steinegger commented 1 year ago

Could you please explain your use case so that we can recommend a solution?

dm-kuba commented 1 year ago

Hi Martin,

We would like to copy the whole database into RAM once and run multiple queries against it. However, all the database files are too large to fully fit in memory and mmap-ing is not an option.

So, ideally, we would like to search against a compressed version of the database; is that possible?

I'm aware of the --compressed flag for createdb, but that still leaves us with the same really large .idx files that take up most of the space. Is there anything we're missing on the compression side?

Thanks, Kuba

martin-steinegger commented 1 year ago

The index cannot be shrunk if you want to allow for real-time searches. Depending on the size of your database, you could implement the same clustered MMseqs2 search workflow as implemented in ColabFold. This will reduce memory requirements massively. We plan to eventually offer this workflow directly in MMseqs2.
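To illustrate the general idea behind a clustered search, here is a toy sketch in Python. It is not the ColabFold or MMseqs2 implementation: the k-mer Jaccard score is a stand-in for a real alignment score, and the function names, the threshold, and the example sequences are all made up for illustration. The point it tries to show is that only the cluster representatives need to be held in memory for the first stage, while cluster members are looked up only for clusters that were hit.

```python
def kmers(seq, k=3):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (toy stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def clustered_search(query, representatives, members, fetch_sequence, threshold=0.3):
    """Two-stage search over a clustered database.

    representatives: dict of cluster id -> representative sequence (kept in RAM)
    members:         dict of cluster id -> list of member ids
    fetch_sequence:  callable mapping a member id to its sequence, e.g. a disk
                     lookup that is only used for clusters that were hit
    """
    # Stage 1: compare the query against the representatives only.
    hit_clusters = [cid for cid, rep in representatives.items()
                    if similarity(query, rep) >= threshold]
    # Stage 2: expand the hit clusters and re-score their members.
    results = []
    for cid in hit_clusters:
        for member_id in members[cid]:
            score = similarity(query, fetch_sequence(member_id))
            if score >= threshold:
                results.append((member_id, score))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Hypothetical toy data: two clusters, three sequences in total.
all_seqs = {
    "a1": "MKTAYIAKQRQISFVK",
    "a2": "MKTAYIAKQRQISFVR",
    "b1": "GSHMLEDPVDAFWWWW",
}
reps = {"A": all_seqs["a1"], "B": all_seqs["b1"]}
cluster_members = {"A": ["a1", "a2"], "B": ["b1"]}

hits = clustered_search("MKTAYIAKQRQISFVK", reps, cluster_members, all_seqs.__getitem__)
print(hits)
```

In this toy run only cluster A passes the first stage, so the sequence of `b1` is never fetched in the second stage.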

dm-kuba commented 1 year ago

Thank you!