Estimated running time for createdb

soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite

https://mmseqs.com

MIT License

Estimated running time for createdb #495

Closed yonghanyu closed 1 year ago

yonghanyu commented 3 years ago

Hi there,

I am currently using MMseqs2 to cluster more than 20 billion protein sequences. I intend to complete the task by running the createdb, clusthash and linclust modules. However, the createdb module alone (one-line faa sequences, with index only) has already taken more than 700 CPU hours and has not finished yet. In the paper, MMseqs2 clusters 1.6 billion sequences in around 10 hours. I am wondering whether that figure includes the time for the createdb and clusthash steps?

Also, are there any suggestions on how to speed up the createdb module?

milot-mirdita commented 3 years ago

MMseqs2 is limited to databases of at most ~4 billion sequences (UINT_MAX). You have to cluster in multiple splits. @martin-steinegger should be able to help with an example.
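To make the limit concrete, here is a rough back-of-the-envelope sketch of how many splits a given input would need. It assumes UINT_MAX is the 32-bit value 2^32 - 1 and that every split may be filled right up to that limit; the function name `min_splits` is made up for illustration, and the practical per-database limit may be somewhat lower.

```python
# Assumption: UINT_MAX is the 32-bit unsigned maximum (2**32 - 1).
UINT_MAX = 2**32 - 1  # 4,294,967,295


def min_splits(total_sequences: int, max_per_db: int = UINT_MAX) -> int:
    """Smallest number of databases needed so that none exceeds max_per_db.

    Uses integer ceiling division to avoid floating-point rounding
    on very large counts.
    """
    if total_sequences < 0 or max_per_db < 1:
        raise ValueError("counts must be non-negative and the limit positive")
    return -(-total_sequences // max_per_db)


if __name__ == "__main__":
    # 20 billion sequences, filled up to the assumed hard limit.
    print(min_splits(20_000_000_000))
    # Same input with a more conservative 2 billion sequences per split.
    print(min_splits(20_000_000_000, 2_000_000_000))
```

For the 20 billion sequences mentioned in this issue, that works out to at least 5 splits at the assumed limit, or 10 splits at 2 billion sequences each.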

yonghanyu commented 3 years ago

Hi, sorry for bothering but any update on this?

For now I am splitting the proteins into multiple FASTA files, each containing at most 2 billion sequences. I will then run clusthash and linclust on each split. Finally, some tool like mergedb will be used to combine the results, followed by a further clusthash/linclust run. I am wondering whether this is the correct way to do it, since I cannot find related information in the documentation.
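The splitting step described above could be sketched roughly as follows. This is only an illustration, not an MMseqs2 tool: the function names `assign_parts` and `split_fasta` and the output naming scheme are invented here, and the sketch assumes a plain-text FASTA file in which every record starts with a `>` header line.

```python
from typing import Iterable, Iterator, List, Tuple


def assign_parts(lines: Iterable[str], max_seqs: int) -> Iterator[Tuple[int, str]]:
    """Yield (part_index, line), starting a new part every max_seqs records."""
    if max_seqs < 1:
        raise ValueError("max_seqs must be at least 1")
    part = 0
    count = 0  # records already assigned to the current part
    for line in lines:
        if line.startswith(">"):
            if count == max_seqs:
                part += 1
                count = 0
            count += 1
        yield part, line


def split_fasta(path: str, prefix: str, max_seqs: int) -> List[str]:
    """Stream a FASTA file into numbered parts; return the paths written."""
    written: List[str] = []
    out = None
    current = -1
    try:
        with open(path) as handle:
            for part, line in assign_parts(handle, max_seqs):
                if part != current:
                    if out is not None:
                        out.close()
                    current = part
                    name = f"{prefix}.{part:04d}.fasta"  # hypothetical naming
                    out = open(name, "w")
                    written.append(name)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

The file is streamed line by line, so memory use stays flat regardless of input size. A call such as `split_fasta("all.faa", "split", 2_000_000_000)` (file names hypothetical) would produce `split.0000.fasta`, `split.0001.fasta`, and so on.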

snayfach commented 1 year ago

Is there any update on this? Any suggestions for creating an MMseqs2 database of this size?

milot-mirdita commented 1 year ago

Please create a new issue describing your use case.

If you want to search more than ~4 billion sequences at once, I'd recommend first clustering the database (in multiple stages, then merging the clusterings) to dereplicate it, and then searching against this smaller database.
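As a rough illustration of what merging staged clusterings means, the sketch below composes two member-to-representative mappings: one from clustering within each split, and one from clustering the split representatives against each other. The dictionaries and the function name `merge_clusterings` are hypothetical; this is not how MMseqs2 stores or merges its results internally.

```python
from typing import Dict


def merge_clusterings(stage1: Dict[str, str], stage2: Dict[str, str]) -> Dict[str, str]:
    """Compose two clusterings into one member -> final representative map.

    stage1 maps every sequence to its representative within its split.
    stage2 maps each split representative to its representative after
    the representatives were clustered together. A representative that
    is missing from stage2 is assumed to stay its own representative.
    """
    return {member: stage2.get(rep, rep) for member, rep in stage1.items()}


if __name__ == "__main__":
    # Split 1 produced cluster {a, b}; split 2 produced cluster {c, d}.
    stage1 = {"a": "a", "b": "a", "c": "c", "d": "c"}
    # Clustering the representatives joined "c" to "a".
    stage2 = {"a": "a", "c": "a"}
    print(merge_clusterings(stage1, stage2))
```

After the merge, every member of both original clusters points at the single final representative, which is the dereplicated set one would then search against.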

Alternatively, I'd recommend creating multiple databases and searching each individually. We have had multiple requests to implement a parameter that would set the real DB size for E-value calculation externally. Maybe something like that would help for your use case?
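To show why the real DB size matters when searching splits individually, here is a minimal sketch that rescales an E-value reported against one split to the size of the full database. It assumes the E-value grows linearly with the size of the search space, as in standard Karlin-Altschul statistics; whether this matches MMseqs2's own calculation exactly is an assumption, and `rescale_evalue` is a made-up helper, not an MMseqs2 option.

```python
def rescale_evalue(evalue: float, split_size: int, total_size: int) -> float:
    """Scale an E-value from a split to the full database size.

    Assumes E-values are proportional to database size. Both sizes
    must be in the same unit (for example residues, or sequences).
    """
    if split_size <= 0 or total_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * (total_size / split_size)


if __name__ == "__main__":
    # A hit found in one of ten equally sized splits.
    print(rescale_evalue(1e-5, 2_000_000_000, 20_000_000_000))
```

Under this assumption, a hit found in one of ten equal splits is ten times less significant against the combined database, so thresholds applied per split would be too permissive without such a correction.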