labstructbioinf / pLM-BLAST

Detection of remote homology by comparison of protein language model representations
https://toolkit.tuebingen.mpg.de/tools/plmblast
MIT License

FEATURE REQUEST: faster and more robust database construction #13

Closed: DaRinker closed this issue 1 year ago

DaRinker commented 1 year ago

Use case example: I need to build a database from 100s of species' proteomes. The current (GPU-based) approach to building this database in pLM-BLAST would take over one month of uninterrupted GPU time. This is unrealistic for me (and, I'm guessing, for most users) because unfettered access to a GPU with sufficient memory for that long is not common. Moreover, any technical issue arising during the build process (power loss, memory overflow, etc.) would force me to begin the process all over again.

Therefore, I think many users would benefit if the database construction step could be enhanced to include:

  1. parallelization options
  2. a checkpointing feature

And thanks for the great software. Our pilot testing has been promising, and we're very eager to try it out on our larger datasets.

Argusmocny commented 1 year ago

Checkpointing has been added; I am still working on the parallelization feature.

Argusmocny commented 1 year ago

Both features are now live and tested.

python embeddings.py start infile.fasta output --asdir -nproc X -bs 0

will spawn X independent processes and distribute input sequences over them.
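As a rough illustration of what distributing sequences over X processes can look like, here is a minimal sketch of round-robin sharding. It is not taken from the pLM-BLAST code: the function name `split_round_robin` and the toy records are hypothetical, and the real `embeddings.py` may partition its input differently.

```python
def split_round_robin(records, nproc):
    """Deal records into nproc shards, one record at a time.

    Hypothetical helper for illustration only; not part of pLM-BLAST.
    """
    if nproc < 1:
        raise ValueError("nproc must be at least 1")
    shards = [[] for _ in range(nproc)]
    for index, record in enumerate(records):
        # Record i goes to shard i mod nproc, so shard sizes differ by at most one.
        shards[index % nproc].append(record)
    return shards


# Toy sequence identifiers standing in for FASTA records.
shards = split_round_robin(["s1", "s2", "s3", "s4", "s5"], 2)
print(shards)  # [['s1', 's3', 's5'], ['s2', 's4']]
```

Each shard could then be handed to its own worker process. Round-robin keeps the shard counts balanced, although it does not balance by sequence length.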

Typing

python embeddings.py resume output

will resume broken or interrupted calculations. For more details, see the README file.
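For readers curious how a resume step of this kind can work in general, the following is a minimal sketch of checkpointing by recording finished item ids. It is an assumption-laden illustration, not the pLM-BLAST implementation: the JSON checkpoint file and the names `load_done`, `save_done`, and `run_with_checkpoint` are all hypothetical.

```python
import json
import os
import tempfile


def load_done(checkpoint_path):
    """Return the set of ids recorded as finished (empty if no checkpoint exists)."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as fh:
        return set(json.load(fh))


def save_done(checkpoint_path, done):
    """Write finished ids via a temp file and an atomic rename,
    so a crash cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp_path, checkpoint_path)


def run_with_checkpoint(items, work_fn, checkpoint_path):
    """Process (id, payload) pairs, skipping ids already in the checkpoint.

    Returns the ids processed during this call.
    """
    done = load_done(checkpoint_path)
    processed_now = []
    for item_id, payload in items:
        if item_id in done:
            continue  # finished in an earlier run
        work_fn(item_id, payload)
        done.add(item_id)
        save_done(checkpoint_path, done)  # checkpoint after every item
        processed_now.append(item_id)
    return processed_now
```

Calling `run_with_checkpoint` a second time with the same checkpoint path redoes only the items that had not finished. Writing the checkpoint after every item is the simplest choice; for very large inputs, saving every N items would trade a little repeated work for less I/O.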

Looking forward to your feedback.

DaRinker commented 1 year ago

This is fantastic! Trying now