DaRinker closed this issue 1 year ago
Checkpointing has been added; I am still working on the parallelization feature.
Both features are now live and tested.
Running
python embeddings.py start infile.fasta output --asdir -nproc X -bs 0
will spawn X independent processes and distribute the input sequences over them.
Typing
python embeddings.py resume output
will resume broken or interrupted calculations. For more details, see the Readme file.
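For readers curious how a start/resume scheme like this can work in principle, here is a minimal sketch. It is not the actual embeddings.py code: the function names, the `.done`/`.tmp` file naming, and the stand-in `embed` callable are all assumptions made for illustration. It deals sequences out to workers round-robin, writes one output file per sequence, and treats any sequence without a finished file as still pending.

```python
from pathlib import Path


def split_round_robin(items, nproc):
    """Deal items into nproc chunks, one chunk per worker process."""
    return [items[i::nproc] for i in range(nproc)]


def pending_ids(seq_ids, outdir):
    """Return the ids that have no finished output file yet (the 'resume' set)."""
    outdir = Path(outdir)
    return [sid for sid in seq_ids if not (outdir / f"{sid}.done").exists()]


def process_chunk(chunk, outdir, embed=len):
    """Hypothetical worker: 'embed' each (id, sequence) pair, one file per sequence.

    `embed` is a stand-in (sequence length by default), not a real model call.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    for sid, seq in chunk:
        tmp = outdir / f"{sid}.tmp"
        tmp.write_text(str(embed(seq)))
        # Rename only after the write finished, so a crash never leaves
        # a half-written file that looks complete.
        tmp.replace(outdir / f"{sid}.done")
```

In a sketch like this, each chunk would be handed to its own process, and a resume step would call `pending_ids` first and redistribute only what is left.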
Looking forward to your feedback.
This is fantastic! Trying now
Use case example: I need to build a database from hundreds of species' proteomes. The current (GPU-based) approach to building this database in plm-blast would take over one month of uninterrupted GPU time. This is unrealistic for me (and, I'm guessing, for most users) because unfettered access to a GPU with sufficient memory for that long is uncommon. Moreover, any technical issue arising during the build (power loss, memory overflow, etc.) would force me to start the process all over again.
Therefore, I think many users would benefit if the database construction step could be enhanced to include:
And thanks for the great software. Our pilot testing has been promising, and we're very eager to try it out on our larger datasets.