sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

Using cached weights with bio_embeddings #114

Closed deniseduma closed 3 years ago

deniseduma commented 3 years ago

Hi,

I have an issue using bio_embeddings on a Slurm cluster: the Internet is somehow not accessible there, so I would need the pre-trained model weights downloaded and cached locally, as you mention here:

"Same as before, but using cached weights, which is faster: use_case_two Use case: you have a set of proteins (in FASTA format) and want to create amino acid-level embeddings, as well as protein-level embeddings. Additionally, you have an annotation file with some property for a subset of the proteins in your dataset. For these, you want to produce a visualization of the sequences and how they separate in space. This time around: you downloaded the models locally (faster execution) and want to provide the path to the model's weights and options."

Could you please point me to how to do this, that is, what to download, from where, and how to adjust the paths? I would like to use the Bert BFD and SeqVec embedders.

Thank you, Denise

konstin commented 3 years ago

Hi Denise,

When using bio_embeddings installed with pip, the weights are downloaded only once and stored in the user cache dir (usually ~/.cache/bio_embeddings). This, however, may not work well on a cluster if the home directory isn't shared across nodes.
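You can check what has already been cached with e.g.:

ls -lh ~/.cache/bio_embeddings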

For docker, you could use a volume shared between multiple runs:

docker run --rm --gpus all \
    -v "$(pwd)/examples/docker":/mnt \
    -v bio_embeddings_weight_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/bioembeddings/bio_embeddings:v0.1.6 /mnt/config.yml

I've updated the readme accordingly, because that's actually a good thing to have as a default.

To use a custom weight location, go to https://github.com/sacdallago/bio_embeddings/blob/develop/bio_embeddings/utilities/defaults.yml and download the files for the model you chose. If it's a zip file, you need to unpack it. For SeqVec and Bert, this would be:

mkdir my_model_directory
cd my_model_directory
# SeqVec: weights and options files
wget http://data.bioembeddings.com/public/embeddings/embedding_models/seqvec/weights.hdf5
wget http://data.bioembeddings.com/public/embeddings/embedding_models/seqvec/options.json
# ProtTrans Bert BFD: the zip needs to be unpacked into its own directory
wget http://data.bioembeddings.com/public/embeddings/embedding_models/bert/prottrans_bert_bfd.zip
mkdir prottrans_bert_bfd
cd prottrans_bert_bfd
unzip ../prottrans_bert_bfd.zip
cd ..
rm prottrans_bert_bfd.zip
cd ..

In the pipeline definition you can then use them like this:

global:
  sequences_file: fasta.fa
  prefix: my_prefix
seqvec_embeddings:
  type: embed
  protocol: seqvec
  weights_file: /path/to/my_model_directory/weights.hdf5
  options_file: /path/to/my_model_directory/options.json
bert_embeddings:
  type: embed
  protocol: prottrans_bert_bfd
  model_directory: /path/to/my_model_directory/prottrans_bert_bfd
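
With the config saved as, say, config.yml (the filename is just a placeholder), you then run the pipeline as usual via the CLI entry point:

bio_embeddings config.yml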

I've never worked with Slurm, so unfortunately I can't help with how to provision the files on the cluster, but I hope the above examples help you find a good solution.

deniseduma commented 3 years ago

Hi Konstin,

Thank you very much for getting back to me, I really appreciate it!

I eventually figured it out and did pretty much what you recommend!

Basically, I ran bio_embeddings on the login node of the Slurm cluster, which has no issue connecting to the Internet, and noticed that the weights got downloaded. Then, since my home folder is shared across all nodes in the cluster, I was able to run bio_embeddings on one of the GPU nodes, which used the weights cached in my home folder under ~/.cache/bio_embeddings.
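In case it helps someone else, the pattern was roughly this (a sketch; the Slurm options are placeholders and depend on your cluster setup):

# on the login node (has Internet access): run once so the weights land in ~/.cache/bio_embeddings
bio_embeddings config.yml

# on the GPU nodes (no Internet access): submit the same run; it reuses the cached weights
sbatch --gres=gpu:1 --wrap "bio_embeddings config.yml"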

It took me a while to figure this out, so I guess I got pretty frustrated and started bombarding you guys with messages for help! ;) Sorry about that!

I also wasn't able to install bio_embeddings on Google Colab anymore, but now that I have it up and running on the cluster I'm OK.

Thanks a bunch for getting back to me and for your help!

Denise