
Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

Bio Embeddings

Resources to learn about bio_embeddings:

Project aims:

The project includes:

Installation

You can install bio_embeddings via pip or use it via docker. Mind the additional dependencies for align.

Pip

Install the pipeline and all extras like so:

pip install bio-embeddings[all]

To install the unstable version, please install the pipeline like so:

pip install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git"

If you only need to run a specific model (e.g. an ESM or ProtTrans model), you can install bio-embeddings without the extras and then add only the model-specific extra you need, e.g.:

pip install bio-embeddings
pip install bio-embeddings[prottrans]

The extras are:

Docker

We provide a docker image at ghcr.io/bioembeddings/bio_embeddings. Simple usage example:

docker run --rm --gpus all \
    -v "$(pwd)/examples/docker":/mnt \
    -v bio_embeddings_weights_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/bioembeddings/bio_embeddings:v0.1.6 /mnt/config.yml

See the docker example in the examples folder for instructions. You can also use ghcr.io/bioembeddings/bio_embeddings:latest which is built from the latest commit.

Dependencies

To use the mmseqs_search protocol, or the mmseqs2 functions in align, you additionally need to have MMseqs2 in your PATH.

Installation notes

bio_embeddings was developed for Unix machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly affected by the absence of a GPU and CUDA). For Windows users, we strongly recommend using the Windows Subsystem for Linux.

What model is right for you?

Each model has its strengths and weaknesses (speed, specificity, memory footprint, ...). There is no one-size-fits-all model, and we encourage you to try at least two different models when starting a new exploratory project.

The models prottrans_t5_xl_u50, esm1b, esm, prottrans_bert_bfd, prottrans_albert_bfd, seqvec and prottrans_xlnet_uniref100 were all trained with the goal of systematic predictions. From this pool, we believe the optimal model to be prottrans_t5_xl_u50, followed by esm1b.
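To get a feel for how switching models changes the code, the sketch below uses the prottrans_t5_xl_u50 embedder. The class name and the reduce_per_protein helper reflect our understanding of bio_embeddings.embed, so double-check them against the documentation for your installed version; the corresponding extra from the Installation section must be installed.

    from bio_embeddings.embed import ProtTransT5XLU50Embedder  # assumed class name; see the documentation

    # Weights are downloaded on first use and cached (see the cache mount in the Docker example above).
    embedder = ProtTransT5XLU50Embedder()

    # Per-residue embeddings: one vector per residue of the input sequence.
    per_residue = embedder.embed("SEQVENCE")

    # Fixed-length per-protein embedding, reduced over the residue dimension.
    per_protein = ProtTransT5XLU50Embedder.reduce_per_protein(per_residue)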

Usage and examples

We highly recommend checking out the examples folder for pipeline examples, and the notebooks folder for post-processing pipeline runs and general-purpose use of the embedders.

After having installed the package, you can:

  1. Use the pipeline like:

    bio_embeddings config.yml

    A blueprint of the configuration file and an example setup can be found in the examples directory of this repository; a minimal sketch is also shown after this list.

  2. Use the general purpose embedder objects via python, e.g.:

    from bio_embeddings.embed import SeqVecEmbedder

    embedder = SeqVecEmbedder()
    embedding = embedder.embed("SEQVENCE")

    More examples can be found in the notebooks folder of this repository.
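To make the pipeline usage in item 1 concrete, here is a minimal configuration sketch. The stage names (seqvec_embeddings, projections) and file names are placeholders chosen for illustration; the authoritative list of protocols and options is in the blueprint configuration in the examples directory.

    global:
      sequences_file: sequences.fasta   # input FASTA file (placeholder name)
      prefix: my_run                    # output folder created by the pipeline

    seqvec_embeddings:                  # arbitrary stage label
      type: embed
      protocol: seqvec
      reduce: True                      # also store fixed-length per-protein embeddings

    projections:
      type: project
      protocol: tsne
      depends_on: seqvec_embeddings

Running bio_embeddings on such a file executes the stages in order and writes the results under the prefix directory.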

Cite

If you use bio_embeddings for your research, we would appreciate it if you could cite the following paper:

Dallago, C., Schütze, K., Heinzinger, M., Olenyi, T., Littmann, M., Lu, A. X., Yang, K. K., Min, S., Yoon, S., Morton, J. T., & Rost, B. (2021). Learned embeddings from deep learning to visualize and predict protein sets. Current Protocols, 1, e113. doi: 10.1002/cpz1.113

The corresponding bibtex:

@article{https://doi.org/10.1002/cpz1.113,
author = {Dallago, Christian and Schütze, Konstantin and Heinzinger, Michael and Olenyi, Tobias and Littmann, Maria and Lu, Amy X. and Yang, Kevin K. and Min, Seonwoo and Yoon, Sungroh and Morton, James T. and Rost, Burkhard},
title = {Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets},
journal = {Current Protocols},
volume = {1},
number = {5},
pages = {e113},
keywords = {deep learning embeddings, machine learning, protein annotation pipeline, protein representations, protein visualization},
doi = {https://doi.org/10.1002/cpz1.113},
url = {https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpz1.113},
eprint = {https://currentprotocols.onlinelibrary.wiley.com/doi/pdf/10.1002/cpz1.113},
year = {2021}
}

Additionally, we invite you to cite the work from others that was collected in `bio_embeddings` (see section _"Tools by category"_ below). We are working on an enhanced user guide which will include proper references to all citable work collected in `bio_embeddings`.

Contributors

Want to add your own model? See contributing for instructions.

Non-exhaustive list of tools available (see following section for more details):

Datasets


Tools by category

Pipeline
- align:
  - DeepBlast (https://www.biorxiv.org/content/10.1101/2020.11.03.365932v1)
- embed:
  - ProtTrans BERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
  - SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
  - ProtTrans ALBERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
  - ProtTrans XLNet trained on UniRef100 (https://doi.org/10.1101/2020.07.12.199554)
  - ProtTrans T5 trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
  - ProtTrans T5 trained on BFD and fine-tuned on UniRef50 (in-house)
  - UniRep (https://www.nature.com/articles/s41592-019-0598-1)
  - ESM/ESM1b (https://www.biorxiv.org/content/10.1101/622803v3)
  - PLUS (https://github.com/mswzeus/PLUS/)
  - CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)
- project:
  - t-SNE
  - UMAP
  - PB-Tucker (https://www.biorxiv.org/content/10.1101/2021.01.21.427551v1)
- visualize:
  - 2D/3D sequence embedding space
- extract:
  - supervised:
    - SeqVec: DSSP3, DSSP8, disorder, subcellular location and membrane boundness as in https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8
    - ProtBertSec and ProtBertLoc as reported in https://doi.org/10.1101/2020.07.12.199554
  - unsupervised:
    - via sequence-level (reduced_embeddings), pairwise distance (euclidean like [goPredSim](https://github.com/Rostlab/goPredSim), more options available, e.g. cosine)
General purpose embedders
- ProtTrans BERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
- ProtTrans ALBERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans XLNet trained on UniRef100 (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans T5 trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans T5 trained on BFD + fine-tuned on UniRef50 (https://doi.org/10.1101/2020.07.12.199554)
- fastText
- GloVe
- Word2Vec
- UniRep (https://www.nature.com/articles/s41592-019-0598-1)
- ESM/ESM1b (https://www.biorxiv.org/content/10.1101/622803v3)
- PLUS (https://github.com/mswzeus/PLUS/)
- CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)