sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License
457 stars, 65 forks

Including Sapiens antibody embeddings #171

Open prihoda opened 2 years ago

prihoda commented 2 years ago

Hi @sacdallago and team,

we've been experimenting with training models for human antibody sequence representation, as described in this preprint: https://www.biorxiv.org/content/10.1101/2021.08.08.455394v1

Our Sapiens model and weights are open-sourced. Would you consider merging a PR that integrates the model? What are the prerequisites, and what should be included in the PR?

Thanks! David

sacdallago commented 2 years ago

Hi @prihoda,

sorry for the very late reply, I was away for a while! Sure, I'd be happy to integrate the model. Regarding the data: I can host it for you, or you can upload it to Zenodo if it's less than 50 GB. I noticed that the uptake of data from this resource is quite significant.

I'd be happy to set up a call to discuss if you are still interested :) I'm slowly getting back to maintenance work on bio-embeddings :)

prihoda commented 2 years ago

Hi @sacdallago, that's great to hear.

Just recently I created a dedicated repo for Sapiens: https://github.com/Merck/Sapiens. It's not on PyPI yet, though; it turns out the sapiens package name is blocked (https://github.com/pypa/pypi-support/issues/1651).

It depends on fairseq (https://github.com/pytorch/fairseq). Is that a problem?

By data, you mean the weights? Those are just a few MB, since I kept the model small (by NLP standards).

I'd also be happy to have a call. I just emailed you about another thing :)