sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

Custom embeddings #225

Open rhysnewell opened 1 year ago

rhysnewell commented 1 year ago

Hi Devs,

Thanks for this package; it's really cool work and seems very well put together. I have a question about creating custom embedding sets. In this example (https://github.com/sacdallago/bio_embeddings/blob/develop/notebooks/goPredSim.ipynb), you use the ProtBertBFDEmbedder to generate an embedding for a novel peptide and compare it against a set of reference embeddings (https://github.com/sacdallago/bio_embeddings/blob/develop/notebooks/goPredSim.ipynb). You then use k-NN to determine which UniProt entry best matches the novel peptide and return its accession.

I was wondering: is it possible to create a completely custom reference embedding h5 file from a database other than UniProt (for example, a virulence factor database) and then compare novel peptide embeddings against that reference set? Or is that outside the scope of these models?
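To make the idea concrete, here is a rough sketch of the lookup step I have in mind. It is only an illustration under my own assumptions, not the package's API: the IDs (`VF0001`, ...), the 3-dimensional vectors, and the helper `nearest_references` are all made up. In practice the vectors would be per-protein embeddings produced by the embedder and read from the custom h5 file.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_references(query, reference, k=1):
    """Return the k reference IDs closest to the query as (id, distance) pairs.

    Hypothetical helper: `reference` maps an entry ID to its embedding.
    """
    ranked = sorted(
        (euclidean(query, vec), ref_id) for ref_id, vec in reference.items()
    )
    return [(ref_id, dist) for dist, ref_id in ranked[:k]]


# Toy stand-ins for embeddings of a custom (non-UniProt) reference database.
# Real embeddings would have hundreds or thousands of dimensions.
reference = {
    "VF0001": [0.9, 0.1, 0.0],
    "VF0002": [0.0, 1.0, 0.2],
    "VF0003": [0.1, 0.2, 0.95],
}

# Toy stand-in for the embedding of a novel peptide.
query = [0.05, 0.9, 0.25]

hits = nearest_references(query, reference, k=2)
for ref_id, dist in hits:
    print(f"{ref_id}\t{dist:.3f}")
```

The returned IDs would then be looked up in the custom database's own annotations, in place of the UniProt accessions used in the notebook.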

I just want to make sure this use case is valid before I pursue it.

Cheers, Rhys