sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

Where do the models in "bio_embeddings/utilities/defaults.yml" come from? Where are the docs, parameters, and dataset? #239

Open laraque opened 11 months ago

laraque commented 11 months ago

Hello Team,

Where can I find information about how the models published by this repository were trained? They are linked from the following file:

file: bio_embeddings/utilities/defaults.yml
model: http://data.bioembeddings.com/public/embeddings/embedding_models/word2vec/word2vec.model

For instance, what were the parameters used to train the Word2vec model? Was the CBOW or the skip-gram method used? Which dataset was used?

I need to use a different embedding vector size, but the word2vec model is fixed to an embedding size of 512. Even if I change this parameter in the corresponding embedding pipeline (to 24, for instance), I get this error:

  File "/bio_embeddings/embed/word2vec_embedder.py", line 48, in embed
    embedding[index, :] = self._get_kmer_representation(k_mer)
ValueError: could not broadcast input array from shape (512,) into shape (24,)
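To make the problem easier to reproduce, here is a minimal sketch of what I think is happening. It assumes the embedder allocates its output array with the configured size and then fills it with vectors from the pretrained model, which always have 512 entries. The names `fake_kmer_vector` and `embed`, and the k-mer length of 3, are my own placeholders and not the repository's code:

```python
import numpy as np

MODEL_DIM = 512      # vector size baked into the pretrained word2vec model
CONFIGURED_DIM = 24  # embedding size I set in the pipeline configuration


def fake_kmer_vector(k_mer: str) -> np.ndarray:
    """Hypothetical stand-in for looking up a k-mer in the pretrained model."""
    return np.zeros(MODEL_DIM, dtype=np.float32)


def embed(sequence: str, k: int = 3, dim: int = CONFIGURED_DIM) -> np.ndarray:
    """Fill a (number of k-mers, dim) array with one model vector per k-mer."""
    n_kmers = len(sequence) - k + 1
    embedding = np.zeros((n_kmers, dim), dtype=np.float32)
    for index in range(n_kmers):
        k_mer = sequence[index:index + k]
        # Fails when dim differs from the length of the model's vectors.
        embedding[index, :] = fake_kmer_vector(k_mer)
    return embedding


try:
    embed("SEQWENCE")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

With `dim=MODEL_DIM` the same call succeeds, so it looks like the configured size can only work if it matches the size the model was trained with.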

Thanks for your comments,