sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

The relevant source/paper to cite for each method #164

Open Xinxinatg opened 3 years ago

Xinxinatg commented 3 years ago

Hi, thanks for this comprehensive work. I'm just wondering how I can find out how to cite each of these individual embedding methods. Take the word2vec one in this repo, for example: there are multiple methods based on word2vec, but they go by different names.

sacdallago commented 3 years ago

Hi @Xinxinatg, thanks for bringing this up. It's been a long-standing issue for me to revamp the documentation and give proper credit to model authors. For the majority of the models, you can find at least the link to the manuscript here or here.

For FastText, GloVe and Word2Vec, the weights used in the SeqVec publication were used.
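For reference, a minimal sketch of how these embedders can be called through `bio_embeddings.embed`. The class name `Word2VecEmbedder` and the `embed`/`reduce_per_protein` calls are assumed from the module layout described in the docs linked above; please verify them (and the FastText/GloVe counterparts) against the current documentation before relying on them.

```python
# Hedged sketch: compute embeddings with a word2vec-family embedder from
# bio_embeddings. Class name and weight handling are assumptions; check
# http://docs.bioembeddings.com for the exact API.
from bio_embeddings.embed import Word2VecEmbedder

embedder = Word2VecEmbedder()            # loads the bundled/downloaded weights
sequence = "SEQWENCE"                    # toy amino-acid sequence
per_residue = embedder.embed(sequence)   # 2-D array: positions x embedding dim
per_protein = embedder.reduce_per_protein(per_residue)  # fixed-size vector
```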

Xinxinatg commented 3 years ago

Thanks @sacdallago! Is there any chance that details of the training dataset for the Word2Vec model can be provided? Nowadays it is so hard to find other available sources for extracting embeddings from word2vec-type models.

sacdallago commented 3 years ago

@mheinzinger do you remember how that was done?

mheinzinger commented 3 years ago

Hey @Xinxinatg, thanks a lot for your interest in our work! :) Before we got around to writing up the word2vec-based models, we had already started working on SeqVec, and we dropped the word2vec-based models in order to focus on the more promising direction of LSTM-based models such as SeqVec (and now attention-based ones). So we don't have a publication of our own for the word2vec-based models trained in-house. However, most of them were trained similarly to ProtVec (SwissProt as corpus, k-mer length of 3); I also found some old documentation indicating that we used a minCount of 5 and a window size of 10. Unfortunately, I cannot give you any further information on those word2vec-based models.

If you are looking for references: ProtVec was one of the first (if not the first) to establish word2vec-based models on amino acids. ProNA2020 (https://pubmed.ncbi.nlm.nih.gov/32142788/) uses our in-house word2vec version (if you need an application). If you used the word2vec weights from the SeqVec publication, then you used ProtVec (https://github.com/jowoojun/biovec).
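To make that setup concrete, here is a small sketch of how such a ProtVec-style word2vec model could be trained with gensim. The k-mer length of 3, min_count of 5, and window size of 10 come from the comment above; everything else (the non-overlapping three-frame k-mer splitting, the vector size, the `swissprot.fasta` path) is an illustrative assumption, not the original in-house training script.

```python
# Hypothetical sketch of ProtVec-style word2vec training on protein 3-mers.
# k=3, min_count=5, window=10 follow the parameters mentioned above;
# corpus path, vector size, and 3-frame splitting are assumptions.
from Bio import SeqIO
from gensim.models import Word2Vec

def kmer_sentences(fasta_path, k=3):
    """Yield non-overlapping k-mer 'sentences', one per reading frame (ProtVec-style)."""
    for record in SeqIO.parse(fasta_path, "fasta"):
        seq = str(record.seq)
        for frame in range(k):
            kmers = [seq[i:i + k] for i in range(frame, len(seq) - k + 1, k)]
            if kmers:
                yield kmers

sentences = list(kmer_sentences("swissprot.fasta"))  # placeholder corpus path
model = Word2Vec(
    sentences,
    vector_size=100,  # assumed embedding dimension (gensim >= 4 keyword)
    window=10,        # window size from the comment above
    min_count=5,      # minCount from the comment above
    sg=1,             # skip-gram; assumed, as in ProtVec
    workers=4,
)
model.save("protvec_like.w2v")
```

Per-protein vectors would then typically be obtained by summing or averaging the k-mer vectors of a sequence.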