Doc2Vec / Word2Vec example on biological sequences

piskvorky / gensim

Topic Modelling for Humans

https://radimrehurek.com/gensim

GNU Lesser General Public License v2.1

Doc2Vec / Word2Vec example on biological sequences #645

Closed ziky90 closed 7 years ago

ziky90 commented 8 years ago

I have moved the discussion about the doc2vec / word2vec IPython example here from https://github.com/piskvorky/gensim/issues/629, as suggested by @Piezoid.

ideas:

Doc2Vec on biological sequences, particularly protein primary structures. It would be more or less based on this paper: http://arxiv.org/abs/1503.05140 (I am working on this, though I'm realising that the IPython notebook might need more external libraries installed).
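A minimal sketch of one way to turn a protein sequence into "words" for such a model, assuming non-overlapping k-mers read at each of the k possible offsets. Whether this matches the paper's preprocessing exactly is an assumption; the function name `kmer_sentences` and the example sequence are made up for illustration.

```python
def kmer_sentences(sequence, k=3):
    """Split a sequence into up to k "sentences" of non-overlapping k-mers.

    Each sentence starts reading at a different offset (0 .. k-1), so every
    k-mer of the sequence appears in exactly one of the returned lists.
    """
    sequence = sequence.strip().upper()
    sentences = []
    for offset in range(k):
        # Step by k so that the k-mers within one sentence do not overlap.
        words = [sequence[i:i + k]
                 for i in range(offset, len(sequence) - k + 1, k)]
        if words:  # skip offsets that leave no complete k-mer
            sentences.append(words)
    return sentences


example = "MKTAYIAKQR"  # made-up sequence, for illustration only
for sentence in kmer_sentences(example):
    print(sentence)
```

Each returned list is a plain list of string tokens, which is the general shape that word-embedding libraries expect as a sentence.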

ddofer commented 8 years ago

I would love to see this. (I did it myself with another library as part of my research. Unfortunately, it gave very poor results.)

ddofer commented 8 years ago

I wanted to do this myself but ran into trouble parsing the protein sequences into characters in a way amenable to Gensim. I'd love to help or learn how it's done; it would be very useful for a paper I had in mind. (Full credit, of course.)

piskvorky commented 8 years ago

Sounds good :) What specific issues did you run into?

ddofer commented 8 years ago

Data munging, efficiently extracting from UniProt's file formats (FASTA or XML), using pure existing vectors, n-grams vs. unigrams, and understanding general usage of gensim, e.g. "sentence labels".
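For the FASTA parsing, unigram vs. n-gram, and labelling points, here is a rough standard-library sketch rather than a definitive recipe. It assumes UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME description`. The helper names (`read_fasta`, `accession`, `tokenize`) and the two records are hypothetical, and `TaggedSequence` is only a stand-in namedtuple with the same `words` and `tags` fields as gensim's `TaggedDocument`.

```python
import io
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which has the same two fields.
TaggedSequence = namedtuple("TaggedSequence", "words tags")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def accession(header):
    """Return the accession from a 'db|ACCESSION|NAME ...' header, if present."""
    parts = header.split("|")
    if len(parts) >= 3:
        return parts[1]
    fields = header.split()
    return fields[0] if fields else ""


def tokenize(sequence, n=1):
    """Overlapping n-grams; n=1 gives single residues (unigrams)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# Made-up records, for illustration only.
FASTA_TEXT = """>sp|DEMO01|DEMO1_HUMAN Made-up protein 1
MKTAYIAK
QRQISFVK
>sp|DEMO02|DEMO2_HUMAN Made-up protein 2
GSHMLEDP
"""

docs = [TaggedSequence(words=tokenize(seq, n=3), tags=[accession(head)])
        for head, seq in read_fasta(io.StringIO(FASTA_TEXT))]
for doc in docs:
    print(doc.tags, doc.words[:4])
```

With gensim installed, the same `words` and `tags` values could presumably be wrapped in `TaggedDocument` objects instead; the tag plays the role of the "sentence label", one per sequence.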

(I haven't used the package before, only read your blog posts prior to that, and this was a while ago.)

menshikh-iv commented 7 years ago

I closed this because the original PR was abandoned.