idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Question about English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram #16

Closed zhq2009 closed 8 years ago

zhq2009 commented 8 years ago

We are trying to use the DBpedia vectors available at https://github.com/idio/wiki2vec#prebuilt-models English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram

Would you mind letting us know whether the vectors include multi-word entities (e.g. Barack_Obama) or cover only single words? Thanks.

dav009 commented 8 years ago

The relevant information offered by this package is the set of vectors generated from Wikipedia annotations. A Wikipedia annotation is a link that a user adds to a Wikipedia article referring to another article (e.g. a link to Barack Obama in the article about US politics).

In that sense the file mentioned in the readme includes:

The vectors for single words differ from the vectors for Wikipedia entities in that an occurrence of DBPEDIA_ID/Barack_Obama is counted every time an annotation pointing to Barack_Obama is found in a Wikipedia text, regardless of its anchor (e.g. the anchor could have been: B. Obama or Barack O. or Barry Obama, President of the USA).
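To illustrate the idea, here is a minimal sketch of how differently anchored links can collapse into one entity token. It is not the actual wiki2vec preprocessing: the `[[Target|anchor]]` link syntax, the `WIKILINK` pattern, and the `replace_links` helper are assumptions made for this example, and it simply drops the anchor text.

```python
import re

# Assumed input format: MediaWiki-style links, [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def replace_links(text, prefix="DBPEDIA_ID/"):
    """Replace every link with a single entity token built from its target.

    The anchor text is ignored, so all links to the same article
    produce the same token (hypothetical helper, for illustration only).
    """
    def to_token(match):
        target = match.group(1).strip().replace(" ", "_")
        return prefix + target

    return WIKILINK.sub(to_token, text)


samples = [
    "[[Barack Obama|B. Obama]] spoke today.",
    "[[Barack Obama|Barry Obama]] spoke today.",
    "[[Barack Obama]] spoke today.",
]

# All three sentences end up with the same entity token.
tokens = [replace_links(s) for s in samples]
for line in tokens:
    print(line)
```

Because every anchor maps to the same token, word2vec sees one vocabulary entry for the entity no matter how the link was worded.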

zhq2009 commented 8 years ago

Thank you for your help.

We are trying to use the DBpedia vectors available at https://github.com/idio/wiki2vec#prebuilt-models English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram

How can we view the vectors of a few multi-word entities (or at least the beginning of each vector)? For example: Barack_obama; White_house; Artificial_Intelligence; Computer_science; Natural_language_processing and so on.

We tried to open en.model directly in Ubuntu and got the error message "Unknown file type". If we run "cat en.model" in the terminal, we only get unreadable binary output. Is there a way to open en.model so that we can see the DBpedia vectors?

dav009 commented 8 years ago

Yes, those are gensim models. You have to use Python and gensim to load them. Check the gensim word2vec tutorial.
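As a rough sketch of what that could look like, the snippet below loads the model with gensim's `Word2Vec.load` and previews the first few values of an entity vector. The helper names (`load_vectors`, `entity_key`, `preview`) are hypothetical, and the path `en.model` and the `DBPEDIA_ID/` key prefix are assumptions taken from this thread. A model saved in 2015 may not load in a recent gensim release, so an older gensim version might be needed.

```python
def load_vectors(model_path):
    """Load a gensim word2vec model and return an object indexable by token."""
    # Imported lazily so the helpers below can be used without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec.load(model_path)
    # Newer gensim versions expose the vectors as model.wv;
    # very old versions index the model object directly.
    return getattr(model, "wv", model)


def entity_key(title, prefix="DBPEDIA_ID/"):
    """Build the assumed vocabulary key for a Wikipedia article title."""
    return prefix + title.replace(" ", "_")


def preview(vectors, key, n=5):
    """Return the first n values of the vector for key, or None if it is missing."""
    try:
        vec = vectors[key]
    except KeyError:
        return None
    return [float(x) for x in vec[:n]]
```

Usage would be along the lines of `vectors = load_vectors("en.model")` followed by `print(preview(vectors, entity_key("Barack Obama")))`. A result of `None` means the key is not in the vocabulary, which may just be a matter of capitalization (e.g. `Barack_obama` vs `Barack_Obama`).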