Utilities for creating Word2Vec vectors for Dbpedia Entities via a Wikipedia Dump.
With the release of Word2Vec, the Google team also released vectors for Freebase entities trained on Wikipedia. These vectors are useful for a variety of tasks.

This tool lets you generate the same kind of vectors. Instead of mids, entities are addressed via Dbpedia IDs, which correspond to Wikipedia article titles. Vectors are generated for (i) words appearing inside Wikipedia articles and (ii) topics, i.e. `dbpedia/Barack_Obama`.
You can download one of the prebuilt word2vec models via torrent:

- English Wikipedia (Feb 2015), 1000 dimensions, no stemming, 10 skipgram
- German Wikipedia (Feb 2015), 300 dimensions, no stemming, 10 cbow
```
pip install gensim
tar -xvf model.tar.gz
```

```python
from gensim.models import Word2Vec
model = Word2Vec.load("path/to/word2vec/en.model")
model.similarity('woman', 'man')
```
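Beyond plain word similarity you can also query the topic vectors. The snippet below is a minimal sketch; the exact entity token (here `DbpediaID/Barack_Obama`, matching the corpus format described later) depends on how the model was built, so adjust it to whatever appears in your model's vocabulary:

```python
# Hypothetical entity queries; the token prefix must match your model's vocabulary.
model.most_similar('DbpediaID/Barack_Obama', topn=5)      # nearest neighbours of the topic
model.similarity('DbpediaID/Barack_Obama', 'president')   # word-vs-topic similarity
```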
The automated script sets up and runs everything on Ubuntu 14.04. For other platforms, check Going the long way.
Run `sudo sh prepare.sh <Locale> PathToOutputFolder`, i.e.:

- `sudo sh prepare.sh es_ES /mnt/data/` will work on the Spanish Wikipedia
- `sudo sh prepare.sh en_US /mnt/data/` will work on the English Wikipedia
- `sudo sh prepare.sh da_DA /mnt/data/` will work on the Danish Wikipedia

Running `prepare` will leave a `language.corpus` file in `outputFolder`.
This corpus can be fed to any word2vec tool to generate vectors. Once you get `language.corpus`, go to `resources/gensim` and do:

```
wiki2vec.sh pathToCorpus pathToOutputFile <MIN_WORD_COUNT> <VECTOR_SIZE> <WINDOW_SIZE>
```

This will install all required dependencies for Gensim and build word2vec vectors, i.e.:

```
wiki2vec.sh corpus output/model.w2c 50 500 10
```
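If you would rather call Gensim from Python than use `wiki2vec.sh`, the corpus can be trained on directly. This is a minimal sketch, assuming `language.corpus` holds one whitespace-tokenized article per line; the parameters mirror `<MIN_WORD_COUNT>`, `<VECTOR_SIZE>` and `<WINDOW_SIZE>` above:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus: one article per line, tokens separated by whitespace.
sentences = LineSentence("path/to/language.corpus")

# vector_size/window/min_count correspond to VECTOR_SIZE, WINDOW_SIZE and MIN_WORD_COUNT.
# Note: Gensim versions before 4.0 call the first parameter `size` instead of `vector_size`.
model = Word2Vec(sentences, vector_size=500, window=10, min_count=50, workers=4)
model.save("output/model.w2c")
```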
The `prepare.sh` and `wiki2vec.sh` scripts install their respective dependencies.

To build the wiki2vec jar, make sure `JAVA_HOME` is pointing to Java 7 and run:

```
sbt assembly
```
Wikipedia dumps are stored in XML format, which is difficult to process in parallel because the XML file has to be streamed and articles extracted on the go. A Readable Wikipedia dump is a transformation of the dump that makes it easy to pipe into tools such as Spark or Hadoop.
Every line in a Readable Wikipedia dump follows the format:

```
Dbpedia Title <tab> Article's Text
```
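Because each article sits on a single tab-separated line, the file can be consumed without an XML parser. A minimal sketch in plain Python, assuming a file named `readableWikipedia.lines`:

```python
# Iterate over a Readable Wikipedia dump: one "title<TAB>text" pair per line.
with open("readableWikipedia.lines", encoding="utf-8") as dump:
    for line in dump:
        title, _, text = line.rstrip("\n").partition("\t")
        # process a single article here, e.g. tokenize `text`
```

The same one-record-per-line property is what lets Spark or Hadoop split the file across workers.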
The class `org.idio.wikipedia.dumps.ReadableWiki` takes a `multistreaming-xml.bz2` Wikipedia dump and outputs a Readable Wikipedia. Params: the path to the Wikipedia dump and the path to the output Readable Wikipedia, i.e.:

```
java -Xmx10G -Xms10G -cp wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.ReadableWiki path-to-wiki-dump/eswiki-20150105-pages-articles-multistream.xml.bz2 pathTo/output/ReadableWikipedia
```
Creates a tokenized corpus which can be fed into tools such as Gensim to create Word2Vec vectors for Dbpedia entities. Every Wikipedia link in an article's text is replaced by a token of the form `DbpediaId/DbpediaIDToLink`, i.e. if an article's text contains:

```
[[ Barack Obama | B.O ]] is the president of [[USA]]
```

it is transformed into:

```
DbpediaID/Barack_Obama B.O is the president of DbpediaID/USA
```
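The following is a rough Python sketch of that substitution, not the project's Spark/Scala implementation; it assumes well-formed `[[Target | anchor]]` links and ignores nested or malformed markup:

```python
import re

# Matches [[Target]] or [[ Target | anchor ]]
WIKILINK = re.compile(r"\[\[\s*([^|\]]+?)\s*(?:\|\s*([^\]]+?)\s*)?\]\]")

def replace_wikilinks(text):
    """Replace [[ Target | anchor ]] with 'DbpediaID/Target anchor'."""
    def _sub(match):
        target, anchor = match.group(1), match.group(2)
        dbpedia_id = "DbpediaID/" + target.replace(" ", "_")
        return dbpedia_id + (" " + anchor if anchor else "")
    return WIKILINK.sub(_sub, text)

print(replace_wikilinks("[[ Barack Obama | B.O ]] is the president of [[USA]]"))
# -> DbpediaID/Barack_Obama B.O is the president of DbpediaID/USA
```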
Feed the Readable Wikipedia, a redirects file, and an output path to the Spark job:

```
bin/spark-submit --master local[*] --executor-memory 1g --class "org.idio.wikipedia.word2vec.Word2VecCorpus" target/scala-2.10/wiki2vec-assembly-1.0.jar /PathToYourReadableWiki/readableWiki.lines /Path/To/RedirectsFile /PathToOut/Word2vecReadyWikipediaCorpus
```
By default the word2vec corpus is always stemmed. If you don't want that to happen:

- Pass `None` as an extra argument to `prepare.sh`:

  ```
  sudo sh prepare.sh es_ES /mnt/data/ None
  ```

  will work on the Spanish Wikipedia and won't stem words.

- Pass `None` as an extra argument when calling Spark:

  ```
  bin/spark-submit --class "org.idio.wikipedia.word2vec.Word2VecCorpus" target/scala-2.10/wiki2vec-assembly-1.0.jar /PathToYourReadableWiki/readableWiki.lines /Path/To/RedirectsFile /PathToOut/Word2vecReadyWikipediaCorpus None
  ```
Note: the number of dimensions multiplied by the vocabulary size has to be less than a certain value, otherwise an exception is thrown (see the related Gensim issue).
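A quick back-of-the-envelope check before training can save a failed run. The sketch below does not restate the limit itself; it just computes the product and an approximate memory footprint, using an assumed vocabulary size of 2,000,000 and 4 bytes per float32 weight:

```python
# Rough sanity check on the size of the embedding matrix Gensim will allocate.
vocabulary_size = 2_000_000   # hypothetical vocabulary after MIN_WORD_COUNT filtering
dimensions = 500              # VECTOR_SIZE passed to wiki2vec.sh

elements = vocabulary_size * dimensions
memory_gib = elements * 4 / 1024 ** 3  # float32 weights, 4 bytes each

print(f"{elements:,} elements, ~{memory_gib:.1f} GiB for the embedding matrix alone")
```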