idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

null as title of the article #12

Open nick-magnini opened 8 years ago

nick-magnini commented 8 years ago

When Wikipedia is processed into a word2vec corpus, the title of each page (the first word of each line) is null, so every page starts with "null..". Which part of the code handles this, and how can we change it so that the page title appears instead?

nick-magnini commented 8 years ago

I still get null as the first token of each line ....

dav009 commented 8 years ago

Probably the best way to address this problem is to use https://github.com/idio/json-wikipedia for extracting text out of the dumps. I will work on a refactor for it.
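
To gauge how much of a generated corpus is affected, here is a minimal Python sketch. It assumes the corpus is plain text with one article per line, as described in the report above. The function name `count_null_prefixed` is hypothetical and not part of wiki2vec, and the fused form `nullSomeTitle` mentioned in the comments is only a guess at how the prefix might appear.

```python
from typing import Iterable, Tuple


def count_null_prefixed(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (affected, total) over non-empty lines.

    A line counts as affected when it starts with the literal text "null".
    """
    affected = 0
    total = 0
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        total += 1
        # Assumption: the missing title shows up as the prefix "null",
        # either as its own token or fused with the next one
        # (hypothetical example: "nullSomeTitle ...").
        if stripped.startswith("null"):
            affected += 1
    return affected, total
```

Passing an open file handle (for example `open("corpus.txt", encoding="utf-8")`, where the path is a placeholder) works because file objects iterate line by line. Note that a legitimate line beginning with a word such as "nullable" would also be counted, so the result is an upper bound.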