Open nick-magnini opened 8 years ago
I still get null as the first token of each line.
Probably the best way to address this problem is to use https://github.com/idio/json-wikipedia
for extracting text out of the dumps. I will work on a refactor for it.
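A rough sketch of what the post-processing could look like once the dump is available as JSON. It assumes the extractor writes one JSON object per line; the field names `title` and `text` and the helper `corpus_lines` are hypothetical and would need to be matched against the extractor's real output schema. It is not the project's actual code.

```python
import json


def corpus_lines(json_lines, title_key="title", text_key="text"):
    """Yield one corpus line per article, prefixed with the page title.

    Assumption: each input line is a JSON object; `title_key` and
    `text_key` are placeholder field names, not a confirmed schema.
    """
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        page = json.loads(raw)
        title = page.get(title_key)
        text = page.get(text_key) or ""
        if not title:
            continue  # skip the page rather than emit a "null" title
        # Collapse whitespace so each article stays on a single line.
        yield f"{title} {' '.join(text.split())}"
```

Pages without a title are skipped here; writing them out with a placeholder token would be another option, depending on how the corpus is consumed downstream.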
When the Wikipedia dump is processed into a word2vec corpus, the title of each page (the first word of each line) is null, so every page starts with "null". Which part of the code handles this, and how can we change it so that each line starts with the page title instead?