idio / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON dump. Questions? https://gitter.im/idio-opensource/Lobby

Write namespaces tag in parallel mode #26

Closed keynmol closed 9 years ago

keynmol commented 9 years ago

Closes #25.

Write the list of namespaces to every split of the XML dump; otherwise bliki doesn't pick up the namespaces (why it wouldn't look at the tag, I have no idea).

Why is it important? Without properly recognising Wikipedia namespaces, pages like this are recognised as articles, which later skews the context vectors for most of the DBpedia IDs.
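To illustrate the idea above, here is a minimal sketch in Python, not the code from this PR. It assumes each split is a string of `<page>` elements and that the namespace list lives in a `<siteinfo>` block, as in MediaWiki dumps. The names `SITEINFO` and `with_namespaces` are hypothetical, and the `xmlns` attributes of a real dump are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened namespace header; a real dump lists many more.
SITEINFO = """<siteinfo>
  <namespaces>
    <namespace key="0" case="first-letter" />
    <namespace key="14" case="first-letter">Category</namespace>
  </namespaces>
</siteinfo>"""


def with_namespaces(split_pages: str, siteinfo: str = SITEINFO) -> str:
    """Wrap one split of <page> elements so that it carries the namespace list.

    Every split becomes a small standalone document: root element,
    namespace header, then the pages that belong to this split.
    """
    return "<mediawiki>\n" + siteinfo + "\n" + split_pages + "\n</mediawiki>\n"


if __name__ == "__main__":
    split = "<page><title>Category:Physics</title><ns>14</ns></page>"
    doc = ET.fromstring(with_namespaces(split))
    # The namespace names are now available to whatever parses this split.
    print([ns.text for ns in doc.iter("namespace")])
```

With the header repeated in every split, a parser that only sees one split can still tell that a `Category:` title is not an article.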

dav009 commented 9 years ago

Looks good. Though in the ideal case we would parse the first lines of the wiki dump and extract this automatically, I guess.
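A rough sketch of that suggestion, in Python rather than the project's own code: read the dump line by line and keep everything from `<siteinfo>` to `</siteinfo>`, stopping before the first page. The function name `read_siteinfo` is hypothetical, and the sketch assumes the opening and closing tags each sit on their own line, as they usually do at the top of a Wikipedia dump.

```python
from typing import Iterable


def read_siteinfo(lines: Iterable[str]) -> str:
    """Return the <siteinfo> block found at the start of a dump.

    Only the header is consumed; iteration stops at </siteinfo>,
    so the (very large) rest of the dump is never read.
    """
    collected = []
    inside = False
    for line in lines:
        if "<siteinfo>" in line:
            inside = True
        if inside:
            collected.append(line)
            if "</siteinfo>" in line:
                return "".join(collected)
    raise ValueError("no complete <siteinfo> block found")


if __name__ == "__main__":
    import io

    sample = io.StringIO(
        "<mediawiki>\n"
        "  <siteinfo>\n"
        "    <namespaces>\n"
        '      <namespace key="14">Category</namespace>\n'
        "    </namespaces>\n"
        "  </siteinfo>\n"
        "  <page><title>Physics</title></page>\n"
        "</mediawiki>\n"
    )
    print(read_siteinfo(sample))
```

The extracted block could then be written to every split in place of a hard-coded namespace list.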

dav009 commented 9 years ago

go for it speedracer ;)