idio / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON dump. Questions? https://gitter.im/idio-opensource/Lobby

Tagging module pages #64

Closed: dav009 closed this 5 years ago

dav009 commented 5 years ago

[ch68108] Currently we are passing Wikipedia modules through as articles. Wikipedia modules contain source code used for Wikipedia internals. Those pages can be quite long, and we can't extract anything useful from them.

Module pages take a long time to run through the snowballtokenizer and the spotter.

In this PR:

Example module page: https://simple.wikipedia.org/wiki/Module:ISO_639/data
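
To illustrate the idea of tagging module pages, here is a minimal Python sketch that classifies a page by its title prefix. It is not the code from this PR: the function name `tag_page_type` and the tag values `"module"` and `"article"` are hypothetical, and it assumes the page title carries the `Module:` prefix as in the example link.

```python
def tag_page_type(title: str) -> str:
    """Return a coarse page-type tag from the title's namespace prefix.

    Hypothetical helper for illustration; the tag names are assumptions.
    """
    prefix, sep, _rest = title.partition(":")
    # Only treat the title as a module page when a colon is present
    # and the text before it is exactly "Module".
    if sep and prefix == "Module":
        return "module"
    return "article"


if __name__ == "__main__":
    for title in ("Module:ISO_639/data", "ISO 639"):
        print(title, "->", tag_page_type(title))
```

With a tag like this in place, a downstream step could skip tokenizing and spotting for anything not tagged as an article.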

tgalery commented 5 years ago

Just out of curiosity, is there any other category from Bliki that we might not be using?

dav009 commented 5 years ago

MediaWiki seems like another namespace that we could try to sack off, for example: MediaWiki:Gadget-morebits.js
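
One possible way to cover several namespaces at once is to key off the numeric `<ns>` value that the Wikipedia XML dump stores for each page, rather than the title prefix. The Python sketch below assumes a simplified `<page>` element without the XML namespace that real dumps declare on the root; the function name `tag_from_page_xml` and the tag strings are hypothetical. The ids used are MediaWiki's standard ones (0 for articles, 8 for MediaWiki, 828 for Module).

```python
import xml.etree.ElementTree as ET

# Standard MediaWiki namespace ids; the tag strings are illustrative only.
NON_ARTICLE_NAMESPACES = {8: "mediawiki", 828: "module"}


def tag_from_page_xml(page_xml: str) -> str:
    """Tag a single <page> element by its numeric <ns> value."""
    page = ET.fromstring(page_xml)
    # Assume the main (article) namespace when <ns> is missing.
    ns = int(page.findtext("ns", default="0"))
    if ns == 0:
        return "article"
    # Unlisted namespaces (talk pages, templates, ...) fall back to "other".
    return NON_ARTICLE_NAMESPACES.get(ns, "other")


if __name__ == "__main__":
    sample = "<page><title>MediaWiki:Gadget-morebits.js</title><ns>8</ns></page>"
    print(tag_from_page_xml(sample))
```

Adding another namespace to skip would then only mean adding one entry to the mapping.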