idio / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON dump. Questions? https://gitter.im/idio-opensource/Lobby

annotating articles with Module type #63

Closed dav009 closed 5 years ago

dav009 commented 5 years ago

[ch68108] Currently we are passing Wikipedia modules through as articles. Wikipedia modules contain source code used for Wikipedia internals. Those pages can be quite long, and we can't extract anything useful from them.

Module pages take a very long time to run through snowballtokenizer and the spotter.

In this PR:

example module page: https://simple.wikipedia.org/wiki/Module:ISO_639/data
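
A rough sketch of the idea, not the code in this PR: tag each page with a type so downstream steps can skip modules before tokenizing. It assumes module pages can be recognised by a `Module:` title prefix, as in the example above. The names `PageType`, `annotate_page_type`, and `should_process`, and the `title`/`type` dictionary keys, are hypothetical and not taken from json-wikipedia.

```python
from enum import Enum


class PageType(Enum):
    """Hypothetical page types; the real project may use different names."""
    ARTICLE = "ARTICLE"
    MODULE = "MODULE"


# Assumption: module pages are identified by their "Module:" title prefix.
MODULE_PREFIX = "Module:"


def annotate_page_type(page: dict) -> dict:
    """Return a copy of the page with a 'type' field added."""
    title = page.get("title", "")
    if title.startswith(MODULE_PREFIX):
        page_type = PageType.MODULE
    else:
        page_type = PageType.ARTICLE
    return {**page, "type": page_type.value}


def should_process(page: dict) -> bool:
    """Pages tagged as modules are skipped by expensive downstream steps."""
    return page.get("type") != PageType.MODULE.value
```

With the type recorded on the page, a consumer could filter on it before running the tokenizer or spotter, rather than discovering the page is a module only after processing it.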