[ch68108]
Currently we pass Wikipedia modules through the pipeline as articles. Wikipedia modules contain source code used for Wikipedia internals. These pages can be quite long, and we can't extract anything from them.
Module pages also take a long time to run through snowballtokenizer and the spotter.
In this PR:
Bliki already has a method for checking whether a page is a Wikipedia module: isModule
This PR tags those pages as module so we can skip them later down the pipeline.
Example module page: https://simple.wikipedia.org/wiki/Module:ISO_639/data
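
For reviewers, the tag-then-skip idea described in this PR can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `PageType`, `Page`, `tag`, `pagesToProcess`, and `looksLikeModule` are hypothetical names, and `looksLikeModule` is only a stand-in for Bliki's isModule check, assuming module titles carry the `Module:` namespace prefix as in the example URL.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ModuleTaggingSketch {
    // Hypothetical page kinds; the pipeline's real type names are not shown in this PR.
    enum PageType { ARTICLE, MODULE }

    // Hypothetical minimal page holder used only for this illustration.
    static final class Page {
        final String title;
        final PageType type;

        Page(String title, PageType type) {
            this.title = title;
            this.type = type;
        }
    }

    // Stand-in for Bliki's isModule check (assumption: "Module:" title prefix).
    static boolean looksLikeModule(String title) {
        return title.startsWith("Module:");
    }

    // Tagging step: mark module pages at ingestion time.
    static Page tag(String title) {
        return new Page(title, looksLikeModule(title) ? PageType.MODULE : PageType.ARTICLE);
    }

    // Later pipeline step: drop tagged modules before tokenizing and spotting.
    static List<Page> pagesToProcess(List<Page> pages) {
        return pages.stream()
                .filter(p -> p.type != PageType.MODULE)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Page> tagged = Arrays.asList(
                tag("Module:ISO_639/data"), // title taken from the example module page
                tag("Alan Turing"));        // hypothetical ordinary article title
        for (Page p : pagesToProcess(tagged)) {
            System.out.println(p.title);
        }
    }
}
```

Tagging at ingestion and filtering later keeps the module check in one place, so the expensive stages never need to inspect page titles themselves.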