fako closed 5 years ago
The test category is https://nl.wikipedia.org/wiki/Categorie:Aardwetenschappen. Currently the query does not include pages from subcategories, although it would make sense to include those. Wikipedia redirects are not handled at all either. The next step is to create a vocabulary from the input.
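As a rough illustration of how subcategory pages could be included, here is a minimal sketch against the MediaWiki `categorymembers` query. It is not the code used in this project. The names `fetch_members`, `collect_pages`, `max_depth` and the User-Agent string are made up for this example, and redirects are still not resolved here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://nl.wikipedia.org/w/api.php"
CATEGORY_NS = 14  # MediaWiki namespace number for category pages


def fetch_members(category):
    """Return all direct members (pages and subcategories) of one category."""
    members = []
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(params)
        # Hypothetical User-Agent value, replace with something identifying your tool.
        request = urllib.request.Request(url, headers={"User-Agent": "vocabulary-sketch/0.1"})
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            return members
        # Pass the continuation tokens back to get the next batch.
        params.update(data["continue"])


def collect_pages(category, fetch=fetch_members, max_depth=2):
    """Collect article titles under a category, descending into subcategories."""
    pages, seen = set(), set()
    queue = [(category, 0)]
    while queue:
        current, depth = queue.pop()
        if current in seen:  # category graphs can contain cycles
            continue
        seen.add(current)
        for member in fetch(current):
            if member["ns"] == CATEGORY_NS:
                if depth < max_depth:
                    queue.append((member["title"], depth + 1))
            else:
                pages.add(member["title"])
    return pages
```

Calling `collect_pages("Categorie:Aardwetenschappen")` would hit the live API. The depth limit is an assumption to keep the crawl bounded, since Wikipedia category trees can get very large and drift off topic.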
When comparing the Wikipedia vocabulary with the manual vocabulary, only 2 of the 29 words match. This means we're still missing a lot of words. There are a number of reasons for this:
The number of "freak words" in the manual vocabulary that we constructed is surprisingly high. As people we know what they mean, but it is going to be hard to find a good data source for them. Examples: depressiesyndroom, moleculairs, breinactiviteit, paranoidestoornis, beantwoordbare, ziekte-ervaring, hallucinatiestemmen, opvlambare, psychosesyndroom
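The comparison between the Wikipedia vocabulary and the manual vocabulary can be sketched as a set intersection. This is only an assumption about how the match count was obtained. The function name `vocabulary_overlap` is hypothetical, and the case-insensitive matching is a choice made for this example.

```python
def vocabulary_overlap(candidate, reference):
    """Return (matched, missing) words of `reference` relative to `candidate`."""
    candidate_words = {word.casefold() for word in candidate}
    reference_words = {word.casefold() for word in reference}
    matched = candidate_words & reference_words
    missing = reference_words - candidate_words
    return matched, missing
```

The `missing` set is the useful part for debugging: it lists the manual words that the data source did not cover, which is where the "freak words" would show up.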
The first approach is to get all pages under a category and treat them as a corpus of articles. A vocabulary can then be built from these articles.
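To make the step from corpus to vocabulary concrete, here is a small sketch that tokenizes article texts and keeps every word seen often enough. It assumes the article texts are already available as plain strings. The names `build_vocabulary`, `WORD` and `min_count` are illustrative, and the tokenization rule (letters only, internal hyphens allowed) is an assumption rather than the project's actual rule.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by single hyphens (e.g. "ziekte-ervaring").
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def build_vocabulary(articles, min_count=1):
    """Build a set of lowercased words that occur at least `min_count` times."""
    counts = Counter()
    for text in articles:
        counts.update(WORD.findall(text.casefold()))
    return {word for word, count in counts.items() if count >= min_count}
```

Raising `min_count` would filter out one-off tokens such as typos, at the cost of also dropping rare but valid words.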