Given a domain, try to construct a vocabulary, using dbpedia, wikipedia, wikidata and/or similar sites.

fako commented 5 years ago

The first approach is to fetch all pages under a category and treat them as a corpus of articles. A vocabulary can then be built from that corpus.
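
As a rough illustration of the last step, here is a minimal sketch of turning a corpus of article texts into a vocabulary. It is not the harvester's actual code: the names `tokenize`, `build_vocabulary` and `min_document_frequency` are made up for this example, and the tokenizer is a simple regex that assumes plain text has already been extracted from the pages.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by hyphens, so a compound like
# "ziekte-ervaring" stays a single token. This pattern is an assumption.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


def build_vocabulary(articles, min_document_frequency=1):
    """Return the words that occur in at least `min_document_frequency` articles."""
    document_frequency = Counter()
    for text in articles:
        # Count each word once per article, however often it repeats there.
        document_frequency.update(set(tokenize(text)))
    return {
        word
        for word, count in document_frequency.items()
        if count >= min_document_frequency
    }
```

Raising `min_document_frequency` would drop words that only appear in a single article, at the cost of losing rare domain terms.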

fako commented 5 years ago

The test category is https://nl.wikipedia.org/wiki/Categorie:Aardwetenschappen. The current implementation does not query subcategory pages, although it would make sense to include those. Wikipedia redirects are also not handled at all. Next up is to create a vocabulary from the input.
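
Including subcategories could look roughly like the sketch below: a breadth-first walk over the category tree with a depth limit. This is an assumption about how it might be done, not what the harvester does. `collect_pages` and `fetch_members` are hypothetical names; `fetch_members` is expected to return member dicts with `ns` and `title` keys, the shape the MediaWiki API returns for `list=categorymembers` (requesting `cmtype=page|subcat`). Redirects are not covered here; the API's `redirects` parameter on `action=query` is one way to resolve them when fetching page content.

```python
from collections import deque

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for category pages


def collect_pages(root_category, fetch_members, max_depth=2):
    """Collect page titles under a category, descending into subcategories.

    `fetch_members` is a caller-supplied function (hypothetical here) that
    takes a category title and returns its members as dicts with "ns" and
    "title" keys.
    """
    pages = set()
    seen = {root_category}  # guards against cycles in the category graph
    queue = deque([(root_category, 0)])
    while queue:
        category, depth = queue.popleft()
        for member in fetch_members(category):
            title = member["title"]
            if member["ns"] != CATEGORY_NAMESPACE:
                pages.add(title)
            elif depth < max_depth and title not in seen:
                seen.add(title)
                queue.append((title, depth + 1))
    return pages
```

With `max_depth=0` this reduces to the current behaviour of only reading pages directly under the test category.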

fako commented 5 years ago

Find the newly built vocabularies at:

fako commented 5 years ago

Comparing the Wikipedia vocabulary with the manual vocabulary, only 2 of the 29 words match, so we're still missing a lot of words. There are a number of reasons for this:

Some words occur on pages that are deeper in the category hierarchy. If we traversed the hierarchy, these words would show up.
Some words don't occur on Wikipedia in the form we use. For instance, "breinactiviteit" doesn't occur, but "hersenactiviteit" does.
Some words are education-specific, like "nakijkblad". We can't expect to find these on domain-specific content pages.
Some practical words are also not on Wikipedia, like "sondevoorschrift", although "sonde" is present.
Some words, like "angstsyndroom", exist on Wiktionary but not on Wikipedia.
Gordon Guyatt is a Canadian doctor who has an article on the English Wikipedia but not on the Dutch one, and we were looking for Dutch words.
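
To make this kind of comparison repeatable, something like the sketch below could report exact matches separately from compounds whose parts do occur in the harvested vocabulary (the "sondevoorschrift" versus "sonde" case). The function name `compare_vocabularies` and the `min_part_length` threshold are assumptions for illustration, and plain substring matching is a crude stand-in for proper compound splitting.

```python
def compare_vocabularies(manual, harvested, min_part_length=4):
    """Split the manual vocabulary into exact, partial and missing words.

    A manual word is a partial match when a harvested word of at least
    `min_part_length` characters occurs inside it.
    """
    exact = manual & harvested
    partial = {}
    for word in manual - harvested:
        parts = sorted(
            candidate
            for candidate in harvested
            if len(candidate) >= min_part_length and candidate in word
        )
        if parts:
            partial[word] = parts
    missing = manual - exact - set(partial)
    return exact, partial, missing
```

For example, with a manual vocabulary containing "sonde", "sondevoorschrift" and "nakijkblad" and a harvested vocabulary containing only "sonde", the first is an exact match, the second a partial match on "sonde", and the third is missing.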

The number of "freak words" in the manual vocabulary that we constructed is surprisingly high. As people we know what they mean, but it is going to be hard to find a good data source for them. Examples: depressiesyndroom, moleculairs, breinactiviteit, paranoidestoornis, beantwoordbare, ziekte-ervaring, hallucinatiestemmen, opvlambare, psychosesyndroom

surfedushare / pol-harvester

Given a domain, try to construct a vocabulary, using dbpedia, wikipedia, wikidata and/or similar sites. #79