surfedushare / pol-harvester

A repository that harvests different sources for content

Given a domain, try to construct a vocabulary, using dbpedia, wikipedia, wikidata and/or similar sites. #79

Closed fako closed 5 years ago

fako commented 5 years ago

The first approach will be to fetch all pages that fall under a category and treat them as a corpus of articles. A vocabulary can then be built from these articles.
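A minimal sketch of the "corpus to vocabulary" step, assuming the article texts are already available as plain strings. The names `tokenize`, `build_vocabulary` and `min_count` are made up for illustration and are not taken from the pol-harvester code base.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by hyphens, so that Dutch compounds
# such as "ziekte-ervaring" stay in one piece. Digits are ignored.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


def build_vocabulary(articles, min_count=1):
    """Count words over all articles and keep those seen at least min_count times."""
    counts = Counter()
    for text in articles:
        counts.update(tokenize(text))
    return {word: count for word, count in counts.items() if count >= min_count}
```

Raising `min_count` is one simple way to drop words that occur only once in the corpus. Whether that helps or hurts for a small category is something that would have to be checked on real data.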

fako commented 5 years ago

The test category is https://nl.wikipedia.org/wiki/Categorie:Aardwetenschappen (Dutch for "Earth sciences"). Currently it does not query subcategory pages, although including them would make sense. Wikipedia redirects are not handled at all either. The next step is to create a vocabulary from the input.
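Including subcategories could look roughly like the sketch below. It uses the MediaWiki `action=query` / `list=categorymembers` API and follows the `continue` block in the response for paging. The function names, the `max_depth` limit and the User-Agent string are assumptions for illustration, not what the harvester currently does, and redirects are still not handled here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://nl.wikipedia.org/w/api.php"
CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for category pages


def fetch_members(category, api_url=API_URL):
    """Return all direct members of a category, following API continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
        "format": "json",
    }
    members = []
    while True:
        url = api_url + "?" + urllib.parse.urlencode(params)
        # Hypothetical User-Agent; use something that identifies your own tool.
        request = urllib.request.Request(url, headers={"User-Agent": "vocabulary-sketch/0.1"})
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            return members
        params.update(data["continue"])


def collect_pages(category, fetch=fetch_members, max_depth=2):
    """Collect page titles under a category, descending into subcategories."""
    seen_categories = set()
    pages = set()
    stack = [(category, 0)]
    while stack:
        current, depth = stack.pop()
        if current in seen_categories:
            continue  # category graphs can contain cycles
        seen_categories.add(current)
        for member in fetch(current):
            if member["ns"] == CATEGORY_NAMESPACE:
                if depth < max_depth:
                    stack.append((member["title"], depth + 1))
            else:
                pages.add(member["title"])
    return pages
```

A call such as `collect_pages("Categorie:Aardwetenschappen")` would hit the live API. The depth limit is there because Wikipedia category trees tend to drift off topic quickly, so the right value is a judgment call per domain.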

fako commented 5 years ago

Find newly built vocabularies at:

fako commented 5 years ago

Comparing the Wikipedia vocabulary with the manual vocabulary, only 2 of the 29 words match. This means we're still missing a lot of words. There are a number of reasons for this:

The number of "freak words" in the manual vocabulary that we constructed is surprisingly high. As people we know what they mean, but it is going to be hard to find a good data source for them. Examples: depressiesyndroom, moleculairs, breinactiviteit, paranoidestoornis, beantwoordbare, ziekte-ervaring, hallucinatiestemmen, opvlambare, psychosesyndroom
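For reference, the kind of comparison behind the "2 out of 29" figure can be sketched as a simple set overlap. This is an assumed reconstruction, not the actual comparison code. `compare_vocabularies` is a hypothetical name, and the words "psychose" and "geologie" in the usage example are invented for illustration.

```python
def compare_vocabularies(manual, harvested):
    """Return (matched, missing, recall) of the manual words in the harvested set."""
    manual_words = {word.lower() for word in manual}
    harvested_words = {word.lower() for word in harvested}
    matched = manual_words & harvested_words
    missing = manual_words - harvested_words
    # Share of the manual vocabulary that the harvested vocabulary covers.
    recall = len(matched) / len(manual_words) if manual_words else 0.0
    return sorted(matched), sorted(missing), recall


matched, missing, recall = compare_vocabularies(
    ["depressiesyndroom", "breinactiviteit", "psychose"],
    ["Psychose", "geologie"],
)
print(matched, missing, round(recall, 2))
```

Listing the `missing` words explicitly makes it easier to spot which ones are unlikely to appear in any external source at all.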