ogrisel / pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

Resolve the redirect links from DBpedia #3

Closed ogrisel closed 13 years ago

ogrisel commented 13 years ago

Many links are currently dropped because we do not follow the redirect map when it is available. For instance, all links to "China" should redirect to "People's Republic of China"; because they are not followed, the generated NER corpus lacks the statistical clues needed to detect "China" as an entity of type "Place".

To resolve this, one needs to introduce a new COGROUP operation with a conditional GENERATE statement, since not all links have a redirect.

ogrisel commented 13 years ago

Fixed using LEFT OUTER joins and a conditional expression.
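
For readers unfamiliar with the pattern, here is a minimal Python sketch of the logic a left outer join followed by a conditional expression amounts to. It is not the Pig script from the repository: the function name `resolve_redirects`, the tuple layout, and the sample data are assumptions made for illustration only.

```python
def resolve_redirects(links, redirects):
    """Rewrite link targets through a redirect map, keeping unmatched links.

    links     -- list of (source_page, target) pairs (hypothetical layout)
    redirects -- dict mapping a redirecting name to its canonical name
    """
    resolved = []
    for source, target in links:
        # Left outer join: every link is kept, and the redirect side is
        # None when the target has no entry in the redirect map.
        canonical = redirects.get(target)
        # Conditional expression: use the canonical name when the join
        # matched, otherwise fall back to the original target.
        resolved.append((source, target if canonical is None else canonical))
    return resolved


# Illustrative data only, based on the "China" example in this issue.
links = [("Beijing", "China"), ("Paris", "France")]
redirects = {"China": "People's Republic of China"}
resolved = resolve_redirects(links, redirects)
```

With this sample input, the link to "China" is rewritten to "People's Republic of China", while the link to "France", which has no redirect, is kept unchanged instead of being dropped.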