ogrisel / pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

NER corpus: add special treatment for first sentence of article #4

Open ogrisel opened 13 years ago

ogrisel commented 13 years ago

In the vast majority of Wikipedia articles, the noun phrase at the beginning of the article is the name of the entity described by the article itself, even though there is no self-referencing link on it. As a result, the NER corpus scripts miss many potentially informative entity mentions, which might hurt the performance of the trained OpenNLP models.
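
One possible way to recover these mentions is sketched below in Python. It is only an illustration under stated assumptions, not the approach used by the pignlproc scripts: the function name `find_self_mention`, the `max_offset` parameter, and the rule for stripping a parenthesized disambiguation suffix from the title are all hypothetical. The idea is to look for the article title near the start of the first sentence and, when it is found, return character offsets that could be annotated as a link to the article's own URI.

```python
import re
from typing import Optional, Tuple


def find_self_mention(
    title: str, first_sentence: str, max_offset: int = 10
) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of the article title in its first sentence.

    Hypothetical helper: returns None when the title does not occur
    within `max_offset` characters of the start of the sentence.
    """
    # Drop a trailing disambiguation suffix such as " (mythology)".
    name = re.sub(r"\s*\([^)]*\)\s*$", "", title).strip()
    if not name:
        return None
    # Literal, case-insensitive search for the cleaned title.
    match = re.search(re.escape(name), first_sentence, flags=re.IGNORECASE)
    # Only accept mentions close to the beginning of the sentence, since
    # the heuristic is about the leading noun phrase.
    if match is None or match.start() > max_offset:
        return None
    return match.start(), match.end()


span = find_self_mention("Paris (mythology)", "Paris is the son of Priam.")
print(span)
```

An exact title match like this would miss first sentences that expand the name (for example, a full legal name where the title is a short form), so a real implementation would probably need fuzzier matching or a chunker to find the leading noun phrase.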