ogrisel / pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

Wikilinks #5

Closed raymanrt closed 13 years ago

raymanrt commented 13 years ago

Did you consider using the wiki pagelinks table ( http://download.wikimedia.org/frwiki/20110716/frwiki-20110716-pagelinks.sql.gz ) instead of parsing the entire XML?

And what do you think about other wiki sources such as WikiNews?

I'm really interested in using your approach to build a NER system for the Italian language, probably based on OpenNLP.

ogrisel commented 13 years ago

The pagelinks.sql.gz file does not contain the text surrounding the links, which is the statistical meat that the OpenNLP trainer will look for.
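To make the difference concrete, here is a rough Python sketch (not the Pig code from this repository) of the kind of information that only the raw markup in the XML dump provides: each link together with the text around it. The function name `links_with_context`, the `window` parameter, and the sample sentence are made up for illustration, and the regular expression only handles simple `[[Target]]` and `[[Target|anchor]]` links.

```python
import re

# Matches [[Target]] and [[Target|anchor text]] in MediaWiki markup.
# Simplified on purpose: nested links and templates are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def links_with_context(wikitext, window=30):
    """Return (target, anchor, left_context, right_context) for each link.

    Hypothetical helper for illustration only. The contexts are raw
    character windows, so they may still contain wiki markup.
    """
    results = []
    for m in WIKILINK.finditer(wikitext):
        target = m.group(1).strip()
        # A link without a pipe uses the target title as its anchor text.
        anchor = (m.group(2) or target).strip()
        left = wikitext[max(0, m.start() - window):m.start()]
        right = wikitext[m.end():m.end() + window]
        results.append((target, anchor, left, right))
    return results


# Made-up sample sentence in French wiki markup.
sample = "Le physicien [[Albert Einstein|Einstein]] est né à [[Ulm]] en 1879."

for target, anchor, left, right in links_with_context(sample):
    print(repr(left), anchor, "=>", target, repr(right))
```

A table of link pairs alone would tell you that the page links to "Albert Einstein" and "Ulm", but not which words appear next to each mention, and those neighbouring words are what a NER trainer learns from.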

I haven't worked with Wikinews so far, but it should be quite easy to adapt the existing script to handle inter-wiki links.

Tell me if you can get it to train Italian models.

raymanrt commented 13 years ago

OK, thanks. I'll keep you updated on my experiments.