raymanrt closed this issue 13 years ago
The pagelinks.sql.gz file does not contain the surrounding text of the links, which is the statistical meat that the OpenNLP trainer will look for.
I haven't worked with Wikinews so far, but it should be quite easy to adapt the existing script to handle inter-wiki links.
Tell me if you can get it to train Italian models.
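To illustrate why the surrounding text matters, here is a minimal Python sketch (not the project's actual script) that turns a wikitext sentence into the `<START:type> ... <END>` annotation format used by the OpenNLP name finder trainer. The `annotate` function, the `entity_types` lookup, and the sample sentence are hypothetical; in practice the page-to-type mapping would come from an external source, and the text would also need sentence splitting and tokenization, which are omitted here.

```python
import re

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(wikitext, entity_types):
    """Replace typed wiki links with <START:type> ... <END> spans.

    entity_types is a hypothetical dict mapping page titles to entity
    types; links to pages without a known type keep only their text.
    """
    def replace(match):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        entity_type = entity_types.get(target)
        if entity_type is None:
            return anchor  # untyped link: keep the text, drop the markup
        return f"<START:{entity_type}> {anchor} <END>"

    return WIKI_LINK.sub(replace, wikitext)


# Hypothetical example sentence and type lookup.
sentence = "[[Umberto Eco]] è nato ad [[Alessandria]] nel [[1932]]."
types = {"Umberto Eco": "person", "Alessandria": "location"}
print(annotate(sentence, types))
```

A pagelinks row only records that one page links to another, so the words around the link (here "è nato ad") never reach the trainer.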
OK, thanks. I'll keep you updated on my experiments.
Did you consider using the wiki pagelinks table ( http://download.wikimedia.org/frwiki/20110716/frwiki-20110716-pagelinks.sql.gz ) instead of parsing the entire XML?
And what do you think about other wiki sources such as WikiNews?
I'm really interested in using your approach to build a NER system for the Italian language, (probably) based on OpenNLP.