marcusklang / wikiforia

A Utility Library for Wikipedia dumps
GNU General Public License v2.0

Missing words and bad hyphenation in French #4

Open aadant opened 9 years ago

aadant commented 9 years ago

java -jar target/wikiforia-1.2.1.jar --pages ../frwiki-20150602-pages-articles-multistream.xml.bz2 -lang fr -o xml

You can interrupt the run after a couple of minutes, since the issue already shows up in the first pages.

Example : Amsterdam, id = 245

Le est considéré comme l'âge d'or d'Amsterdam car elle devient à cette époque la ville la plus riche du monde.

should be

Le XVIIe siècle est considéré comme l'âge d'or d'Amsterdam car elle devient à cette époque la ville la plus riche du monde

LAndalousie should be L'Andalousie

marcusklang commented 9 years ago

Sorry for the late response. This is a problem with unsupported template expansion.

The raw wikimarkup for the text that is rendered incorrectly is:

Le {{s|XVII|e}} est considéré comme l'[[âge d'or]] d'Amsterdam car elle devient à cette époque la ville la plus riche du monde

This uses the template "s". The French edition uses templates for common formatting far more frequently than, e.g., the English and Swedish editions.
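To illustrate what is being lost, here is a minimal sketch of expanding this one template by hand. It is not how wikiforia works: the class name `CenturyTemplateSketch` and the fixed "numeral + suffix + siècle" rule are assumptions made for this example only, based on the expected output quoted in the report. The real French "s" template does more than this.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of wikiforia.
final class CenturyTemplateSketch {
    // Matches {{s|<numeral>|<suffix>}}, e.g. {{s|XVII|e}}.
    private static final Pattern S_TEMPLATE =
            Pattern.compile("\\{\\{s\\|([^|{}]+)\\|([^|{}]+)\\}\\}");

    // Matches [[target]] and [[target|label]]; group 1 is the visible text.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(?:[^\\]|]*\\|)?([^\\]|]*)\\]\\]");

    // Assumption: the template renders as numeral + suffix + " siècle".
    static String toPlainText(String wikitext) {
        String expanded = S_TEMPLATE.matcher(wikitext).replaceAll("$1$2 siècle");
        return LINK.matcher(expanded).replaceAll("$1");
    }

    public static void main(String[] args) {
        String raw = "Le {{s|XVII|e}} est considéré comme l'[[âge d'or]] d'Amsterdam";
        System.out.println(toPlainText(raw));
    }
}
```

Hard-coding individual templates like this does not scale to the whole French edition, which is why general template expansion is needed.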

I plan to implement template expansion using a fast disk-based hashmap, but performance will depend on how much memory is available for caching, and it will require two passes over the data.
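A rough sketch of the two-pass idea, under stated assumptions: an in-memory `HashMap` stands in for the disk-based store, template pages are recognised by the `Modèle:` title prefix, and only flat calls with positional parameters (`{{{1}}}`, `{{{2}}}`, ...) are handled. The class `TwoPassExpansionSketch`, its methods, and the toy template body are all hypothetical and are not wikiforia's API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of wikiforia.
final class TwoPassExpansionSketch {
    // Assumption: template pages carry this title prefix.
    static final String TEMPLATE_PREFIX = "Modèle:";

    // Matches a flat (non-nested) call such as {{name|arg1|arg2}}.
    static final Pattern CALL =
            Pattern.compile("\\{\\{([^{}|]+)((?:\\|[^{}|]*)*)\\}\\}");

    // Pass 1: collect template bodies keyed by template name.
    static Map<String, String> collectTemplates(Map<String, String> pages) {
        Map<String, String> templates = new HashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            if (page.getKey().startsWith(TEMPLATE_PREFIX)) {
                String name = page.getKey().substring(TEMPLATE_PREFIX.length());
                templates.put(name, page.getValue());
            }
        }
        return templates;
    }

    // Pass 2: replace each known call with its body, filling {{{1}}}, {{{2}}}, ...
    static String expand(String text, Map<String, String> templates) {
        Matcher m = CALL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = templates.get(m.group(1).trim());
            String replacement = m.group(); // unknown templates are left untouched
            if (body != null) {
                String rawArgs = m.group(2);
                String[] args = rawArgs.isEmpty()
                        ? new String[0]
                        : rawArgs.substring(1).split("\\|", -1);
                replacement = body;
                for (int i = 0; i < args.length; i++) {
                    replacement = replacement.replace("{{{" + (i + 1) + "}}}", args[i]);
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("Modèle:s", "{{{1}}}{{{2}}} siècle"); // toy body, not the real template
        pages.put("Amsterdam", "Le {{s|XVII|e}} est considéré comme l'âge d'or");
        Map<String, String> templates = collectTemplates(pages);
        System.out.println(expand(pages.get("Amsterdam"), templates));
    }
}
```

The first pass is needed because a template page can appear anywhere in the dump, including after the articles that use it. Nested calls, named parameters, and parser functions are deliberately left out of this sketch.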

I cannot give you a timeline for when this feature will be included other than that it is on the TODO list and is considered highly important.

aadant commented 9 years ago

Thank you for your feedback. It might be a Sweble issue. I will raise a separate issue for the missing hyphen in Andalousie.

aadant commented 9 years ago

Hey Marcus, I was looking at this project: https://github.com/attardi/wikiextractor/issues/32#issuecomment-136178794

It looks like you will also need to support Modules (and Lua!). Fortunately, there are Java implementations of Lua, so it can still be pure Java.