The locale now has a list of language-dependent NEs, which are filtered out when processing a given language. These NEs correspond to namespaces such as User (Benutzer in German, Usuario in Spanish, and so on). The NEs come from the Wikipedia API.
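A minimal sketch of that filtering, assuming the namespace names arrive in the shape of the `namespaces` block of a MediaWiki `siteinfo` response (legacy JSON format, where the local name sits under the `*` key). The sample data and the helpers `namespace_names` and `is_namespaced` are hypothetical and are not taken from the script in this PR.

```python
# Hypothetical sample, shaped like the "namespaces" part of a MediaWiki
# siteinfo response for the German Wikipedia (trimmed by hand).
SAMPLE_NAMESPACES = {
    "0": {"id": 0, "*": ""},
    "2": {"id": 2, "*": "Benutzer", "canonical": "User"},
    "14": {"id": 14, "*": "Kategorie", "canonical": "Category"},
}


def namespace_names(namespaces):
    """Collect local and canonical names, skipping the unnamed main namespace."""
    names = set()
    for entry in namespaces.values():
        for key in ("*", "canonical"):
            name = entry.get(key)
            if name:
                names.add(name.lower())
    return names


def is_namespaced(title, names):
    """True if the title starts with a known namespace prefix, e.g. 'Benutzer:Foo'."""
    prefix, sep, _rest = title.partition(":")
    return bool(sep) and prefix.strip().lower() in names


names = namespace_names(SAMPLE_NAMESPACES)
titles = ["Benutzer:Beispiel", "Berlin", "Kategorie:Stadt", "Star Wars: Episode I"]
# Keep only titles that do not live in a known namespace.
articles = [t for t in titles if not is_namespaced(t, names)]
```

Matching on the exact prefix before the first colon, rather than on any colon, keeps ordinary titles that happen to contain a colon.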
There is a way to check whether a page is a disambiguation page using a "Wikipedia Directive": every disambiguation article has something like {{Disambiguation}} somewhere in its content (the exact name changes across languages). This directive is not included in the XML dump, so we are a bit blind at the moment.
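If the raw wikitext were available, the directive check could look roughly like the sketch below. The per-language template names in `DISAMBIGUATION_TEMPLATES` and the function `is_disambiguation` are assumptions for illustration only; a real list would have to come from the locale.

```python
import re

# Assumed per-language template names, for illustration only.
DISAMBIGUATION_TEMPLATES = {
    "en": ["Disambiguation", "Disambig"],
    "de": ["Begriffsklärung"],
}


def is_disambiguation(wikitext, lang):
    """True if the wikitext contains a known disambiguation template for lang."""
    for name in DISAMBIGUATION_TEMPLATES.get(lang, []):
        # Match {{Name}} or {{Name|...}}, ignoring case and inner whitespace.
        pattern = r"\{\{\s*" + re.escape(name) + r"\s*(\||\}\})"
        if re.search(pattern, wikitext, flags=re.IGNORECASE):
            return True
    return False
```

For example, `is_disambiguation("Mercury may refer to:\n{{Disambiguation}}", "en")` returns `True`, while a language with no configured templates always yields `False`.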
For the given languages:
Checked Redirects
Checked NEs
Checked images
Checked Categories
Checked Basic Disambiguation
Checked for links in other languages
Checked for potential empty links
Looked for weird UTF-8 problems (characters being rendered as ??)
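The last check in the list above can be approximated with a small heuristic such as the following sketch. The function `find_encoding_problems` is hypothetical and not part of this PR; a literal `??` can also occur in legitimate text, so hits still need a manual look.

```python
def find_encoding_problems(text):
    """Return sorted (offset, snippet) pairs for likely decoding damage."""
    # U+FFFD is the Unicode replacement character; "??" is a common
    # sign of characters lost during a bad encode/decode round trip.
    suspects = ("\ufffd", "??")
    hits = []
    for marker in suspects:
        start = text.find(marker)
        while start != -1:
            snippet = text[max(0, start - 10):start + len(marker) + 10]
            hits.append((start, snippet))
            start = text.find(marker, start + len(marker))
    return sorted(hits)
```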
The script now uses the locale from the dict generated by @keynmol from the Java file. The disambiguation keywords for some languages, e.g. FR, look strange. I will leave the disambiguation keywords issue as it is for now and probably address it in a more sensible way in: https://github.com/idio/json-wikipedia/issues/31
Following @Lugrin's suggestion, I changed the way in which the namespace header is looked up.
Connects to idio/json-wikipedia#28
Connects to idio/content-services#629
In general:
Added a script which generates the locales for potentially any language.
python localegen.py --lang de --o german.locale
Please be aware that such a locale will not contain the following fields: list and disambiguation.
disambiguation: keywords collected here
list: ... but there is a directive for them (same as for disambiguation).
... list or disambiguation is very weak. I added: https://github.com/idio/json-wikipedia/issues/31.