idio / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON dump. Questions? https://gitter.im/idio-opensource/Lobby

Improve detection of disambiguation pages #31

Closed. dav009 closed this issue 8 years ago

dav009 commented 8 years ago

Currently, an article is typed as a disambiguation page if any of the disambiguation keywords specified in the locale file appear in the article's title.

For example, the German locale is expected to contain the disambiguation keyword Begriffsklärung, so that the article Löwe_(Begriffsklärung) is marked as a disambiguation. However, this does not apply to all disambiguation pages, e.g. https://de.wikipedia.org/wiki/Loewe, whose title does not contain the keyword.
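To make the limitation concrete, here is a minimal Python sketch of title-based detection. It is not the project's actual code: the `DISAMBIGUATION_KEYWORDS` table and the function name `is_disambiguation_by_title` are hypothetical stand-ins for the locale files, and the case-insensitive match is an assumption.

```python
# Hypothetical keyword table; in the project these values come from the locale files.
DISAMBIGUATION_KEYWORDS = {
    "de": ["Begriffsklärung"],
}


def is_disambiguation_by_title(title, lang):
    """Return True if any locale keyword occurs in the article title.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    lowered = title.lower()
    keywords = DISAMBIGUATION_KEYWORDS.get(lang, [])
    return any(keyword.lower() in lowered for keyword in keywords)


# The keyword is part of this title, so the page is detected.
print(is_disambiguation_by_title("Löwe_(Begriffsklärung)", "de"))

# "Loewe" is a disambiguation page too, but its title carries no keyword,
# so a title-only check misses it.
print(is_disambiguation_by_title("Loewe", "de"))
```

The second call is the failure case described above: nothing in the title alone distinguishes the page from a regular article.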

There is a Wikipedia directive which is language dependent. In the German case it is {{Begriffsklärung}}, and it is consistent across all disambiguation pages. However, this directive is not included in the Wikipedia XML dump.

dav009 commented 8 years ago

worth checking this for proper disambiguation detection:

http://stackoverflow.com/questions/34316979/wikipedia-xml-dump-where-to-get-translations-for-disambiguation-directives

dav009 commented 8 years ago

The XML dumps include the template, not the magic word:

e.g. the eswiki XML consistently includes {{desambiguación}}.
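Since the template appears in the dumped wikitext, detection could look at the article body instead of the title. The following is a rough Python sketch under that assumption, not the project's implementation: `DISAMBIGUATION_TEMPLATES`, `is_disambiguation_by_template`, and the sample wikitext are all made up for illustration, and the per-language template names would need to be collected separately.

```python
import re

# Hypothetical per-language template names; only the two mentioned in this thread.
DISAMBIGUATION_TEMPLATES = {
    "de": ["Begriffsklärung"],
    "es": ["desambiguación"],
}


def is_disambiguation_by_template(wikitext, lang):
    """Return True if the wikitext transcludes a known disambiguation template.

    The template name must be followed by '|' (parameters) or '}}' (end of
    template), so longer names that merely start with the same word do not match.
    """
    for name in DISAMBIGUATION_TEMPLATES.get(lang, []):
        pattern = r"\{\{\s*" + re.escape(name) + r"\s*(\||\}\})"
        if re.search(pattern, wikitext, flags=re.IGNORECASE):
            return True
    return False


# Made-up wikitext samples, for illustration only.
sample_de = "'''Loewe''' steht für:\n* ...\n\n{{Begriffsklärung}}"
sample_es = "'''León''' puede referirse a:\n* ...\n\n{{Desambiguación}}"

print(is_disambiguation_by_template(sample_de, "de"))
print(is_disambiguation_by_template(sample_es, "es"))
```

With this approach a page like Loewe would be detected regardless of its title, provided the template is present in the dumped text as observed for eswiki.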