spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

how to best handle cross wikipedia multi-lingual string translations #306

Closed waldenn closed 4 years ago

waldenn commented 5 years ago

I am personally using a large table to do various multi-lingual lookups for things like:

The table currently looks like this (some parts like the Category and Portal translations are not yet finished):

    wp_languages: [
      // title, latin-name, script, lang2, 10^users, lang3-eng, lang3-localized, category-localized, portal-localized, voice
      ["English Wikipedia","English","Latn","en",5,"eng","eng","Category","Portal","en-GB"],
      ["Deutschsprachige Wikipedia","German","Latn","de",4,"ger","deu","Kategorie","Portal","de-DE"],
      ["Wikipédia en français","French","Latn","fr",4,"fre","fra","Catégorie","Portail","fr-FR"],
      ["Wikipedia en español","Spanish","Latn","es",4,"spa","","Categoría","Portal","es-ES"],
      ["Русская Википедия","Russian","Cyrl","ru",4,"rus","","категория","Portal","ru-RU"],
      ["ウィキペディア日本語版","Japanese","Jpan","ja",4,"jpn","","Category","Portal","ja-JP"],
      ["Nederlandstalige Wikipedia","Dutch","Latn","nl",3,"dut","nld","Categorie","Portaal","nl-NL"],
      // ...etc.
    ]

I would also like to support cross-wiki template names, e.g. "see also" is "zie ook" in Dutch. So I may need to add a column for this type of template. Knowing the name of the category namespace is useful, e.g. when you want to strip that part from the title. I have not yet researched where this info is stored in the MediaWiki codebase.

Would something like this help this project? Currently the "zie ook" template is not supported in wtf_wikipedia, even though "see also" already implements the whole parsing. So with a language lookup these templates would be supported. Ideally, the template would retain its English name ("see also") in the wtf_wikipedia API, but use the native string (after a lookup in the table) for the actual template matching.

The actual design of this info can be changed. E.g. I think having an ISO-2 code key field for lookup would probably be better.
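A rough sketch of that keyed layout, assuming the column order given in the table's header comment. The names `FIELDS`, `indexByLang`, and `stripCategoryPrefix` are made up for illustration and are not part of wtf_wikipedia:

```javascript
// Column order taken from the header comment of the table above.
// The property names themselves are hypothetical.
const FIELDS = [
  'title', 'name', 'script', 'lang2', 'usersLog10',
  'lang3Eng', 'lang3Local', 'category', 'portal', 'voice',
]

// Two sample rows copied from the table.
const rows = [
  ['English Wikipedia', 'English', 'Latn', 'en', 5, 'eng', 'eng', 'Category', 'Portal', 'en-GB'],
  ['Nederlandstalige Wikipedia', 'Dutch', 'Latn', 'nl', 3, 'dut', 'nld', 'Categorie', 'Portaal', 'nl-NL'],
]

// Turn the positional rows into objects keyed by the two-letter code.
function indexByLang(rows) {
  const out = {}
  for (const row of rows) {
    const entry = {}
    FIELDS.forEach((field, i) => { entry[field] = row[i] })
    out[entry.lang2] = entry
  }
  return out
}

// Strip the localized category namespace from a title, if present.
// Unknown languages leave the title untouched.
function stripCategoryPrefix(title, lang, table) {
  const entry = table[lang]
  if (!entry) return title
  const prefix = entry.category.toLowerCase() + ':'
  return title.toLowerCase().startsWith(prefix) ? title.slice(prefix.length) : title
}

const languages = indexByLang(rows)
console.log(stripCategoryPrefix('Categorie:Rivier in Nederland', 'nl', languages))
```

With a keyed object, switching languages is a single property lookup (`languages.nl.portal`) rather than a scan over the array.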

spencermountain commented 5 years ago

yes it would!

i really don't have a smart way for doing this right now either. I'm sure medium-term versions of this library will be split into wtf_ru_wiktionary, etc. But for now it's all kind of a mush.

The i18n data we do have is mostly here. We store some other (pretty redundant!) wiki language data in that folder too.

You're right that there's no magic json file somewhere, either. It all tends to be buried in half-complete WP:Help pages.

Ya, so .categories() works with i18n formats, but we don't have a dedicated 'see also' method (maybe we should!). How are you using Portal data? I don't think we support that right now.

TLDR: please help. ;)

spencermountain commented 5 years ago

Another often-requested thing is fetching the actual language of the page. When we fetch it from the api, we know the language, but not from some wtf('randome stringz') input. So because of this, it's hard to do 'structural' i18n parsing, in a clearer way, like you're trying to do. This is part of the reason we're just attempting all-languages-all-the-time.

waldenn commented 5 years ago

Here is a first draft of the "wp_languages" table (indexed by language code): languages.js

At the moment I don't have the time to integrate it for cross-wiki template usage, but I hope to help with this integration later. Should not be too difficult I think.

Let me know if I need to fix up the fields in some way, as I have a script to output this table from my own code.

More table info needs to be added to make it complete, but at least it's a start.

waldenn commented 5 years ago

> Another often-requested thing is fetching the actual language of the page. When we fetch it from the api, we know the language, but not from some wtf('randome stringz') input. So because of this, it's hard to do 'structural' i18n parsing, in a clearer way, like you're trying to do. This is part of the reason we're just attempting all-languages-all-the-time.

For very short strings this is almost impossible to do correctly. I tried using the franc library for this once, and it was not useful at all for short strings. There is a minimum string length needed for this to be able to work accurately.

By default I think the language should be 'en' and let the dev/user override it when needed.
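One way this could look, as a sketch only: an explicit language always wins, detection is attempted only on long-enough text, and everything else falls back to 'en'. The `resolveLang` helper, the `detect` callback, and the 50-character threshold are all assumptions made for illustration, not existing wtf_wikipedia behaviour:

```javascript
const DEFAULT_LANG = 'en'
// Arbitrary threshold, chosen only for illustration.
const MIN_DETECT_LENGTH = 50

// Pick the language to parse with.
// `detect` is an optional caller-supplied function (text -> language code),
// e.g. a thin wrapper around a language-detection library.
function resolveLang(text, explicitLang, detect) {
  if (explicitLang) return explicitLang
  if (typeof detect === 'function' && text.length >= MIN_DETECT_LENGTH) {
    return detect(text) || DEFAULT_LANG
  }
  return DEFAULT_LANG
}

console.log(resolveLang('[[Fichier:Chat.jpg]]', 'fr'))
console.log(resolveLang('randome stringz'))
```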

spencermountain commented 5 years ago

ah! I love that and I think it's a great solution. Yeah, it would need to have all of these properties - redirects, infobox, file, etc. I can help glue these together, if you wanted.

So, just thinking this through, the experience of a french user would be 'hey it's stopped finding images!' - then we say, oh, you need to add wtf(text, 'fr'), then it works.

Right now we do an unholy thing by concatenating all language terms into a big regex - (file|fichier|....) and it would be nice to have a cleaner set of things to look up. But it would, by default, fail on other languages.
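The trade-off could be sketched roughly like this. The `fileTerms` sample data and the `buildFileRegex` helper are hypothetical and only illustrate the idea; this is not how wtf_wikipedia is actually structured:

```javascript
// Hypothetical per-language terms for the file namespace (sample data only).
const fileTerms = {
  en: ['file', 'image'],
  fr: ['fichier'],
  nl: ['bestand'],
}

// Escape regex metacharacters so terms are matched literally.
function escapeRegex(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Build a matcher for the start of a [[File:...]] link.
// No language given: combine every language (the "big regex" approach).
// Language given: use only that language's terms, falling back to English
// when the code is unknown.
function buildFileRegex(lang) {
  const terms = lang
    ? (fileTerms[lang] || fileTerms.en)
    : Object.values(fileTerms).flat()
  return new RegExp('\\[\\[(' + terms.map(escapeRegex).join('|') + '):', 'i')
}

const text = '[[Fichier:Chat.jpg|thumb]]'
console.log(buildFileRegex().test(text))
console.log(buildFileRegex('fr').test(text))
console.log(buildFileRegex('en').test(text))
```

The last call is the "hey it's stopped finding images!" case: with an English-only matcher, the French link is no longer recognized until the caller passes 'fr'.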

My understanding is that everyone seems to copy, or lean heavily on, the English names for templates - Arabic wiktionary still uses {{date}} and things. Maybe other languages use that lang, AND English - i'm not sure.

urghhh

spencermountain commented 5 years ago

anyways, i think this json file would be useful to have anyway - even if the default behaviour, and the gross-regex combinations of this library don't change.

That's a clean way to model this data, for any future work going in whatever direction.

waldenn commented 5 years ago

Feel free to fix the table up as you want and insert it into git, so we can then share the work on updating the missing data in it. Thanks.

spencermountain commented 4 years ago

hey @waldenn i've added a huge number of i18n terms for redirects, categories, images, infoboxes, and disambiguations. i basically just poked around in the browser for them, or copied+pasted from wikidata 😕

that may help. I'd like to look at template-coverage in the future, and could use some help.

Template:Main for example, has a bunch of interlanguage links here - we could just add them as aliases. Let me know what you think. cheers
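The alias idea could be sketched as a small lookup that maps a localized template name back to its English one. The `templateAliases` map and `canonicalTemplateName` helper are hypothetical; only the "zie ook" entry comes from this thread, and a real table would be filled in from the interlanguage links:

```javascript
// Hypothetical alias table: localized template name -> English template name.
// Only 'zie ook' is taken from this thread; other entries would have to be
// collected from the interlanguage links of each template page.
const templateAliases = new Map([
  ['zie ook', 'see also'],
])

// Normalize a template name and resolve it to its English equivalent.
// Names without an alias are returned normalized but otherwise unchanged.
function canonicalTemplateName(name) {
  const key = name.trim().toLowerCase().replace(/_/g, ' ')
  return templateAliases.get(key) || key
}

console.log(canonicalTemplateName('Zie ook'))
console.log(canonicalTemplateName('Main'))
```

With this in front of the existing template parsers, the parser for "see also" would handle "zie ook" without any language-specific code.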

waldenn commented 4 years ago

Hey Spencer, I see you've been busy. Great work!

For my own project, however, I think I'll prefer to keep all these various strings in a more compact structure, such as the array of objects I mentioned earlier (and which I am using now). My own project requires easy switching between various languages and I feel that data structure makes that easier. Maintenance is also a bit easier, I feel, when it's just one file with an expandable data structure. I still need to set up some better tooling for merging new data into my languages file though. Anyway, feel free to follow your own approach.

Regarding templates: handling these "common templates" like "main" and "see also" would be really great. Some research on what other (important / frequently used) common templates are out there would also be very useful. Perhaps some template mapping data structure is needed here.

I also see a data issue with various standard infobox fields (e.g. the image field) that I want to detect and present better (e.g. expand to the full width of the table, or remove completely). Don't know how to best deal with that yet. Currently I just have arrays of fields that need to be transformed in some way (expand, remove, etc.).
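A minimal sketch of that arrays-of-fields approach, assuming the infobox is available as a plain key/value object. The field names in `fieldRules` and the `applyFieldRules` helper are invented for illustration:

```javascript
// Hypothetical rules: which infobox fields to drop or to show full-width.
const fieldRules = {
  remove: ['image_size', 'alt'],
  expand: ['image'],
}

// Apply the rules to a plain key/value infobox and return display rows.
function applyFieldRules(infobox, rules) {
  const out = []
  for (const [key, value] of Object.entries(infobox)) {
    if (rules.remove.includes(key)) continue // drop the field completely
    out.push({ key, value, fullWidth: rules.expand.includes(key) })
  }
  return out
}

const displayRows = applyFieldRules(
  { name: 'Rhine', image: 'Rhine.jpg', image_size: '250px', length: '1230 km' },
  fieldRules
)
console.log(displayRows)
```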

spencermountain commented 4 years ago

oh! ya, I've run into that too. This image-in-infobox thing is really sneaky. Please share anything you find. Yeah I have a New Year's resolution to run through a dozen wikipedia dumps and figure out what the template situation is, in broad terms. Will let you know if anything useful comes out of it. I'm sure there's a lot to learn. thanks

waldenn commented 4 years ago

> Yeah I have a New Year's resolution to run through a dozen wikipedia dumps and figure out what the template situation is, in broad terms.

Would be great to see the results of that! Great resolution, you deserve a 🥇

If I create a better setup for the template-fields I will let you know.