jupyter / notebook

Jupyter Interactive Notebook
https://jupyter-notebook.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Wrong handling of fallback languages #5851

Open noah1510 opened 3 years ago

noah1510 commented 3 years ago

If you have fallback languages defined they are preferred over the default language. After some experimentation I noticed that for example if Japanese (ja) is somewhere in the list of languages, it is always preferred and the order is completely ignored.

The order of the LANGUAGE locale is important. The first value should be the preferred language and the following languages should be the fallback.

If there is no supported language in the locale then a warning should be displayed telling the user that their language is not supported and the output will be in English. At the moment it simply ignores the value if it is not a supported language like German (de).

At the moment the priority of the locale variables is LANGUAGE > LC_ALL > LANG. If one is not set, the next priority level is used to determine the language. But if it is set to an invalid or unsupported value, English is used even if the lower levels specify Japanese. So if LANGUAGE is de, which is not supported at the moment, English will be used and LC_ALL and LANG will be ignored instead of being used as fallbacks.

Example:

The following two commands both output the Japanese version, even though the first one should be in English. The third one will be in English even though it should be in Japanese (or German, if German were supported).

LANGUAGE=en:ja jupyter notebook --help
LANGUAGE=ja:en jupyter notebook --help
LC_ALL=ja_JP.UTF-8 LANGUAGE=de jupyter notebook --help

My system:

kevin-bates commented 3 years ago

@noah1510 - thank you for opening this issue and the great detail.

In stepping through the gettext.translation() method that notebook uses I think the crux of the issue is that notebook doesn't provide a set of English locales in notebook/i18n/en/LC_MESSAGES/. Were that the case, then you'd get the correct behavior relative to the LANGUAGE env.

That is, you'd get English translations using LANGUAGE=en:ja jupyter notebook --help and Japanese translations using LANGUAGE=ja:en jupyter notebook --help.

After some experimentation I noticed that for example if Japanese (ja) is somewhere in the list of languages, it is always preferred and the order is completely ignored.

If you run LANGUAGE=ru:zh:ja jupyter notebook --help you'll find that Russian help strings are produced. Removing ru then yields Chinese strings, etc. Only embedded 'en' entries (prior to a supported locale) are problematic since there are no en files.
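The lookup behavior described here can be reproduced in isolation. The following is a minimal sketch, assuming a throwaway locale directory that holds only ja and ru catalogs. The files are empty placeholders, which is enough because gettext.find only checks that a file exists. The helper names make_catalog_dir and first_match are made up for illustration and are not part of notebook.

```python
import gettext
import os
import tempfile


def make_catalog_dir(langs, domain="notebook"):
    """Create <tmp>/<lang>/LC_MESSAGES/<domain>.mo placeholders (illustrative helper)."""
    root = tempfile.mkdtemp()
    for lang in langs:
        msg_dir = os.path.join(root, lang, "LC_MESSAGES")
        os.makedirs(msg_dir)
        # gettext.find() only tests for existence, so an empty file is sufficient here.
        open(os.path.join(msg_dir, domain + ".mo"), "wb").close()
    return root


def first_match(root, languages, domain="notebook"):
    """Return the language directory gettext.find() selects, or None if nothing matches."""
    path = gettext.find(domain, localedir=root, languages=languages)
    if path is None:
        return None
    return os.path.relpath(path, root).split(os.sep)[0]


# Stand-in for notebook/i18n with no 'en' catalog, mirroring the situation in this issue.
root = make_catalog_dir(["ja", "ru"])

picked = {
    "en:ja": first_match(root, ["en", "ja"]),  # 'en' has no file, so it is skipped
    "ja:en": first_match(root, ["ja", "en"]),
    "ru:ja": first_match(root, ["ru", "ja"]),  # order is respected among existing catalogs
    "de": first_match(root, ["de"]),           # nothing found at all
}
print(picked)
```

The order is honored for every language that has a catalog; an entry without a catalog simply drops out of the list, which is why a leading en appears to be ignored.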

With respect to this case:

LC_ALL=ja_JP.UTF-8 LANGUAGE=de jupyter notebook --help

should not yield Japanese when German translations are not available - at least per the code in gettext.find since it looks for the first env containing a value using the order you gave (LANGUAGE > LC_ALL [> LC_MESSAGES] > LANG). So because it used the LANGUAGE value and because de has no entries, it winds up returning a NullTranslations class instance, which (I'm assuming) just returns whatever string is passed as an argument to `_()` (the returned translation class's `gettext` method), resulting in English text.
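This case can be checked outside of notebook. The following is a minimal sketch, assuming a temporary directory stands in for notebook/i18n and contains a single ja catalog with zero messages. The helpers write_empty_catalog and translation_for are hypothetical names used only for this illustration.

```python
import gettext
import os
import struct
import tempfile
from unittest import mock

DOMAIN = "notebook"


def write_empty_catalog(root, lang):
    """Write a valid .mo file with zero messages for `lang` (illustrative helper)."""
    msg_dir = os.path.join(root, lang, "LC_MESSAGES")
    os.makedirs(msg_dir)
    # GNU .mo header: magic, version, message count, two table offsets,
    # hash table size, hash table offset. Zero messages keeps it minimal.
    header = struct.pack("<7I", 0x950412DE, 0, 0, 28, 28, 0, 28)
    with open(os.path.join(msg_dir, DOMAIN + ".mo"), "wb") as fp:
        fp.write(header)


def translation_for(root, env):
    """Load translations as if the process had exactly the variables in `env`."""
    with mock.patch.dict(os.environ, env, clear=True):
        return gettext.translation(DOMAIN, localedir=root, fallback=True)


root = tempfile.mkdtemp()
write_empty_catalog(root, "ja")

only_lc_all = translation_for(root, {"LC_ALL": "ja_JP.UTF-8"})
with_language = translation_for(root, {"LANGUAGE": "de", "LC_ALL": "ja_JP.UTF-8"})

print(type(only_lc_all).__name__)          # catalog class used when only LC_ALL is set
print(type(with_language).__name__)        # class used when LANGUAGE=de takes precedence
print(with_language.gettext("Show help"))  # NullTranslations echoes the msgid
```

With only LC_ALL set, the ja catalog is loaded. Once LANGUAGE=de is present it wins the lookup, nothing is found for de, and the fallback object returns each message unchanged.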

In order to provide a message indicating that a user's language is not available we'd have to know that one of LANGUAGE, LC_ALL, LC_MESSAGES, LANG is set - which, today, is completely hidden from notebook - then determine that the NullTranslations instance is in use in order to log such a message.
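One possible shape for such a check is sketched below. It is not existing notebook code: load_with_warning, requested_languages and base_language are hypothetical helpers, the environment is passed in as a plain mapping to keep the example deterministic, and the rule that English or C/POSIX requests never warn is an assumption made for this sketch.

```python
import gettext
import tempfile

LOCALE_VARS = ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG")


def requested_languages(env):
    """Mirror the lookup order discussed above: first variable with a value wins."""
    for name in LOCALE_VARS:
        value = env.get(name)
        if value:
            return value.split(":")
    return []


def base_language(tag):
    """Return the bare language code, e.g. 'en' for 'en_US.UTF-8'."""
    for sep in ("@", ".", "_"):
        tag = tag.split(sep)[0]
    return tag.lower()


def load_with_warning(domain, localedir, env):
    """Return (translations, warning); warning is None when nothing needs reporting."""
    languages = requested_languages(env)
    trans = gettext.translation(
        domain, localedir=localedir, languages=languages, fallback=True
    )
    warning = None
    # Assumption: a request that starts with English or C/POSIX is already satisfied.
    wants_other = bool(languages) and base_language(languages[0]) not in ("en", "c", "posix")
    if wants_other and type(trans) is gettext.NullTranslations:
        warning = "No translations found for %r; output will be in English." % ":".join(languages)
    return trans, warning


empty_dir = tempfile.mkdtemp()  # stand-in locale directory with no catalogs
_, german_warning = load_with_warning("notebook", empty_dir, {"LANGUAGE": "de"})
_, english_warning = load_with_warning("notebook", empty_dir, {"LANG": "en_US.UTF-8"})
_, unset_warning = load_with_warning("notebook", empty_dir, {})
print(german_warning)
```

Reading the variables explicitly and passing languages= to gettext.translation is what makes the requested list visible to the caller; the type check then distinguishes "nothing found" from a loaded catalog.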

I really don't think it's worth our while to maintain a duplicate set of strings for en so I'm not sure how to address multiple LANGUAGE entries that embed 'en'.

Did you want to take a crack at providing a message if no desired translations are found?

noah1510 commented 3 years ago

Did you want to take a crack at providing a message if no desired translations are found?

I would like to, but at the moment I don't have the time to learn more Python and familiarize myself with a new codebase.

I really don't think it's worth our while to maintain a duplicate set of strings for en so I'm not sure how to address multiple LANGUAGE entries that embed 'en'.

Would it be possible to simply use NullTranslations if en is the requested language?

should not yield Japanese when German translations are not available - at least per the code in gettext.find since it looks for the first env containing a value using the order you gave (LANGUAGE > LC_ALL [> LC_MESSAGES] > LANG).

I didn't look at the code at all and was just talking about the behavior I expected.

Looking through the code, I noticed that changing two lines (removing the break and appending the values instead of replacing them) should result in the expected behavior for that case:

            if val:
                languages = languages + val.split(':')

instead of:

            if val:
                languages = val.split(':')
                break

kevin-bates commented 3 years ago

Would it be possible to simply use NullTranslations if en is the requested language?

That's what happens today if en was the only language in the list - just as when there are no translations for language 'xx'. The issue appears to be that the gettext implementation is only considering existing files, so when language 'xx' (which includes 'en' in this case) has no translation files (notebook.mo) that language is dropped out of the list. To preserve its presence, fairly significant changes would be required to the find and translation methods. In addition, you'd need to convey that this behavior is desired for 'en' - which would probably be akin to introducing the idea of a "default" language, etc.
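To make the "default language" idea concrete, here is a minimal sketch of one way it could work without touching gettext itself: cut the requested list at the first English entry before handing it to gettext. This is an assumption about a possible approach, not something notebook or gettext does today, and truncate_at_default and base_language are hypothetical names.

```python
def base_language(tag):
    """Return the bare language code, e.g. 'en' for 'en_US.UTF-8'."""
    for sep in ("@", ".", "_"):
        tag = tag.split(sep)[0]
    return tag.lower()


def truncate_at_default(languages, default="en"):
    """Keep only the entries that come before the first `default` language entry."""
    kept = []
    for tag in languages:
        if base_language(tag) == default:
            # The untranslated strings already are the default language,
            # so nothing after this entry should be considered.
            break
        kept.append(tag)
    return kept


print(truncate_at_default(["en", "ja"]))
print(truncate_at_default(["ja", "en"]))
print(truncate_at_default(["de", "en_US.UTF-8", "ja"]))
```

The truncated list could then be passed as languages= to gettext.translation with fallback=True. An empty list yields a NullTranslations instance, which is the English output that LANGUAGE=en:ja is expected to produce.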

Looking through the code, I noticed that changing two lines (removing the break and appending the values instead of replacing them) should result in the expected behavior for that case:

That's correct, but I think the bigger issues occur when actually processing the hierarchy of found languages and the requirement that each be associated with a .mo file.
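For reference, the difference between the two variants of that loop can be sketched as standalone functions. The variable order follows the one discussed in this thread; the function names are made up for illustration, and the environment is passed in as a plain mapping rather than read from os.environ.

```python
LOCALE_VARS = ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG")


def languages_first_set(env):
    """Variant with the break: only the first variable that has a value is used."""
    languages = []
    for name in LOCALE_VARS:
        val = env.get(name)
        if val:
            languages = val.split(":")
            break
    return languages


def languages_all_set(env):
    """Variant without the break: every set variable contributes, in priority order."""
    languages = []
    for name in LOCALE_VARS:
        val = env.get(name)
        if val:
            languages = languages + val.split(":")
    return languages


env = {"LANGUAGE": "de", "LC_ALL": "ja_JP.UTF-8"}
print(languages_first_set(env))
print(languages_all_set(env))
```

With the second variant, ja_JP.UTF-8 stays in the candidate list after de, so a later lookup could still reach the Japanese catalog.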

Looking at how these .mo files get created, they are produced during the build via pybabel (see CompileBackendTranslation) and I suppose that could be extended to use pybabel extract and pybabel init to produce en_US entries on the fly - although we'd need to understand the impact to the build relative to the likelihood users are adding multiple languages to LANGUAGE where one entry is an embedded en - and prior entries have no translations (which I would guess is rare). That said, it might be worth determining the impact of an extract/init/compile sequence for en_US to the build - since that kind of thing is not entirely time-critical and existence of en_US would likely solve these issues.

We'd still have the message to display, but even that gets much easier since you could then assume that a NullTranslations instance truly means 'no translations were found'.