Pseudolocalization - Githubissues

MartinCerny-awin commented 6 years ago

Feature Request: I would like pseudolocalization to be done automatically when extracting messages. Pseudolocalization simplifies localization testing. When I am localizing our app, I might not speak the languages it is being localized into, since the translation is done by a professional translator. At the same time, I need to verify that all strings are extracted. In pseudolocalization, the characters in each string are replaced by special look-alike characters, so the text stays readable and is easy to test.

Example: Account Settings: [!!! Àççôûñţ Šéţţîñĝš !!!]
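As an illustration of the example above, a character-substitution pseudolocalizer can be sketched in a few lines. The function name `pseudolocalize`, the `ACCENTS` table, and the `[!!! ... !!!]` wrapper are assumptions chosen to reproduce the example string; they are not lingui's API, and the table covers only the letters that example needs.

```javascript
// Hypothetical look-alike table; covers only the letters in "Account Settings".
const ACCENTS = {
  A: "À", S: "Š",
  c: "ç", e: "é", g: "ĝ", i: "î", n: "ñ",
  o: "ô", s: "š", t: "ţ", u: "û",
};

// Replace each mapped character and wrap the result in visible markers, so
// strings that bypass extraction stand out as plain, unbracketed text.
function pseudolocalize(text) {
  const body = Array.from(text, (ch) => ACCENTS[ch] ?? ch).join("");
  return `[!!! ${body} !!!]`;
}

console.log(pseudolocalize("Account Settings"));
// [!!! Àççôûñţ Šéţţîñĝš !!!]
```

Characters without an entry in the table (digits, punctuation, unmapped letters) pass through unchanged.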

Describe the solution you'd like: A CLI command with an appropriate parameter would create pseudolocalized translations from the default messages. These would be stored under a special locale.

Describe alternatives you've considered: Pseudolocalization can be run after the extraction.

Additional context: https://en.wikipedia.org/wiki/Pseudolocalization

I might be able to implement this functionality, if there is any probability of merging it.

tricoder42 commented 6 years ago

Hey @MartinCerny-awin, sounds reasonable.

What are you actually trying to achieve? Test that your app can handle special characters? Test the UI with messages that might be longer in some languages? I've only briefly checked the wiki...

I'm happy to merge/review the PR! I remember seeing this feature in Transifex but never understood the use case.

MartinCerny-awin commented 6 years ago

Mostly, I want to verify that all the strings in my app are localizable and extracted. It also helps with testing special characters and longer text, but my app is being translated into Chinese and it is actually shorter.

HamidAghdaee commented 6 years ago

I would very much like this feature as well! Good one.

MartinCerny-awin commented 6 years ago

I have started thinking about pseudolocalization (PL) and I would like to discuss with you the options for implementing it.

1) Locale - The PL locale would have to pass the test for a valid locale: https://github.com/lingui/js-lingui/blob/master/packages/cli/src/api/locales.js#L15-L22 This would allow adding the PL locale and extracting messages for it. There are two options:
a) Hardcoded locale - we can hardcode the locale for PL in the code and check directly against this hardcoded locale. I would suggest using pseudo-locale as the code.
b) Configuration file - add PL to the configuration file. We would check whether the locale for PL is set and treat a newly added locale, or a folder with that name, as valid when it equals the PL option.

2) Pseudolocalization - the actual transformation could happen during one of two CLI commands:
a) Compilation - This would give us the option to change the text that gets transformed, in a language we understand. The text could be changed by adding a translation in messages.json. If no translation is added, the default text would be transformed. The same is currently done for files with missing translations.
b) Extraction - This would transform the text immediately during extraction. The translation in messages.json would be filled with the PL text. This would give us the option to change the translation in PL.

tricoder42 commented 6 years ago

a) Hardcoded locale - we can hardcode the locale for PL in the code and check directly against this hardcoded locale. I would suggest using pseudo-locale as the code.

Locale must be valid BCP-47 and there must be plurals defined for it. We could use en-pseudo or even *-pseudo (for any language).

On the other hand, Android uses xa for PL based on English and xb for right-to-left PL (it basically reverses messages). What do you think about xa and xb? I'm afraid that we need to make an exception for plurals, which won't be defined for these locales, but we could simply load plural rules for the base language (English in default case).
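The plural-rule fallback suggested here could look roughly like the sketch below. It uses the standard `Intl.PluralRules` API purely for illustration; this thread does not say how lingui loads plural data, and the `PSEUDO_BASE` table and `pluralRulesFor` helper are hypothetical names.

```javascript
// Hypothetical mapping from a pseudolocale code to the language it is based on.
const PSEUDO_BASE = { xa: "en", xb: "en" };

// Return plural rules for a locale, borrowing the base language's rules
// when the locale is a pseudolocale with no plural data of its own.
function pluralRulesFor(locale) {
  const base = PSEUDO_BASE[locale] ?? locale;
  return new Intl.PluralRules(base);
}

const rules = pluralRulesFor("xa");
console.log(rules.select(1)); // "one"
console.log(rules.select(5)); // "other"
```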

b) Configuration file - add PL to the configuration file. We would check whether the locale for PL is set and treat a newly added locale, or a folder with that name, as valid when it equals the PL option.

This would be great. Something like:

{
  "lingui": {
    "pseudolocale": "en"
  }
}

Which would generate xa locale based on English.

2) Pseudolocalization - the actual transformation could happen during one of two CLI commands:

Hmm, I can't decide, because both options seem to be useful! Which do you feel is the best option?

tricoder42 commented 6 years ago

Actually, if we do pseudolocalization at compile time, we can simply enable it in i18n or I18nProvider, without any message catalogs in filesystem or additional configuration.

This would enable PL based on English:

<I18nProvider language="en" pseudoLocalization>
   <App />
</I18nProvider>

This would work only in development.

What do you think?

sedubois commented 6 years ago

Going further, could the pseudolocalization be done with a cloud translation API such as Watson?

MartinCerny-awin commented 6 years ago

@tricoder42 I think that it is better to have PL in production as well. Sometimes it could be an acceptance criterion, and it gets tested in production behind feature toggles.

I would not strictly require that PL be generated from English. The best would probably be to use the default language. What I meant is that the configuration would specify which locale code is generated for PL; for example, we could specify

{
  "lingui": {
    "pseudolocale": "xa-PL"
  }
}

This would generate PL in xa-PL.

The best would probably be to generate it during compilation. We could change the default message in a language we understand and it would be transformed later.

@sedubois I do not have experience with the Watson API. What would be the benefit over using a Node library, for example this one: https://github.com/bunkat/pseudoloc ?

tricoder42 commented 6 years ago

@MartinCerny-awin Fair enough.

Regarding the locale code, what about the other way round: en-XA would be the LTR pseudolocale based on English, and en-XB would be the RTL one. I'm a bit worried about using PL as a country code, but I've just briefly checked the BCP-47 locales and the only conflict is pl-PL (Poland).

Or we could simply use xa or xb and it would be generated either from sourceLocale or fallbackLocale.

MartinCerny-awin commented 6 years ago

We do not have to strictly specify the country code. It was just an example. It would be up to the user to choose the language code and country code. She could specify anything as the pseudolocale option.

tricoder42 commented 6 years ago

Makes sense! Sorry for overengineering :)

Well, then it's settled:

1) Add pseudoLocale to the config.
2) lingui extract will create the directory for pseudoLocale automatically (no need to run lingui add-locale).
3) lingui compile generates pseudo-localised strings using pseudoloc. The only tricky part is wrapping HTML tags with a delimiter and then stripping the delimiters from the pseudo-localised strings, so that variables and rich text work as well.

What do you think?
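For illustration, the compile-time part of this plan might be sketched as follows. The config and catalog shapes, the `compileCatalog` helper, and the vowel-only `pseudolocalize` stand-in are all assumptions made for this sketch; lingui's real catalog format and the pseudoloc library are not reproduced here.

```javascript
// Hypothetical config and catalog shapes; the real lingui formats may differ.
const config = { pseudoLocale: "pseudo" };
const catalog = {
  "Account Settings": { translation: "" },
  "Sign out": { translation: "Log out" },
};

// Stand-in transform that accents lowercase vowels only.
const VOWELS = { a: "à", e: "é", i: "î", o: "ô", u: "û" };
const pseudolocalize = (text) => text.replace(/[aeiou]/g, (v) => VOWELS[v]);

// Compile one locale. An explicit translation wins; a missing one falls back
// to the message id. Only the configured pseudolocale is transformed.
function compileCatalog(locale, catalog, config) {
  const compiled = {};
  for (const [id, entry] of Object.entries(catalog)) {
    const text = entry.translation || id;
    compiled[id] = locale === config.pseudoLocale ? pseudolocalize(text) : text;
  }
  return compiled;
}

console.log(compileCatalog("pseudo", catalog, config));
```

Because the transform runs at compile time, editing the `translation` field of the pseudolocale catalog changes the source text in a language the developer understands, and the accented form is regenerated on the next compile.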

MartinCerny-awin commented 6 years ago

Yes, that sounds good. Just to confirm: the pseudoLocale code could be anything, even if it is not valid BCP-47, am I right?

For the delimiter, I do not think that we have to wrap delimiters. We can set pseudoloc's startDelimiter to < and its endDelimiter to />. This would leave variables and HTML tags untransformed.

sedubois commented 6 years ago

@MartinCerny-awin I was just thinking that by using a machine translation service, the translated text would be even closer to the actual translation. Nowadays machine translation has become really good, at least between English and French. Google, for instance, has improved their deep learning algorithms a lot. This machine-translated version could then serve as a base for the translator.

It would be quite amazing if with just one ‘lingui extract’ command the whole app was entirely machine-translated.

Apps like https://github.com/OpenNewsLabs/autoEdit_2 (for video captioning) use a variety of machine translation APIs.

Anyway, please discard if it’s considered out of scope, just wanted to share what it made me think about ...

tricoder42 commented 6 years ago

@MartinCerny-awin I think start/end delimiters should be { and }.

Consider the message Hello {name}: only Hello should be pseudolocalised. I think $$$$Hello$$$$ $$$${name}$$$$ should be passed to pseudoloc and then $$ (or any better delimiter) should be stripped away.

Actually, thinking about it, plurals are tricky!

In {value, plural, one {# book} other {# books}}, only book and books should be pseudolocalised, so it's not as easy as setting start/end delimiters.
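One way to handle both messages above, sketched under simplifying assumptions, is to track brace nesting: in a well-formed message, text at an even nesting depth is literal, while text at an odd depth is argument syntax such as `name` or `value, plural, one`. The helper below is hypothetical and is not how lingui or pseudoloc is known to work; it ignores apostrophe escaping, and its small `ACCENTS` table covers only the letters in these examples.

```javascript
// Hypothetical look-alike table for the letters used in the examples.
const ACCENTS = { H: "Ĥ", e: "é", l: "ļ", o: "ô", b: "ƀ", k: "ķ", s: "š" };

// Transform only literal text: characters at an even `{ }` nesting depth
// that are not inside a `<...>` tag. Braces and angle brackets are kept.
function pseudolocalizeMessage(message) {
  let depth = 0;      // current `{ }` nesting level
  let inTag = false;  // true while between `<` and `>`
  let out = "";
  for (const ch of message) {
    if (ch === "{") depth += 1;
    else if (ch === "}") depth -= 1;
    else if (ch === "<") inTag = true;
    else if (ch === ">") inTag = false;
    else if (depth % 2 === 0 && !inTag) {
      out += ACCENTS[ch] ?? ch;
      continue;
    }
    out += ch;
  }
  return out;
}

console.log(pseudolocalizeMessage("Hello {name}"));
// Ĥéļļô {name}
console.log(pseudolocalizeMessage("{value, plural, one {# book} other {# books}}"));
// {value, plural, one {# ƀôôķ} other {# ƀôôķš}}
```

The `#` placeholder survives because it has no entry in the table, and the plural keywords survive because they sit at depth 1.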

@sedubois That sounds good! Let's finish this first and then think about machine translations

tricoder42 commented 6 years ago

Released in v2.7.0. Thanks @MartinCerny-awin!

lingui / js-lingui

Pseudolocalization #296