Reintroduce pseudo-localization

zbraniecki commented 6 years ago

Coming back from the Unicode Conference, there was a lot of chatter about pseudo-locales.

Fluent had pretty good support for pseudo-locales in the past, and thanks to our client-side mode we can offer an exciting approach to pseudo-locales: runtime pseudolocalization.

I'd like to bring this back to modern Fluent: https://github.com/l20n/l20n.js/blob/v3.x/src/lib/pseudo.js

@stasm - do you have any thoughts on how you would like it to work?

zbraniecki commented 6 years ago

Maybe just introducing some generic "post-processing" on messages in MessageContext, and then adding fluent-pseudolocale would work as the first step?

let ctx = new MessageContext(['ar'], {
  process: fluent_pseudolocales.transform.bind('ar-XB')
});
let msg = ctx.formatValue('l10n-id');

stasm commented 6 years ago

> Maybe just introducing some generic "post-processing" on messages in MessageContext, and then adding fluent-pseudolocale would work as the first step?

That was my first thought as well. A few additional thoughts below. I'll try to have answers tomorrow.

We have to take into account how this will interact with the language negotiation. Would we expect the user to set their requested locale to a pseudolocale in order to enable it? Would we require that developers add pseudolocales to the list of available locales in their app?

Perhaps it would make sense to encode pseudolocales as Unicode extensions to BCP47? Something like ab-CD-u-pseudo-accent or ab-CD-u-pseudo-rtl. The language negotiation process would then still correctly pick the regular ab-CD for fetching translation resources. Some logic would then be responsible for transforming the fetched resource using the fluent-pseudo module.
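To make the idea above concrete, here is a minimal sketch of splitting such a tag into the part used for fetching resources and the part that selects a transform. The `-u-pseudo-<strategy>` syntax is only the hypothetical form floated in this comment, not a registered Unicode extension, and `splitPseudoTag` is an illustrative name.

```javascript
// Hypothetical tag syntax from the comment above: "<base>-u-pseudo-<strategy>".
// This is not a registered Unicode extension; it is used for illustration only.
const PSEUDO_SUFFIX = /-u-pseudo-([a-z]+)$/i;

function splitPseudoTag(tag) {
  const match = PSEUDO_SUFFIX.exec(tag);
  if (match === null) {
    // A regular tag: nothing to strip, no pseudo strategy requested.
    return { base: tag, strategy: null };
  }
  return {
    base: tag.slice(0, match.index), // what language negotiation would see
    strategy: match[1].toLowerCase(), // what the pseudo module would see
  };
}

const { base, strategy } = splitPseudoTag("ab-CD-u-pseudo-accent");
console.log(base, strategy); // ab-CD accent
```

With this split, negotiation and IO keep working on `ab-CD`, and only the extracted strategy decides whether a transform is applied.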

What should be the outcome of formatting a date or a number in a pseudolocalized translation?

Also, the first step might be to only support build-time pseudolocalization.

zbraniecki commented 6 years ago

Google just went for en-XA, and ar-XB and added them to CLDR. So we can get internationalized date/time from CLDR 31 if we use those two.

Now, my problem with this is that because they used en-XA and not fr-XA, the numbers still look the same. I recommended them fr-XA, but it may be too late.

Since we'd be doing runtime pseudo, maybe we don't need extensions (and they wouldn't be Unicode extensions, but rather variants; Google originally used en-psaccent and ar-psaccentrtl or something like that).

Maybe all we need is:

let ctx = new MessageContext('pl', {
  process: pseudo
});

and it'll transform Polish strings? This way we could get a pseudo of the current locale, regardless of what it is.

stasm commented 6 years ago

How would you decide when to turn pseudolocalization on? A different logic independent of the language negotiation?

zbraniecki commented 6 years ago

> How would you decide when to turn pseudolocalization on? A different logic independent of the language negotiation?

Yeah! This way the user can either detect pseudo from a langtag (oh, you're using XA region?) or by some checkbox (show me pseudolocale). Since we're on client-side at runtime, that would mean no rebuilding, restarting or anything. Just take the exact locale we use, whatever it is, and recompute for pseudo.

stasm commented 6 years ago

> Yeah! This way the user can either detect pseudo from a langtag (oh, you're using XA region?) or by some checkbox.

If the user sets their requested locale to en-XA and the available locales only include en-US, the result of the language negotiation will be en-US. At the moment when we’d create the MessageContext we wouldn’t know the region was XA.
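A small sketch of why the XA region gets lost. The `negotiate` function below is a deliberately simplified stand-in written for this example (exact match, then language-only match); it is not the negotiation algorithm Fluent actually uses.

```javascript
// Simplified, hypothetical negotiation: try an exact match first, then fall
// back to any available locale with the same language subtag.
function negotiate(requested, available, defaultLocale) {
  const result = [];
  for (const tag of requested) {
    const language = tag.split("-")[0].toLowerCase();
    const match =
      available.find((a) => a.toLowerCase() === tag.toLowerCase()) ??
      available.find((a) => a.split("-")[0].toLowerCase() === language);
    if (match !== undefined && !result.includes(match)) {
      result.push(match);
    }
  }
  // The default locale is always the last fallback.
  if (!result.includes(defaultLocale)) {
    result.push(defaultLocale);
  }
  return result;
}

const negotiated = negotiate(["en-XA"], ["en-US", "de"], "en-US");
console.log(negotiated); // the requested XA region is no longer visible here
```

Only the negotiated list reaches the code that constructs the context, so any pseudolocale signal carried by the region has to be captured before negotiation or passed separately.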

On the other hand, if we add en-XA to the list of available locales, and the files for it do not exist on disk, we will fail to fetch anything. We’d need to extend the IO logic to fetch them. This might mean moving the pseudolocalization to fluent-web. I think it would be better to have it on a lower level though.

Does the language negotiation preserve extensions found on the requested locales?

zbraniecki commented 6 years ago

Oh, you're right.

> I think it would be better to have it on a lower level though.

I agree.

> Does the language negotiation preserve extensions found on the requested locales?

yes.

So, maybe a private-use extension, e.g. fr-FR-x-pseudo? We just need to make sure that if we see a pseudo like this, we actually feed en-XA or ar-XB to the Intl API (so that CLDR picks it up).
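A rough sketch of that collapsing step, assuming the `-x-pseudo` suffix from this comment. The `strategy` argument and the `resolvePseudo` name are hypothetical, since the tag alone does not say whether the accented or the bidi pseudolocale is wanted.

```javascript
// Assumption: "-x-pseudo" marks a pseudolocale, and a separate strategy value
// ("accent" or "bidi") selects which CLDR pseudolocale the Intl API should get.
function resolvePseudo(tag, strategy = "accent") {
  const suffix = "-x-pseudo";
  if (!tag.toLowerCase().endsWith(suffix)) {
    return { resources: tag, intl: tag, pseudo: false };
  }
  return {
    resources: tag.slice(0, -suffix.length), // locale whose files are fetched
    intl: strategy === "bidi" ? "ar-XB" : "en-XA", // locale handed to Intl
    pseudo: true,
  };
}

const resolved = resolvePseudo("fr-FR-x-pseudo");
// Whether the runtime has data for en-XA depends on its ICU/CLDR build;
// if it does not, Intl falls back to another locale.
const formatter = new Intl.DateTimeFormat(resolved.intl, { weekday: "long" });
console.log(resolved.resources, formatter.resolvedOptions().locale);
```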

stasm commented 6 years ago

There's discussion in http://unicode.org/cldr/trac/ticket/3971 and http://unicode.org/cldr/trac/ticket/9819 on why CLDR didn't go for variant tags. It's mostly about compatibility with existing code. Also, since en-XA and ar-XB are now in CLDR we should stick to these codes. I wish we hadn't missed the discussion when it happened.

I'm reconsidering my stance on where this logic should live. Having it higher up, e.g. in fluent-web would allow multiple approaches:

- build-time transformation,
- AST transformation right after IO,
- string transformation after ctx.format().

fluent-web can also transform translations in a way which preserves HTML for the overlay mechanic.

Pike commented 6 years ago

I'd think that the most accurate way to implement the actual pseudolocalization would be on the AST?

stasm commented 6 years ago

At build time or at runtime?

Pike commented 6 years ago

For both, I guess.

stasm commented 6 years ago

Transforming the runtime AST means doing the transformation inside of MessageContext. That still might be a viable option given my earlier comments: fluent-web could supply a markup-aware transform function to the MessageContext constructor. Compared to transforming the result of ctx.format(), this would have the advantage of only transforming TextElements in the translation rather than the whole string.
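The difference between the two approaches can be sketched with a toy pattern. The array-of-parts shape and the `formatPattern` helper below are assumptions made for this example and do not reflect Fluent's actual runtime AST; `shout` stands in for a real pseudo transform.

```javascript
// Hypothetical pattern shape: strings are text elements, objects are placeables.
const pattern = ["Hello, ", { variable: "userName" }, "!"];

// Stand-in for a pseudolocalization transform.
const shout = (text) => text.toUpperCase();

function formatPattern(parts, args, transform = (text) => text) {
  return parts
    .map((part) =>
      // Only text elements go through the transform; placeable values
      // (user data, formatted numbers, dates) are inserted untouched.
      typeof part === "string" ? transform(part) : String(args[part.variable])
    )
    .join("");
}

const perElement = formatPattern(pattern, { userName: "Anna" }, shout);
const wholeString = shout(formatPattern(pattern, { userName: "Anna" }));
console.log(perElement); // HELLO, Anna!
console.log(wholeString); // HELLO, ANNA!
```

Transforming the formatted string also rewrites the interpolated value, which is usually not what a pseudolocale is meant to test.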

We still need to solve the problem of fetching valid locale files. Given that it looks like fluent-web (or fluent-react) would need to handle the pseudolocalization anyways (if only to be HTML-aware), I think it makes sense to special-case en-XA and ar-XB in their IO.

For example, given the following result of language negotiation:

requested: en-XA, de
available: en-XA, en-US, de
default: en-US
negotiated: en-XA, de, en-US

…a developer using fluent-react will need to add a special case to the IO code which fetches en-US when en-XA is requested. This sounds okay to me since the same developer has already put en-XA among the available locales.

let ctx = new MessageContext(negotiated, { pseudo: makeAccent });
ctx.addMessages( /* en-US translations to be transformed into en-XA */);

Or, if the build pipeline is capable of building pseudolocales up front, the IO code would simply fetch the pre-made en-XA files.

let ctx = new MessageContext(negotiated);
ctx.addMessages( /* en-XA translations generated at build time */);
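The IO special case described above could look roughly like the following. `PSEUDO_SOURCES` and `planFetch` are hypothetical names, and using en-US as the source for ar-XB is an assumption; the comment only spells out the en-XA case.

```javascript
// Hypothetical table: which real locale to fetch for each pseudolocale, and
// which transform strategy to apply afterwards.
const PSEUDO_SOURCES = new Map([
  ["en-XA", { source: "en-US", strategy: "accent" }],
  ["ar-XB", { source: "en-US", strategy: "bidi" }], // assumed source locale
]);

// `prebuilt` lists pseudolocales that the build pipeline already generated,
// so their files can be fetched directly without a runtime transform.
function planFetch(locale, prebuilt = new Set()) {
  const pseudo = PSEUDO_SOURCES.get(locale);
  if (pseudo === undefined || prebuilt.has(locale)) {
    return { fetch: locale, strategy: null };
  }
  return { fetch: pseudo.source, strategy: pseudo.strategy };
}

const plan = ["en-XA", "de", "en-US"].map((locale) => planFetch(locale));
console.log(plan);
```

A non-null `strategy` would correspond to the first snippet above (pass a transform to the constructor), and a null one to the second (plain constructor, files used as fetched).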

zbraniecki commented 6 years ago

I do not agree with Stas that we have to use en-XA and ar-XB here. I believe it's perfectly fine for us to use whatever mechanism we want to use to recognize pseudolocales, and then just make sure to collapse in Intl constructor onto en-XA and ar-XB for Intl API / CLDR.

stasm commented 6 years ago

Note that the approach from my previous comment will work with any scheme of specifying pseudolocales. In my example I chose to put en-XA in requested but it could also be an app-specific pref which handles that. This is also how I understood your comment from 5 days ago.

There's value in using en-XA, ar-XB now that they were standardized in the CLDR. They will become recognizable names for pseudolocales and with time will gain support in various tools and platforms.

zbraniecki commented 6 years ago

@stasm - would you have time to draft a plan to get this into a POC state? I'm happy to commit to work on that, but would prefer to follow your vision.

zbraniecki commented 6 years ago

Some POC prototyping gave me this: https://youtu.be/E3t8-u8e5D0

It's actually quite simple to get to that point, and even get Intl hooked in. There's going to be more work to be done to get complex messages handling.

I'm wondering if it's better to import pseudo for side-effects and allow itself to hook into fluent:

import "fluent-pseudo";

let cx = new MessageContext(locales, {
  usePseudo: true
});

or make people hook it explicitly:

// strategy1 - 30% longer via duplication of vowels, Latin chars transformed, LTR
import { strategy1 } from "fluent-pseudo";

let cx = new MessageContext(locales, {
  transform: strategy1
});

Enough for now, will wait for stas :)
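For illustration, a transform in the spirit of the strategy1 comment above might look like this. It is only a sketch: it accents and doubles every vowel, so the growth is not tuned to 30%, and the character table is an assumption rather than the mapping used in l20n.js or Gecko.

```javascript
// Assumed mapping from plain vowels to accented look-alikes.
const ACCENTED_VOWELS = new Map([
  ["a", "\u00e1"], ["e", "\u00e9"], ["i", "\u00ed"], ["o", "\u00f3"], ["u", "\u00fa"],
  ["A", "\u00c1"], ["E", "\u00c9"], ["I", "\u00cd"], ["O", "\u00d3"], ["U", "\u00da"],
]);

// Replace each vowel with its accented form and duplicate it, which makes the
// string longer while keeping it readable. Other characters pass through.
function accent(text) {
  let out = "";
  for (const ch of text) {
    const accented = ACCENTED_VOWELS.get(ch);
    out += accented === undefined ? ch : accented + accented;
  }
  return out;
}

console.log(accent("Hello world"));
```

A function of this shape, string in and string out, is what the explicit `transform: strategy1` variant would receive.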

stasm commented 6 years ago

I find the explicit version easier to understand. It will also be easier to test.

Pike commented 6 years ago

From a developer point-of-view, I don't expect that any Firefox developer will be touching code at that abstraction level. We explicitly don't want these folks to know that MessageContext even exists.

stasm commented 6 years ago

Agreed. IIUC this issue is about the low-level API which fluent-web will completely hide.

stasm commented 6 years ago

@zbraniecki and I talked about this yesterday and today. We'd like to start simple with the approach from comment https://github.com/projectfluent/fluent.js/issues/83#issuecomment-337954325.

- The MessageContext constructor will accept a process or transform option whose value is a function to be invoked on all TextElements.
  - The transformation would happen inside of the MessageContext.addMessages call in the runtime parser.
  - We'll publish a new package with the psaccent and psbidi transforms. We'll discuss the exact strategies and implementations later.
  - Users are free to write their own transform functions. We encourage experimentation.
- For now, we'll use regular language tags: en-US, de, etc. The transform function should only be passed to the constructor if the user has expressed interest in using pseudolocales. This should be handled outside of Fluent.
  - As a consequence, formatted dates interpolated into pseudolocalized translations will be spelled normally (e.g. Tuesday if the current locale is en-US).
- In the future, we'll have Intl.Locale (and fluent-locale) and it will be easy to recognize well-formed BCP47 variant tags, e.g. en-US-psaccent and en-US-psbidi.
  - CLDR's en-XA and ar-XB are called such mostly because of legacy code in Android which wouldn't handle language tags with variants.
  - @zbraniecki will start a discussion with CLDR about using language variants for pseudolocales.
  - In even farther future we might try to standardize the variants with IANA.
    - If variants get standardized, MessageContext could by default include transforms for known pseudolocales.

Pike commented 6 years ago

> Users are free to write their own transform functions. We encourage experimentation.

This issue uses the word user for a ton of things, I'm losing track.

Say, I'm a firefox developer, and I want to run my local build with psaccent on. How would I do that, and which parts of our code stack are involved in doing so, and what would they need to do?

stasm commented 6 years ago

> This issue uses the word user for a ton of things, I'm losing track.

Good point. I meant the users of the library here. Elsewhere I meant the user of the app.

> Say, I'm a firefox developer, and I want to run my local build with psaccent on. How would I do that, and which parts of our code stack are involved in doing so, and what would they need to do?

You would start by flipping a pref somewhere in the UI. The values of the pref could be: psaccent, psbidi. fluent-gecko (which is fluent-dom packaged for Gecko privileged content) would observe this pref and use it in its generateMessages, which constructs MessageContext instances. fluent-react in Devtools would need to do the same.
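The pref-to-transform step could be sketched as below. The two transforms are placeholders (vowel doubling, and wrapping in the Unicode right-to-left override characters), and `bundleOptionsFor` is a hypothetical helper; how fluent-gecko actually reads the pref is not shown in this thread.

```javascript
// Stand-in transforms keyed by the pref values mentioned above.
const TRANSFORMS = new Map([
  ["psaccent", (text) => text.replace(/[aeiou]/g, (vowel) => vowel + vowel)],
  // Wrap the text in RIGHT-TO-LEFT OVERRIDE ... POP DIRECTIONAL FORMATTING.
  ["psbidi", (text) => "\u202e" + text + "\u202c"],
]);

// Build constructor options from the pref value. When the pref is empty or
// unknown, no transform key is set, so regular users pay no extra cost.
function bundleOptionsFor(prefValue) {
  const transform = TRANSFORMS.get(prefValue);
  return transform === undefined ? {} : { transform };
}

console.log(Object.keys(bundleOptionsFor("psaccent"))); // [ 'transform' ]
console.log(Object.keys(bundleOptionsFor(""))); // []
```

The resulting object would be passed as the second argument when constructing each context, keeping the decision about pseudolocales outside of Fluent as agreed in the plan.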

zbraniecki commented 4 years ago

Can we close this issue? We have had the capability for pseudolocalization since fluent 0.7 and we use it in Gecko. Or should we wait until we extract fluent-pseudo as a package? (I have that in Rust - https://github.com/projectfluent/fluent-rs/tree/master/fluent-pseudo )

julienw commented 3 years ago

Hey @zbraniecki, by chance would you have some guidance or documentation about how to use pseudolanguages with fluent.js/fluent-react in a plain web page (as opposed to in Firefox)? Thanks!

zbraniecki commented 3 years ago

Hmm, I can tell you how to enable it in fluent.js, but not in React. You need to extract the pseudolocalization code from an old L10nRegistry.jsm ( https://hg.mozilla.org/mozilla-central/file/a1f74e8c8fb72390d22054d6b00c28b1a32f6c43/intl/l10n/L10nRegistry.jsm#l425 ), and then when constructing FluentBundle you pass that function as transform ( https://github.com/projectfluent/fluent.js/blob/master/fluent-bundle/src/bundle.ts#L61 ). I assume something similar happens for React, but I'm short on details. If you do spend time on this, I'd accept that resurrected block of code as fluent-pseudo in the fluent.js repo to maintain it!

julienw commented 3 years ago

Thanks for the pointers! This is what I did to support pseudo locales in the profiler: https://github.com/firefox-devtools/profiler/pull/3188

We enable a pseudo locale by calling a function in the devtools console.

Would the file https://github.com/firefox-devtools/profiler/pull/3188/files#diff-ca1e6802f7be91e16b4123f89f090a2c40053a53e52b73ed3d69469619179d24 be suitable as fluent-pseudo? I'm not sure how "bidi" would set "rtl" with "fluent-dom", do you know? Or maybe fluent-dom doesn't set it anyway, like fluent-react?

zbraniecki commented 3 years ago

yeah, it looks good!

For a while we used a hardcoded list which is quite stable - https://github.com/mozilla-b2g/gaia/blob/master/shared/js/intl/l20n-client.js#L31-L35
