wmertens commented 1 year ago

Qwik-native i18n

Continuing the Discord conversation here.

We want to have an automated mapping from keys to strings, depending on the user locale.

Requirements

SSR support with 0 JS on page load
Works without server
Embed parameters into result
Work outside of component$ context (ok to require extra data)
Work when some/all translations are missing
Warn when translations are missing for prod build
Allow changing the "base" translations without changing the code
Fallback locales, e.g. en_us -> en -> C

Things to minimize

bytes shipped to client
CPU at runtime
memory at runtime
delay at build time

Things to maximize

developer happiness
ease of updating translations

Bonus points

allow changing language without reloading page

Things to maybe allow

These are probably nice and it would be good if they are potentially possible without changing the API later:

parameters can influence translation strings, e.g. ${count} items vs ${count} item
dynamically add to translations after build, in prod
grouped translations (e.g. categories, tags)

Contexts

inside and outside the component tree:
- Inside the tree, we can inject context near the root and react to signals
- Outside the tree, we need to provide all context and we can only await promises
SSR vs client:
- During SSR we can use JS, and we can have all translations available in memory
- On the client, we can't use JS until the page is loaded, and we have only a subset of translations in memory

Approach

API

We'll use template strings to allow embedding parameters into the translation strings. We'll let the dev choose the prefix; for example:

import _ from "@builder.io/qwik-i18n";

const MyComponent = component$(() => {
  const $ = useLocale();
  return <div>{_`Hello, ${name}!`}</div>;
});

For outside-of-tree use, we'll need the locale to be passed in explicitly. We can let the template function also be a function accepting the locale, then returning a template function that returns a promise:

import _ from "@builder.io/qwik-i18n";

const logger = async () => {
  console.log(await _("en_us")`Hello, ${name}!`);
};

Template strings are converted to keys for mappings. Parameters are replaced with $#, for example:

_`Hello, ${name}!` -> "Hello, $0!" -> "Bonjour, $0!"
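To make the key derivation concrete, here is a minimal sketch of a tagged template that builds the `$0`-style key and substitutes parameters back into the translated string. The names `toKey`, `fillParams`, `t` and the in-memory `french` table are made up for illustration; they are not part of any existing package.

```typescript
/** Build the lookup key for a template: literal parts joined by $0, $1, ... */
function toKey(strings: readonly string[]): string {
  return strings.reduce(
    (key, part, i) => (i === 0 ? part : `${key}$${i - 1}${part}`),
    "",
  );
}

/** Replace $0, $1, ... in a translated string with the actual parameters. */
function fillParams(translated: string, params: readonly unknown[]): string {
  return translated.replace(/\$(\d+)/g, (match, digits: string) => {
    const index = Number(digits);
    // Leave the placeholder untouched when no such parameter was passed.
    return index < params.length ? String(params[index]) : match;
  });
}

// Toy in-memory table standing in for a loaded locale file (assumption).
const french: Record<string, string> = { "Hello, $0!": "Bonjour, $0!" };

/** Tagged template: derive the key, look it up, fall back to the key itself. */
function t(strings: TemplateStringsArray, ...params: unknown[]): string {
  const key = toKey(strings);
  return fillParams(french[key] ?? key, params);
}

const userName = "Ada";
console.log(t`Hello, ${userName}!`); // Bonjour, Ada!
console.log(t`Bye, ${userName}!`); // no translation, so: Bye, Ada!
```

Because the key is the template with its parameters replaced, the untranslated key doubles as the C-locale text once the parameters are filled back in.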

Conceptual implementation

The _ function will manage a singleton store of all used translations. For SSR, it will eagerly load all locales. On the client, it will only load translations when they are used.

We consider the template string to be written in the C locale. If a translation is missing, we'll use the C locale as the fallback.

We'll use a qwik-i18n build step to extract all template strings from the source code and generate a JSON file with all translations, per language. This JSON file will be loaded by the _ function as needed. We'll also optimize the function calls, see below.

The _ function therefore maps from C to the desired locale, loading the translations as needed. Inside the tree it returns a component that uses a store to get the translations. Outside the tree it returns a promise for the translation.

All files that need to be maintained are stored under /i18n, and the resulting data files are stored under /public/_i18n.

If a translation is missing, _ will try to load the locale, the fallback locale, and finally the C locale. If the translation is still missing, it will return the key.
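The lookup order could be sketched as follows. This assumes the fallback locale is derived by stripping one `_suffix` at a time (matching the en_us -> en -> C example in the requirements) and uses synchronous in-memory tables for brevity, whereas real loading would be async. `fallbackChain`, `lookup` and `demoTables` are hypothetical names.

```typescript
type Table = Record<string, string>;

/** "en_us" -> ["en_us", "en", "C"]: strip one "_suffix" at a time, end with C. */
function fallbackChain(locale: string): string[] {
  const chain: string[] = [];
  let current = locale;
  while (current && current !== "C") {
    chain.push(current);
    const cut = current.lastIndexOf("_");
    current = cut === -1 ? "" : current.slice(0, cut);
  }
  chain.push("C");
  return chain;
}

/** Try each locale in the chain; if nothing matches, return the key itself. */
function lookup(key: string, locale: string, tables: Record<string, Table>): string {
  for (const candidate of fallbackChain(locale)) {
    const hit = tables[candidate]?.[key];
    if (hit !== undefined) return hit;
  }
  return key;
}

// Toy data: fr_be has no entries of its own, so it falls back to fr.
const demoTables: Record<string, Table> = {
  fr: { "Hello, $0!": "Bonjour, $0!" },
  fr_be: {},
};

console.log(fallbackChain("fr_be")); // [ 'fr_be', 'fr', 'C' ]
console.log(lookup("Hello, $0!", "fr_be", demoTables)); // Bonjour, $0!
console.log(lookup("Goodbye", "fr_be", demoTables)); // Goodbye
```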

Optimizations

Since Qwik can recover text nodes for serializing stores, we must ensure that translations are added verbatim to the DOM. Furthermore, we want to ship as little data as possible to the client. We'll start each SSR with an empty store in the i18n context, and it will be populated by _ calls. This means that at the end of SSR, the store contains only the used translations, and Qwik will reuse the text nodes. Only when parameters are used, the text nodes will differ from the store data.
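As a rough illustration of the "empty store per SSR" idea, the sketch below records a translation only when it is actually rendered. It says nothing about how Qwik serializes stores or reuses text nodes; `createRequestStore` and `allFrench` are invented names, and the translations are indexed by number purely for brevity.

```typescript
// All translations for one locale, fully available in memory during SSR.
const allFrench: readonly string[] = ["Bonjour, $0!", "Annuler", "Enregistrer"];

/** Per-request store: starts empty and only records what gets rendered. */
function createRequestStore(all: readonly string[]) {
  const used: Record<number, string> = {};
  return {
    translate(index: number): string | undefined {
      const text = all[index];
      if (text !== undefined) used[index] = text;
      return text;
    },
    /** What would be handed to the client at the end of SSR. */
    snapshot(): Record<number, string> {
      return { ...used };
    },
  };
}

const store = createRequestStore(allFrench);
store.translate(1); // this page only renders "Annuler"
console.log(store.snapshot()); // { '1': 'Annuler' }
```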

On the client, we'll use the store to populate the translations singleton, and load additional translations as needed.

Having a single JSON per language means that to look up one translation, all translations are loaded. We can improve this by splitting the translations into multiple files.

First, we'll map from C locale to an index. This index is then used as the key in the locale's JSON file, which now becomes an array. We'll split the JSON array into multiple files, each containing e.g. 15 translations. We'll use the index to determine which file to load.

A bonus is that the JSON arrays don't need keys any more, saving a few bytes per translation.
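The index-to-file arithmetic is simple enough to sketch. The chunk size of 15 comes from the example above; the file layout `/_i18n/<locale>/<chunk>.json` is only an assumption for illustration, as are the function names.

```typescript
const CHUNK_SIZE = 15; // "e.g. 15" in the text above

/** Which chunk holds a given index, and where inside that chunk. */
function locate(index: number): { chunk: number; offset: number } {
  return { chunk: Math.floor(index / CHUNK_SIZE), offset: index % CHUNK_SIZE };
}

/** Hypothetical URL of the chunk file for one locale. */
function chunkPath(locale: string, index: number): string {
  return `/_i18n/${locale}/${locate(index).chunk}.json`;
}

/** Build step side: split the full per-locale array into chunk arrays. */
function splitIntoChunks<T>(all: readonly T[], size: number = CHUNK_SIZE): T[][] {
  const chunks: T[][] = [];
  for (let start = 0; start < all.length; start += size) {
    chunks.push(all.slice(start, start + size));
  }
  return chunks;
}

console.log(locate(4)); // { chunk: 0, offset: 4 }
console.log(locate(31)); // { chunk: 2, offset: 1 }
console.log(chunkPath("fr", 31)); // /_i18n/fr/2.json
```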

During the build step, we'll maintain the C index mapping. Any existing mapping is retained, and new translations are added to the end of the array. This means that translations will always retain their index, even when new translations are added.
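A minimal sketch of that append-only update, assuming the mapping is kept as an array of keys whose position is the index (`updateIndex` is a made-up name):

```typescript
/**
 * Merge keys found in the source into the existing C index.
 * Existing entries keep their position; unseen keys are appended in the
 * order they are encountered. Nothing is ever removed or reordered.
 */
function updateIndex(existing: readonly string[], found: Iterable<string>): string[] {
  const next = [...existing];
  const known = new Set(existing);
  for (const key of found) {
    if (!known.has(key)) {
      known.add(key);
      next.push(key);
    }
  }
  return next;
}

const before = ["Hello, $0!", "Cancel"];
const after = updateIndex(before, ["Save", "Cancel", "Hello, $0!", "Save"]);
console.log(after); // [ 'Hello, $0!', 'Cancel', 'Save' ]
```

Keys that disappear from the source stay in the array in this sketch, since dropping them would shift every later index.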

Importantly, this means that indexes close in number are also close in application context, which may mean that loading an array subset to satisfy a single translation also loads other translations that are likely to be used.

Since we know the C index mapping during the build, we can replace the _ template calls with the index. For example, if the key "Hello, $0!" derived from _`Hello, ${name}!` has index 4, the call becomes _(4, name), saving yet more bytes. However, to allow for signal propagation, we in fact replace the call with the resulting element, namely <I18n id={4} params={[name]} />.

If the _ function is still called as a template function, we load the mapping file and look up the index. If not, the mapping file will never be loaded. This provides a fallback during development.

Mapping C to an index and chunking the locale arrays makes it hard to maintain the translations manually though, so we'll let the translations be managed as YAML files (which are a superset of JSON). The build step will convert the YAML files to the chunked JSON arrays. The build step will also add missing keys to the YAML files.

To allow for dynamic translations, the _ function can load extra mappings in any convenient way.

Grouping translations can be done in many ways, so we'll leave that for now, confident that we can add it later without issues.

To allow for varying translations based on the parameter, we can allow the translation string to be an object with the key values of $0 being used to select the translation. For example: _`${count} items` can map to the translation object { "1": "1 item", "_": "$0 items" }.
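Selecting a variant from such an object could look like the sketch below. It assumes that "_" is the catch-all entry, as in the example above, and that only $0 drives the selection. `pickVariant` and `translateCount` are illustrative names.

```typescript
type Translation = string | { [selector: string]: string };

/** Pick the variant whose key equals the value of $0, else the "_" entry. */
function pickVariant(translation: Translation, first: unknown): string {
  if (typeof translation === "string") return translation;
  return translation[String(first)] ?? translation["_"] ?? "";
}

/** Pick a variant, then substitute $0 into it. */
function translateCount(translation: Translation, count: number): string {
  return pickVariant(translation, count).split("$0").join(String(count));
}

const items: Translation = { "1": "1 item", _: "$0 items" };
console.log(translateCount(items, 1)); // 1 item
console.log(translateCount(items, 3)); // 3 items
```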

Conclusion

This approach seems to tick all the boxes, with minimal data transferred during SSR.

The net result is a /i18n folder containing C.json (an array of all encountered template strings) and per locale a locale.yaml file (containing the translations). The build step will generate the chunked JSON files from the YAML files under /public/_i18n, and the _ function will load the JSON files as needed.

This requires a Vite plugin that can detect and update all the calls done with the _ function, as well as maintaining the C.json file, adding missing keys to the locale.yaml files, and generating the chunked JSON files.

wmertens commented 1 year ago

Probably instead of putting the json files in /public they can go in the build as well. Vite can probably handle imports with interpolations.

cwerner1 commented 1 year ago

For the pluralization of the variable strings, please look at how other frameworks implement this. Laravel comes to mind: https://laravel.com/docs/10.x/localization#pluralization It allows rules for choosing the right string variant depending on the number, for example 'Apples' => '{0} There is none|{1} There is one|[2,*] There is :count', or 'apples' => '{0} There are none|[1,19] There are some|[20,*] There are many',

Not all languages have simple rules like zero and many.

wmertens commented 1 year ago

@cwerner1 good point, although I prefer not parsing the text too much, so I'd rather do something like {0: 'tr for 0', 1: 'tr for 1', 2: 1, 3: 1, _: 'tr for all'}, where 2 then uses the translation for 1.
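Resolving such a table needs no text parsing: a numeric value is treated as a pointer to another entry. A minimal sketch, assuming "_" is the catch-all and guarding against alias cycles (`resolvePlural` is a made-up name):

```typescript
type PluralTable = { [selector: string]: string | number };

/** Look up the entry for count, following numeric aliases to other entries. */
function resolvePlural(table: PluralTable, count: number): string {
  let entry: string | number | undefined = table[String(count)] ?? table["_"];
  const seen = new Set<string>();
  while (typeof entry === "number") {
    const target = String(entry);
    if (seen.has(target)) throw new Error(`alias cycle at "${target}"`);
    seen.add(target);
    entry = table[target];
  }
  if (entry === undefined) throw new Error(`no translation for count ${count}`);
  return entry;
}

const table: PluralTable = { 0: "tr for 0", 1: "tr for 1", 2: 1, 3: 1, _: "tr for all" };
console.log(resolvePlural(table, 2)); // tr for 1
console.log(resolvePlural(table, 7)); // tr for all
```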

wmertens commented 1 year ago

Here is a prototype of what the runtime looks like: Qwik Playground

(I needed to pin 1.1.5 because 1.2 has a broken playground, and for some reason the language change causes an import error in dev mode so it's running in prod).

Note the '...' and loading messages when changing languages, but only one time.

Also note that the initial html only contains the translation Aardvark once, in the HTML, and not in the store.

wmertens commented 1 year ago

I added parametrized translations to the prototype: Qwik Playground

However, this somehow puts Aardvark into the serialized store even though it shouldn't. I'm guessing this is a fixable problem though, since in the previous prototype it doesn't include Aardvark.

imoldfella commented 1 year ago

what are your thoughts on attributes? will these be supported by the vite transformation?

<button aria-label={_`aardvark`} >foo</button>

wmertens commented 1 year ago

@imoldfella hmmm didn't think of that one, that's indeed harder because _ normally returns a component. So either the user has to use a different call or the transform has to change depending on context.

In this case the $localize approach just works...

imoldfella commented 1 year ago

what do you think about vite transforming all the _`` into signals? They seem pretty lightweight, although not completely free.

wmertens commented 1 year ago

Hmm that could be done if _ came from a hook instead of an import 🤔

cwerner1 commented 1 year ago

@wmertens Your suggested rules are only usable for a limited list of languages. On this page: https://lingohub.com/blog/2019/02/pluralization under "CLDR Overview" there is a table with multiple languages and their rules for certain numbers and pluralized strings. For example, Slovenian has different rules depending on the mod of the number, and French uses the same translation for 0 and 1.

About 9 years ago I integrated a pluralization engine on top of a Zend Framework application, and I solved the problem this way: I would recommend a simple translation function _, which doesn't handle any pluralization, and another one, something like _p(string, number, ...args), which can handle this kind of thing. That would also be the place to integrate the rules engine for the different languages.
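For reference, JavaScript runtimes already ship the CLDR plural rules through the standard `Intl.PluralRules` API, so a separate plural function would not need its own rules engine. The sketch below is only an illustration of that idea: `plural` and the form tables are hypothetical, and the Slovenian strings are example data.

```typescript
type Category = "zero" | "one" | "two" | "few" | "many" | "other";
type PluralForms = Partial<Record<Category, string>> & { other: string };

/** Choose a form by CLDR plural category, then substitute $0. */
function plural(locale: string, forms: PluralForms, count: number): string {
  const category: Category = new Intl.PluralRules(locale).select(count);
  const text = forms[category] ?? forms.other;
  return text.split("$0").join(String(count));
}

// With full ICU data, French puts 0 and 1 in the same category ("one").
const frRules = new Intl.PluralRules("fr");
console.log(frRules.select(0), frRules.select(1), frRules.select(2));

// Slovenian distinguishes one / two / few / other.
const apples: PluralForms = {
  one: "$0 jabolko",
  two: "$0 jabolki",
  few: "$0 jabolka",
  other: "$0 jabolk",
};
for (const n of [1, 2, 3, 5, 101]) {
  console.log(plural("sl", apples, n));
}
```

A table keyed by exact counts cannot express a rule such as "every number ending in 01" without listing each number, which is the gap the category-based lookup closes.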

wmertens commented 1 year ago

@cwerner1 I don't understand, can you give an example that you can't express using something like {0: "0", 1: 0, "2": "2", "_": "*"}?

Did you try the playground I mention here?

I'm also more and more convinced that putting the application strings in the build is a really good general approach, so then pluralization would have to be embedded as well.

PS note that the pluralization here is orthogonal to i18n itself.

tzdesign commented 1 year ago

@wmertens @mhevery is there something in the works or is this just a list for now?

We implemented qwik-speak and there are several things we don't like:

Extract is self-written and optional chaining breaks extracts for usePlural
In some cases passing the translation into props is causing errors (We define them in the function body 😅)
keys are more or less standard. I think optional would be better for edge cases otherwise always have the base-languages text in the function

The coolest thing ever would be to implement auto-extraction of all jsx texts.

It might be interesting to check Apple's latest changes to localization, especially how they work with variations and how the dictionaries are built.

https://developer.apple.com/documentation/xcode/localizing-and-varying-text-with-a-string-catalog

Looking forward to hearing from you guys.

mhevery commented 1 year ago

"We implemented qwik-speak and there are several things we don't like:"

But this repo is not for qwik-speak.....

Are there things that you don't like with $localize approach?

tzdesign commented 1 year ago

We are implementing it again with localize and will let you know.

mhevery / qwik-i18n

drop $localize, go all-native: a roadmap #8
