lingui / js-lingui

🌍 📖 A readable, automated, and optimized (3 kb) internationalization for JavaScript
https://lingui.dev
MIT License
4.54k stars 380 forks

Suggestion: Optimize the length of message ID #139

Closed fohrloop closed 3 years ago

fohrloop commented 6 years ago

Hi,

I have been thinking that, when translating longer sentences, it would be nice if the message ID was not the English (or whatever development language) message but a shorter, random ID. After thinking about it a bit more, I came up with this question: would it be possible that, in production code, the message IDs are just integers ranging from 1 to the number of translated strings? This would be a kind of "minification" step when building the bundles (with Webpack).

tricoder42 commented 6 years ago

Yes, a similar idea was suggested in another issue. It definitely makes sense.

The question is whether message IDs should be replaced with integers or with hashes.

One thing I would like to support in the future is translations in 3rd party libraries. For example, when you import react-datepicker, which is translated using jsLingui, the CLI should also pick up messages from this 3rd party library and add them to your message catalog. For this reason it might be better to use hashes instead of integers, but we still need to find a bulletproof way to prevent duplicates.

fohrloop commented 6 years ago

I would personally prefer integers, to have even shorter message IDs, but maybe both could exist as options. Handling duplicates could work something like this:

1) Make message IDs for every string.
2) Look at all duplicated IDs (if any) and compare the original, untranslated strings. If they are the same, do nothing. If they are not the same, use the ID N+1 (where N is the highest number used so far).
3) Repeat step 2 until there are no more changes.
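The duplicate handling described here could be sketched roughly as follows. This is only an illustration: `assign_ids` and the `make_id` callback are hypothetical names, not part of jsLingui, and `make_id` stands for any ID generator that may return the same integer for different messages. Because a fresh N+1 can never collide with an ID already in use, a single pass is enough in this sketch.

```python
def assign_ids(messages, make_id):
    """Map each distinct source message to a unique integer ID.

    make_id is a hypothetical callback returning a candidate integer ID
    that may collide for different messages.
    """
    ids = {}    # source message -> final ID
    owner = {}  # final ID -> source message that owns it
    for msg in messages:
        if msg in ids:
            continue  # same source string: reuse its ID, do nothing
        candidate = make_id(msg)
        if candidate in owner:
            # A different source string already owns this ID: use N + 1,
            # where N is the highest ID used so far.
            candidate = max(owner) + 1
        ids[msg] = candidate
        owner[candidate] = msg
    return ids


# A deliberately weak ID function (message length) to force a collision.
result = assign_ids(["Hello", "World", "Hello", "Hi"], len)
print(result)
```

Here "Hello" and "World" both get the candidate ID 5, so "World" is moved to 6, while the repeated "Hello" keeps its ID.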

tricoder42 commented 6 years ago

Good points. I think we could still use [0-9a-zA-Z] instead of integers to make IDs even more compact, and maybe also include some special characters? Isn't there an encoding which does exactly this?

fohrloop commented 6 years ago

Yeah, that's even more compact. I think the name for that is "base 62" numbers (since there are 62 different symbols). You can add as many special characters as you like if you want even shorter IDs. But I guess not many people use even 1000 message IDs, so the real effect is minimal (compared to changing the original string to an integer). But of course, optimization is fun, so why not :)

Here is an online tool to play around with: https://jalu.ch/coding/base_converter.php

Edit: The algorithm behind the conversion is actually pretty straightforward. Copy-pasting from a SO answer (Python code)

# this list of symbols allows conversion of numbers represented in bases up to 62
base_symbols='0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

def v2r(n, b): # value to representation
    """Convert a non-negative number n to its digit representation in base b."""
    if n == 0:
        # without this check, zero would be returned as an empty string
        return base_symbols[0]
    digits = ''
    while n > 0:
        digits = base_symbols[n % b] + digits
        n = n // b
    return digits

def r2v(digits, b): # representation to value
    """Compute the number given by digits in base b."""
    n = 0
    for d in digits:
        n = b * n + base_symbols[:b].index(d)
    return n

def b2b(digits, b1, b2):
    """Convert the digits representation of a number from base b1 to base b2."""
    return v2r(r2v(digits, b1), b2)

tricoder42 commented 6 years ago

Looks good! I'll see what I can do. This could probably land in next major release. I'll start writing release docs for features I want to include.

nicolas-cherel commented 6 years ago

From what I see, there are two paths to a reliable and sustainable way to handle this feature:

  1. The computed IDs land in <lang>/messages.json
    ❗️ It is critical that IDs are stable for long-term understanding of what happened to translations, because any kind of re-ordering will be confusing and/or introduce artificial changesets in translation versioning. Nothing but hashing can enforce this.
    → Hashes can be long, so they should be used only for longer texts
    → but it brings implementation simplicity: calculate IDs on extraction and in the transforms, and there you go.
    → You can use those IDs if your intl toolchain needs them.

  2. More optimized IDs: integer IDs generated for, and tied to, a specific build
    ❗️ Only the transforms and the compiler calculate the IDs, activated by configuration.
    ❗️ It's not clear to me right now how we can generate consistent IDs from messages.json files and source code, but it seems definitely doable, maybe not in a 100% independent way. We will either have to rely on algorithmic consistency (e.g. based on origin filepath:lino) or inject the compiler's ID computation into the code transform somehow.
    → Simpler from a library user's POV
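To illustrate why hashing gives stable IDs in the first scenario, here is a minimal sketch. The names `hashed_id` and `extract`, the choice of SHA-256, and the truncation to 8 hex characters are all assumptions made for the example, not anything jsLingui does; a truncated hash can still collide, so a real implementation would need a collision check.

```python
import hashlib


def hashed_id(message, length=8):
    """Return a short, deterministic ID derived from the message text."""
    digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
    return digest[:length]  # truncation length is an arbitrary assumption


def extract(messages):
    """Build a catalog keyed by hashed ID, keeping the source text."""
    return {hashed_id(m): m for m in messages}


first = extract(["Hello World", "Goodbye"])
second = extract(["Goodbye", "A new message", "Hello World"])

# Entries for unchanged messages are identical in both extractions,
# even though the order changed and a message was added.
print(first.items() <= second.items())
```

Since the ID depends only on the message text, re-ordering or adding messages leaves existing IDs untouched, which is the property the first scenario relies on.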

tricoder42 commented 6 years ago

  1. I really like the simplicity of this solution. You could have it easily by setting a single switch in the config.

  2. I would keep the original messages both in the source code and in messages.json, and replace IDs for the production build only. This was my original idea, but it requires a two-step compilation of the source code (1. compile & extract, 2. translate, 3. compile & replace IDs). The plugin which transforms tags needs to be aware of the message catalog, but that should be easy.
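The "compile & replace IDs" step could look roughly like the sketch below. It is not how the jsLingui compiler works: `build_id_map` and `compile_catalog` are hypothetical helpers, the catalog is simplified to a plain source-to-translation mapping, and the sample translations are made up for the example.

```python
import string

# 0-9, A-Z, a-z: the same 62 symbols as in the base 62 snippet earlier in this thread.
SYMBOLS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def to_base62(n):
    """Encode a non-negative integer using the 62 symbols above."""
    if n == 0:
        return SYMBOLS[0]
    digits = ""
    while n > 0:
        digits = SYMBOLS[n % 62] + digits
        n //= 62
    return digits


def build_id_map(catalog):
    """Assign a compact ID to every source message in the catalog."""
    # Sorting makes the mapping reproducible for a given catalog.
    return {msg: to_base62(i) for i, msg in enumerate(sorted(catalog))}


def compile_catalog(catalog, id_map):
    """Key the translations by compact ID for the production bundle."""
    return {id_map[msg]: text for msg, text in catalog.items()}


catalog = {"Hello World": "Bonjour le monde", "Goodbye": "Au revoir"}
id_map = build_id_map(catalog)
print(id_map)
print(compile_catalog(catalog, id_map))
```

The same `id_map` would have to be handed to the code transform, so that the IDs in the bundled source match the keys of the compiled catalog.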

nicolas-cherel commented 6 years ago

(refinement on scenario 2 here)

After investigation I realized that the code transformers do not handle all the cases covered by the extractor (such as i18nMark()). So we have to keep something based on messages.json. Maybe the key index in the messages.json file would be enough, or we can add a generated and sanitized shortId.

nicolas-cherel commented 6 years ago

That said, I realize that we'd like to see i18nMark calls have their content replaced by the ID anyway 😸

karlhp commented 6 years ago

Just for the record, in case anyone is interested in JavaScript checksum/hash generation: "Generate a Hash from string in Javascript" https://stackoverflow.com/questions/7616461/generate-a-hash-from-string-in-javascript

tricoder42 commented 6 years ago

@karlhp It depends, hash might not be required at all.

If we have an app with, let's say, 1000 strings, then a msgId converted to a base62 number would be 2 characters long at most (two characters cover up to 62^2 = 3844 messages, actually). With three-character msgIds we can have 62^3 = 238,328 messages.

The only drawback is that it requires a two-step extract/compilation: first, to extract all messages from the source, and second, to replace the msgIds.
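The arithmetic behind those numbers can be checked with a few lines of Python. `id_length` is a hypothetical helper written for this comment; it assumes every combination of symbols, including ones with leading zeros, is a usable ID.

```python
def id_length(message_count, base=62):
    """Smallest ID length that can address message_count distinct messages."""
    length, capacity = 1, base
    while capacity < message_count:
        length += 1
        capacity *= base
    return length


for count in (62, 1000, 3844, 3845, 238328):
    print(count, id_length(count))
```

With base 62, two characters are enough up to 3844 messages and three characters up to 238,328.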

karlhp commented 6 years ago

@tricoder42 base64 sounds even better.

I wouldn't bother about two-step extraction/compilation at all.

karlhp commented 6 years ago

@tricoder42 I misunderstood your last comment with base62 msgIds or maybe I am missing something.

The purpose of checksums, as I meant it, is to ensure that strings in the message files always have the same ID, as @np-8 points out in his comment:

"it is critical that ids are stable for long-term understanding of what happened to translations, because any kind of re-ordering will be confusing, and/or introduce artificial changeset in translations versioning. Nothing but hashing can enforce this."

The base62 msgIds are fine in the final build, but what are you suggesting as IDs in the message files? Simply leave the strings as they are now? Why not use checksums in the locale message files upon extraction and base62 msgIds in the final production version only?

I would guess that creating the hashes wouldn't be a big deal. It would make the locale message files more readable and compact, regardless if they are in JSON, PO or XLIFF format.

tricoder42 commented 6 years ago

I don't think it matters what you're using as message IDs in message catalogs.

If you translate by editing message catalogs directly, then I admit that PO file is much more readable than JSON. I guess even JSON in lingui format (source and translations are on different lines) might be ok, but as I worked with PO files today, they are really beautiful :)

If you translate using online service or using editor, which is how most translators work, then the format doesn't matter at all. You just need to load it into editor and you get nice UI designed specifically for translating.

I probably don't understand your intention. I thought you wanted to minimize the size of the JS bundle. I definitely didn't want to change local message catalogs, just the build ones. I can't imagine translating a file with hashes and comparing hashes across catalogs in different languages.

karlhp commented 6 years ago

@tricoder42, my intention is readability (regardless who reads the files), size of translation files and that it makes somehow no sense to have any kind of Id which is a few hundred or thousand characters long.

In PO files it makes sense to use the text as message Id because it exists only once. Anyway, it is definitely not the most essential issue.

tricoder42 commented 6 years ago

You mean size of the file on local disk?

The ID serves also as a message source, which is translated (if you use messages as IDs). That's the only reason why you need to keep it around:

{
   "Hello World": "<translation>"
}

You might also use custom IDs, if you want. This is the preferred method for many users:

{
   "msg.hello": "<translation>"
}

karlhp commented 6 years ago

The size only matters if you edit translation files with long text strings with an online tool. I am aware that many won't bother about this.

"Hello World" isn't a problem, but this one is:

{
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ": {
    "translation": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. ",
    "origin": [["src/pages/index.js", 45]]
  }
}

I don't like custom IDs because I find them a nightmare to maintain.

tricoder42 commented 6 years ago

I understand your point. Still it seems to me like a very special case, because usually the local file size isn't a problem. I wonder what would be the size difference for a decent message catalog (1000-2000 messages).

As a workaround, you could write a very short script which replaces messageIDs with hashes before uploading them to online tool.
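Such a workaround script might look like the sketch below. It is only an illustration under assumptions: the catalog entries follow the `translation`/`origin` shape quoted earlier in this thread, while the `source` field, the helper names `shorten_keys` and `restore_keys`, and the 10-character SHA-1 prefix are made up for the example.

```python
import hashlib
import json


def shorten_keys(catalog):
    """Replace long message keys with short hashes, keeping the source text."""
    shortened = {}
    for message, entry in catalog.items():
        key = hashlib.sha1(message.encode("utf-8")).hexdigest()[:10]
        # Keep the original message so it can be restored after translation.
        shortened[key] = {"source": message, **entry}
    return shortened


def restore_keys(shortened):
    """Undo shorten_keys once the translated file comes back."""
    restored = {}
    for entry in shortened.values():
        entry = dict(entry)  # copy so the input is not mutated
        restored[entry.pop("source")] = entry
    return restored


catalog = {
    "Lorem ipsum dolor sit amet": {
        "translation": "",
        "origin": [["src/pages/index.js", 45]],
    }
}
short = shorten_keys(catalog)
print(json.dumps(short, indent=2))
print(restore_keys(short) == catalog)
```

The round trip through `restore_keys` gives back the original catalog, so the local files would not have to change.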

karlhp commented 6 years ago

The size will often not matter at all unless you have very long text messages, for example, i18n for documentation.

However, my feeling is that it just doesn't make sense from an engineering point of view, though I guess this is debatable. Maybe there are others who insist on keeping the current approach.

I don't need a fix or workaround right now, I am thinking long term to have an i18n library which I can use with confidence.

If time permits I could implement this myself and make it an option, I'll think about it.

tricoder42 commented 6 years ago

Documentation for Django has 25000 messages and they use simple gettext file:

# Messages are very long...

#: ../../howto/auth-remote-user.txt:5
msgid "This document describes how to make use of external authentication sources (where the Web server sets the ``REMOTE_USER`` environment variable) in your Django applications.  This type of authentication solution is typically seen on intranet sites, with single sign-on solutions such as IIS and Integrated Windows Authentication or Apache and `mod_authnz_ldap`_, `CAS`_, `Cosign`_, `WebAuth`_, `mod_auth_sspi`_, etc."
msgstr ""

#: ../../howto/auth-remote-user.txt:18
msgid "When the Web server takes care of authentication it typically sets the ``REMOTE_USER`` environment variable for use in the underlying application.  In Django, ``REMOTE_USER`` is made available in the :attr:`request.META <django.http.HttpRequest.META>` attribute.  Django can be configured to make use of the ``REMOTE_USER`` value using the ``RemoteUserMiddleware`` or ``PersistentRemoteUserMiddleware``, and :class:`~django.contrib.auth.backends.RemoteUserBackend` classes found in :mod:`django.contrib.auth`."
msgstr ""

All catalogs (without translations) are 10kb in total, so that could be about 20kb including translations.

I really feel any optimization here is unnecessary and will make translation process more complicated.

karlhp commented 6 years ago

It would still be almost double the size in JSON, though I agree this is not a real problem in most cases.

I would also use the PO format anyway once it becomes supported. Just one more question; is the ICU format in source code actually compatible with the PO format in message files?

tricoder42 commented 6 years ago

PO is a file format, while ICU is a message format. The PO format has been supported since the latest release, 2.4. You can use any message format inside a PO file (this is usually indicated using flags), so there's no problem with using ICU MessageFormat syntax inside a PO file.

karlhp commented 6 years ago

Thanks, PO files are working but I think I start to understand the problem with internationalization message formats, in particular with JSON:

If Your Only Tool Is a Hammer Then Every Problem Looks Like a Nail.

The JSON format is perfect for data transfer but neither for configuration nor message files. JSON is simply the wrong tool for this and PO files lack modern requirements (no real ICU message format standard/support). This is nothing specific to js-lingui but looks to be a common standard in Javascript, respectively nodejs land.

So long-term I am looking forward to XLIFF format support which looks to be the only right tool for i18n tasks.

Nevertheless, js-lingui is great and it will remain my first choice for i18n. Thanks for all your efforts so far!

bjrn commented 6 years ago

Regarding hashing of message IDs: we're manually adding message IDs (using react-intl for translations), but while checking out translation alternatives a while ago I stumbled upon https://github.com/guigrpa/mady, which has hashing of IDs and compilation of messages built in... there might be some tried and tested implementation ideas there?

stereobooster commented 5 years ago

there is also z85 - string safe variation of base85 (85^2=7225)

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.