Requirements - MF wishlist

romulocintra commented 5 years ago

List of requirements to consider for MF

romulocintra commented 4 years ago

I'm listing requirements from the first meeting's slides:

List of possible requirements

Easier to use ICU “Select”
Fluent could be considered as a starting point for the future of message format
Have pluggable “formatters”(Date/Time/Number ...)
HTML Markup
Cross-platform / Universal Format
Messages should have more context “description” or ”metadata”
MessageFormat - More Readable
Escaping(“ or ‘ ) and Interpolations (html tags)
Rule Modifiers - Send Message or Send SMS -> similar to select ICU feature
Improve Translators / Developers UX/DX
I need to somehow be able to cache my translations
Use Yaml or JSON as file format
Message reference - from another Message

zbraniecki commented 4 years ago

Proposal for an additional requirement:

Provides a translation of an XML/HTML element.

jamuhl commented 4 years ago

Sorry, I wasn't there in the first meetings so I'm not sure what is meant by "HTML Markup"?

But:

fully agree on custom pluggable "formatters"

And add:

extended plurals, like:

{ count , plural ,
   =0 {No candy left}
  one {Got # candy left}
  <10 {Got a few candies left}
  10-20 {Got a handful candies left}
other {Got # candies left} }

edit: in i18next we use a postProcessing plugin to achieve that: https://github.com/i18next/i18next-intervalPlural-postProcessor#usage-sample
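
For illustration only, here is a minimal sketch of how such an extended selector might be resolved. The variant shape and the order of precedence are assumptions made for this example; this is neither the i18next plugin nor ICU behaviour. Only `Intl.PluralRules` is a standard API.

```javascript
// Hypothetical sketch of resolving an "extended plural".
// Assumed precedence: exact value, non-"other" plural category, upper bound,
// closed range, then "other". The variant shape is invented for this example.
const variants = [
  { exact: 0, text: "No candy left" },
  { category: "one", text: "Got # candy left" },
  { lessThan: 10, text: "Got a few candies left" },
  { range: [10, 20], text: "Got a handful of candies left" },
  { category: "other", text: "Got # candies left" },
];

function selectVariant(count, locale, list) {
  // CLDR plural category for this number in this locale ("one", "other", ...).
  const category = new Intl.PluralRules(locale).select(count);
  const tests = [
    (v) => v.exact === count,
    (v) => category !== "other" && v.category === category,
    (v) => v.lessThan !== undefined && count < v.lessThan,
    (v) => v.range !== undefined && count >= v.range[0] && count <= v.range[1],
    (v) => v.category === "other",
  ];
  for (const test of tests) {
    const hit = list.find(test);
    if (hit) return hit.text.replace("#", String(count));
  }
  throw new Error("message has no 'other' variant");
}

console.log(selectVariant(15, "en", variants));
```

Note that the numeric boundaries are fixed by whoever wrote the source message, so a translator cannot easily move them for a language that groups quantities differently.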

zbraniecki commented 4 years ago

HTML Markup

Ability to interpolate localization with HTML. Example:

<span>You have <b>6</b> unread messages from <img/> Mary.</span>

Fluent provides DOM Overlays which are heavily used in Firefox l10n - https://github.com/projectfluent/fluent.js/wiki/DOM-Overlays

jamuhl commented 4 years ago

@zbraniecki thank you for explaining...so basically take the innerhtml element(s) and extend it with the attributes and content contained in the translation...looks similar to the Trans component we have in react-i18next -> https://react.i18next.com/latest/trans-component (just we have no html elements but react components)

edit: guess we could mimic DOM-Overlays by extending our Trans component...just not sure if this is part of the syntax or an extension that is provided by the i18n library?

romulocintra commented 4 years ago

@mihnita should I reference your entire document here, or can we break it into features to add here?

zbraniecki commented 4 years ago

In our experience innerHTML in particular is a no-go for security reasons (l10n resources are treated as a third-party). I expect the requirements from the W3C to be similar here.

Instead, we whitelist allowed textual elements (<sup/>, <sub/>, <span/> etc.) and for everything else we require the developer to provide the elements in the source with a name, and then the localizer can position them using the same name:

<p data-l10n-id="key1">
  <a href="https://www.mozilla.org" data-l10n-name="link"/>
  <img src="./pics/img1.png" data-l10n-name="logo"/>
</p>

key1 =
    Welcome to <a data-l10n-name="link">Mozilla</a>!
    Please, click on <img data-l10n-name="logo"/> to proceed.

That's significantly more involved than innerHTML, but the end result is quite similar with a lot of linting, security, and sanity checks. We're also discussing further extensions - https://github.com/zbraniecki/fluent-domoverlays-js/wiki/New-Features-(rev-3)
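
To make the rule concrete, here is a rough sketch of the check described above. It is not Fluent's implementation: DOM Overlays operate on parsed DOM nodes, whereas this example only scans a string with a regular expression. The function name and the exact allowlist are assumptions.

```javascript
// Hypothetical sketch, NOT Fluent's implementation. It illustrates the rule:
// an element in a translation must either be an allowlisted text-level element
// or carry a data-l10n-name that the developer provided in the source.
const ALLOWED_TEXT_ELEMENTS = new Set(["sup", "sub", "span"]);

function checkTranslationMarkup(translation, sourceNames) {
  const problems = [];
  // Match opening or self-closing tags; closing tags (</a>) are skipped.
  const tagPattern = /<([a-zA-Z][a-zA-Z0-9]*)\b([^>]*?)\/?>/g;
  for (const [, tag, attrs] of translation.matchAll(tagPattern)) {
    const named = /data-l10n-name="([^"]*)"/.exec(attrs);
    if (named) {
      if (!sourceNames.has(named[1])) {
        problems.push(`unknown element name "${named[1]}"`);
      }
    } else if (!ALLOWED_TEXT_ELEMENTS.has(tag.toLowerCase())) {
      problems.push(`<${tag}> is neither allowlisted nor named`);
    }
  }
  return problems;
}

const names = new Set(["link", "logo"]);
console.log(
  checkTranslationMarkup('Welcome to <a data-l10n-name="link">Mozilla</a>!', names)
);
```

A real implementation would also have to decide which attributes a translation may set on a named element; that part is left out here.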

jamuhl commented 4 years ago

innerHTML was more referring to the content than to the implementation detail...same reason we do not just append translations into a react element by using dangerouslySetInnerHTML ;)

mihnita commented 4 years ago

I will break it into features. But maybe also link it, so that others can read the complete doc. I think that the current list of features will also need to "grow" with some more details. As it is, some of them are so short that only the one who proposed it really understands what it means :-)

Mihai


romulocintra commented 4 years ago

@mihnita

If you can break it into features, great; the link is important too.
I completely agree that some of the features won't fit in one line and will need more detail; those, IMHO, deserve their own issue or thread.

My Proposal :

If you can break it into features, that will be perfect (I agree that the link is important too).
Some of the features won't fit in a one-line description and need more detail; those, IMHO, deserve their own issue or thread. I suggest we create a new issue tagged "requirements" for each of them, holding all the detail and discussion, and keep a reference with a short description here so the list stays in one place.

I feel that the short-description ones will also grow to have their own issue/task, but I think we can figure that out later, after we groom and filter the list of requirements.

longlho commented 4 years ago

My proposal for the process @romulocintra is to set a deadline, then de-dupe the list, then prioritize into mvp, v1, v2... so we can move this along.

romulocintra commented 4 years ago

My proposal for the process @romulocintra is to set a deadline, then de-dupe the list, then prioritize into mvp, v1, v2... so we can move this along.

@longlho I believe this (process, MVP, roadmap, goals) must be addressed in #4, where we can define everything related to organization and process as a team.

Regarding this task and how we organize the list, I think the previous proposal fits our current needs. I did not propose a deadline for this task, but I see the next meeting as a good candidate to prioritize/filter/de-dupe the items originated in this thread. Finally, we can review #4 to close all the organizational issues, deadlines and goals.

Meanwhile, I'm referencing your comments in #4

PS: just added these topics to the next meeting agenda

MickMonaghan commented 4 years ago

Right now, in ICU4J, if you do: "You owe {someNumber, number, currency}." - then the actual currency is inferred from the current locale - which is just nasty.

You can do this: "You owe {someNumber, number, :: currency/JPY}." - but this means that you know in advance that you're dealing with a specific currency - JPY - in this case. One should be able to declare the actual currency at run time. Perhaps Fluent already supports this?
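
As a point of comparison, the ECMA-402 `Intl.NumberFormat` API already takes the currency as a run-time option rather than inferring it from the locale. A minimal sketch follows; `formatOwed` is a hypothetical helper, and the English sentence is hard-coded only to keep the example short.

```javascript
// Sketch using the standard Intl.NumberFormat API: the currency code travels
// with the value at run time instead of being baked into the pattern or
// inferred from the locale. formatOwed is a hypothetical helper name.
function formatOwed(locale, amount, currencyCode) {
  const money = new Intl.NumberFormat(locale, {
    style: "currency",
    currency: currencyCode,
  }).format(amount);
  return `You owe ${money}.`;
}

console.log(formatOwed("en-US", 1200, "JPY"));
console.log(formatOwed("en-US", 1200, "EUR"));
```

The locale still controls digit grouping and symbol placement, but never which currency is meant.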

nbouvrette commented 4 years ago

Sorry for joining the conversation late and having to leave the last session early but here is my take:

Make the syntax cross-language/cross-platform. Maybe having an RFC and/or improved (non-technical) documentation of the syntax would help?
See if we can make the syntax easier to read (not just for developers, but presuming "raw" syntax could also be translatable by linguists)
Provide free tools with the syntax for authoring and translation (our own online CAT tool?)
Extend selectors (I like @jamuhl's example and will have others to present in the next session)
File format-agnostic - not all TMSs do a good job supporting file formats. If the syntax is independent it makes it more flexible to adopt
Leave the syntax markup (e.g. HTML) agnostic - the syntax should be able to accept HTML or any other markup but the TMS and/or library can implement manipulation however it finds best for its use case
Find better ways to escape the syntax (' is way too common and the current escape patterns could be possibly standardized/simplified)
Add more features:
- Predefined Linguistic selectors (will be presenting this idea in the next meeting)
- Improved list support
- Better currency support
- More flexible formats (extendable inline?)
- Numbers to "written numbers" convertor?
- Inflections (genders, articles, declensions, etc.)

MickMonaghan commented 4 years ago

Can we keep the language used to retrieve the UI strings separate from the language/locale used to format variables/placeholders within a string? This would be consistent with how some OSs and some string formatting libs already separate UI language from locale formats.

zbraniecki commented 4 years ago

Perhaps Fluent already supports this?

Fluent does support it, it's called "partially formatted variables" and currency was the particular example that drove that feature.

The way it works in Fluent is this:

ctx.format('product-cost', {
  amount: FluentNumber(342, {
    currency: "JPY",
  })
});

// Translation can just use "default" formatting options
product-cost = This product costs { $amount }

// Or a translation can specify its own list of options (based on ECMA402 NumberFormat)

product-cost = This product costs { NUMBER($amount, minimumFractionDigits: 3) }

An important bit is that the selector (NUMBER) limits which options can be provided by the translator - in case of number, currency is not available for the localizer to specify.

zbraniecki commented 4 years ago

Provide free tools with the syntax for authoring and translation (our own online CAT tool?)

Fluent comes with a CAT tool - https://github.com/mozilla/pontoon / https://pontoon.mozilla.org/ A lot of effort in Pontoon at the moment goes into better WYSIWYG for Fluent selectors.

Leave the syntax markup (e.g. HTML) agnostic - the syntax should be able to accept HTML or any other markup but the TMS and/or library can implement manipulation however it finds best for its use case

I'm not sure if I agree. Features like compound messages are important only when you look at the problem in context of UI widgets. The drive to be agnostic may lead to a syntax that is not really optimized for anything. While I agree that we should ensure the syntax and data model are useful for wide range of software use cases (and not, say, just for Web/React), having some "P1" targets would help us bring something actually useful imho. In particular, from my angle, understanding that Software UI is not created by a bunch of imperative calls from JS/C/Java, but is usually defined in some declarative markup is fundamental to how you design features. If we reject this hypothesis, it will have deep implications on what we end up with.

grhoten commented 4 years ago

I previously gave a presentation called Let's Come To An Agreement About Our Words. The presentation covers an older format that we used in Siri, and we're migrating to a newer simplified format. Here are some highlights of what it can do and what we found desirable.

It's generally an XML format. The original would use something like ECMAScript/Java beans/UEL for referencing variables and its properties. The UEL syntax was too complicated and was changed to favor more XML with a nicer editor, much like your favorite word processor stores its data in XML without the end user really knowing that low level detail. It's also easier to interchange it with XLIFF when it's XML.
Support for SSML is very desirable for screen readers or virtual assistants.
The messages are by default both printable and speakable, but you can exclusively print or speak a phrase. If you ever need to explicitly speak a number within a given context, this is critical.
Word inflection and grammeme detection (values of grammatical categories) are fundamental parts of the syntax. It's critical functionality with user provided vocabulary. Generally, you need to know the grammatical number, grammatical case, the grammatical gender of the words and the pronunciation of the word (generally just if the word starts or ends with a vowel).
Word inflection can include adding prepositions, articles, pronouns or grammatical states of a given word. For complicated examples, check out Russian, Korean or Arabic.
Number pronunciation is provided by CLDR's RBNF.
Getting a number and noun into grammatical agreement is critical. The grammatical gender of the number comes from the noun. The grammatical number of the noun is generally affected by the value of the number (e.g. 1 or 2). The grammatical case is defined by the translator given the context of the sentence. The translator does not provide the exact inflections by default.
List handling involves inflecting each word. This might mean making each item the definite form.
The "and" (AKA conjunction) list, and the "or" (AKA disjunction) list are able to handle the context correctly for Italian, Spanish and Korean.
There is also the adjective list, which is probably the hardest to get correct for English. For Chinese and Korean, it's a lot easier.
There is a calendar concept based mostly on CLDR's translations. Some functionality is provided to add preposition or postpositions as needed. The grammatical case can be modified as needed. CLDR doesn't handle grammatical case modification that well by default.
There is a measurement concept that is separate from CLDR's implementation to provide precise translations of units of measure, like kilometers and miles. CLDR is more focused on the printable form instead of the speakable form, which is why CLDR is generally ignored when the speakable form is also needed.
It has a highly customized currency concept. CLDR only partially covers support for this functionality. Pronunciation of a currency for its units and subunits in native and foreign contexts is important.
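
For readers less familiar with the "and" / "or" lists mentioned above, the closest standard building block in JavaScript is `Intl.ListFormat`, which applies CLDR list patterns. It only joins items; the inflection of each item described above is not part of that API. A small sketch, where `formatList` is a hypothetical wrapper:

```javascript
// Sketch using the standard Intl.ListFormat API. It joins already-formatted
// items with locale-aware separators; it does not inflect the items, which
// is the harder part described in the list above.
function formatList(locale, items, type) {
  return new Intl.ListFormat(locale, { style: "long", type }).format(items);
}

console.log(formatList("en", ["red", "green", "blue"], "conjunction"));
console.log(formatList("en", ["red", "green", "blue"], "disjunction"));
```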

This functionality works or is shipped on Linux, macOS, iOS, tvOS and watchOS. The watchOS support is probably the important thing to highlight because it is the most resource restrictive environment to support. I'm just stating that this functionality can live in resource constrained environments where grammatical correctness of a message is important.

zbraniecki commented 4 years ago

Can we keep the language used to retrieve the UI strings separate from the language/locale used to format variables/placeholders within a string? This would be consistent with how some OSs and some string formatting libs already separate UI language from locale formats.

While we definitely experienced a very vocal community of users of Firefox who want to use different translation from locale formats, this has also been a trap for regular users because date/time formats often contain translations.

For example, Japanese 2020年1月13日星期一下午12:03:10 or 星期一下午12時 (for { weekday: "long", hour: "numeric" }) would be very confusing if placed in a sentence with different locale.

There are even extreme cases. If the user had a German translation with a date formatted in en-US, there's a chance of flipping the MM/DD and DD/MM order. If the sentence is in German, the user has the right to interpret "05/08" using the German "DD/MM" pattern, and be very surprised if they later learn that it was actually en-US "MM/DD" taken from their OS locale formatting preferences.

My initial position is that we generally should, by default, format placeables (numbers, dates etc.) using the same locale as the translation is in, and allow the developer to provide an alternative language negotiation for formatters in order to handle exceptions like you mentioned.

This is also important once we start talking about the error handling UX. Fluent has been designed to fall back using a locale chain, so if there's an error or missing string in the primary language, we'll fall back on the second-best choice rather than display an error and break the app. It's an important resilience measure for us. What's interesting is that this means the locale chain used for formatters is per-bundle, so that in the locale context ["fr-CA", "fr", "en"] we first try to localize a message in fr-CA using fr-CA formatters, but if there are errors and we end up localizing the message using en resources, we'll format the date/times using the en locale.
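
The per-bundle behaviour described here can be sketched roughly as follows. This is not the Fluent API; the bundle shape, the message id and `formatMessage` are invented for illustration. The point is only that the formatter locale is taken from the bundle that actually supplied the message.

```javascript
// Hypothetical sketch (not the Fluent API): each bundle holds the messages of
// one locale, and placeholders are formatted with the locale of the bundle
// that actually supplied the message.
const bundles = [
  { locale: "fr-CA", messages: {} },
  { locale: "fr", messages: {} },
  { locale: "en", messages: { "last-seen": "Last seen on { $date }" } },
];

function formatMessage(bundleChain, id, args) {
  for (const { locale, messages } of bundleChain) {
    const pattern = messages[id];
    if (pattern === undefined) continue; // missing here: try the next locale
    const dateFormat = new Intl.DateTimeFormat(locale, {
      year: "numeric",
      month: "long",
      day: "numeric",
      timeZone: "UTC",
    });
    return { locale, text: pattern.replace("{ $date }", dateFormat.format(args.date)) };
  }
  // Last resort: show the id rather than throwing and breaking the UI.
  return { locale: null, text: id };
}

const result = formatMessage(bundles, "last-seen", {
  date: new Date(Date.UTC(2020, 0, 13)),
});
console.log(result.locale, result.text);
```

Because the message only exists in the `en` bundle, both the sentence and the date come out in English, which avoids a French-formatted date inside an English sentence.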

zbraniecki commented 4 years ago

@grhoten - this is awesome! Thank you for sharing!

We have some experience with TTS in form of Common Voice project which uses Fluent.

While I don't see it in the translation resources they use now, I remember that in some variant of the project they used fluent's compound messages to represent the spoken/written difference:

time-is =
    .written = { $time }
    .spoken = The time is { $time }

It was an unexpected use of the compound messages, but brought up the idea that having message variants that are recognized as a single unit (with comments, invalidation rules, fallbacking together etc.) is important.

mihnita commented 4 years ago

Most OSes allow for a separation between the formatting locale and the resource locale, but it is not always explicit.

It is a really useful thing for regional variants. Most applications are localized into Spanish, French, Arabic, etc. Rarely is there a "flavor" like Spanish-Latin America.

But there are tens of countries using each of these languages, and they use different date / time / number formats.

So for the user it is best if one can use the French-Swiss locale (for example), and that will format things for fr-CH, but load the fr resources, with fallback.

If the fallback is granular enough (for instance on Android and Java it is string level) then one can have (for example) everything translated into French, and a document (or string) for fr-CH to cover country specific stuff (think legal, or special functionality)

Not all systems have a way to tell that the strings really come from "fr". The "application locale" is fr-CH, and that is used for everything.

So you never get weird mixtures like French strings + German dates.

But I think that we should do better than to format using the same locale as the translation.

Not the same locale, but not 100% independent either.
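
A rough sketch of that middle ground, under stated assumptions: the resource table, the keys and the helper names are made up, and the fallback chain is a naive subtag truncation rather than real language negotiation. Strings are looked up per key through the chain, while values are always formatted with the full application locale.

```javascript
// Hypothetical sketch: string lookup falls back from "fr-CH" to "fr" per key,
// while numbers are formatted with the application locale "fr-CH".
function fallbackChain(locale) {
  // "fr-CH" -> ["fr-CH", "fr"]; naive truncation, not full BCP 47 negotiation.
  const parts = locale.split("-");
  return parts.map((_, i) => parts.slice(0, parts.length - i).join("-"));
}

// Made-up resources: everything in "fr", one country-specific string in "fr-CH".
const resources = {
  "fr": { greeting: "Bonjour", total: "Total : {amount}" },
  "fr-CH": { legal: "Droit suisse applicable" },
};

function lookup(appLocale, key) {
  for (const candidate of fallbackChain(appLocale)) {
    const table = resources[candidate];
    if (table && table[key] !== undefined) return table[key];
  }
  return undefined;
}

function formatTotal(appLocale, amount) {
  // The value is formatted for the application locale, not the string's locale.
  const number = new Intl.NumberFormat(appLocale).format(amount);
  return lookup(appLocale, "total").replace("{amount}", number);
}

console.log(lookup("fr-CH", "legal"));
console.log(formatTotal("fr-CH", 1234567.5));
```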

I can explain how that works in Android, for example.

Cheers, Mihai

mihnita commented 4 years ago

About extended plurals, like:

{ count , plural ,
   =0 {No candy left}
  one {Got # candy left}
  <10 {Got a few candies left}
  10-20 {Got a handful candies left}
other {Got # candies left} }

And it was a huge problem for proper localization. It was banned in most places I've been.

grhoten commented 4 years ago

"You owe {someNumber, number, currency}." - then the actual currency is inferred from the current locale - which is just nasty.

@MickMonaghan I agree. Actually, currency formatting that I've been involved with disallows this scenario. Currency formatting is a measured unit and not a number. The unit has to be explicitly defined outside of the current message.

zbraniecki commented 4 years ago

I am quite reluctant about it.

I agree with @mihnita. Such translations are rejected by the Mozilla L10n Drivers and the logic we use is that this is not a plural-based variant of the same string, but a set of separate strings, and which one to use should depend on some other selector than a localizer trying to build a selection like in the example. We documented that recommendation in https://github.com/projectfluent/fluent/wiki/Good-Practices-for-Developers#prefer-separate-messages-over-variants-for-ui-logic

mihnita commented 4 years ago

About editors for developers / translators: I would rather have a standard mapping to XLIFF for translators. It would work better with the existing tools, instead of forcing translators to "get out" of their existing tools, edit somewhere else, then bring the string back in (usually with copy/paste), and do that every time one needs to fix something.

Similar with developers: it is better to provide plugins for existing IDEs (Eclipse, Intellij, Visual Studio Code) than a standalone editor. And we don't need to write those plugins ourselves.

mihnita commented 4 years ago

Some extra bullets to the wish list. I've tried to not add things already listed, but I am not sure I managed 100%.

Support the union of functionality of both Fluent and MessageFormat, even if the final syntax looks like neither.
Plural / select / ordinal (more?) should apply to the full messages, not fragments (which is usually bad i18n)
Need the ability to add metadata for messages AND placeholders.
Allow parameters to get metadata from translators or from automated systems. For example if a message has a parameter with 10 possible variants (from resources) a translator (or a "service") might be able to add a piece of metadata saying that this is a "noun, singular, masculine". Kind of related to inflections, but not really. I think I need to add more info on this.
Ability to protect sections of the message
Open / close / standalone placeholders, and flags for placeholders. See canCopy / canDelete / canOverlap in the XLIFF 2.1 spec (http://docs.oasis-open.org/xliff/xliff-core/v2.1/os/xliff-core-v2.1-os.html). This might overlap a bit with the html support, but it is a bit more generic.
Design things to be very modular.

Some thoughts on modularity:

There should be a "resource manager" that loads messages and deals with language negotiation, fallback, etc. That would be "an interface", with a default implementation, but developers should be able to provide their own. That deals not only with strings, but also other localizable resources (sound, video, images, styles, etc). It also works with the MessageFormat to load referred messages.
The syntax of the arguments passed to an API should be specified separately from the full storage format
It would be nice if the syntax used without an API (for binding) were very close to the one for the API.

zbraniecki commented 4 years ago

No positional variables. No foo {0} bar and so on. All variables should have ids, both to improve readability and error recovery.
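
A minimal sketch of what named variables buy in terms of error recovery. The `{name}` placeholder syntax and the `interpolate` helper are assumptions for this example, not a proposed syntax.

```javascript
// Hypothetical sketch: interpolate {name}-style variables by id. A missing
// variable is reported by name and left visible in the output instead of
// aborting the whole message.
function interpolate(pattern, args) {
  const errors = [];
  const text = pattern.replace(/\{(\w+)\}/g, (whole, name) => {
    if (Object.prototype.hasOwnProperty.call(args, name)) {
      return String(args[name]);
    }
    errors.push(`missing variable "${name}"`);
    return whole;
  });
  return { text, errors };
}

console.log(interpolate("{user} sent {count} files", { user: "Mary", count: 3 }));
console.log(interpolate("{user} sent {count} files", { count: 3 }));
```

With ids, a translator can reorder variables freely and an error message can say which one is missing; with `{0}` and `{1}` neither is obvious from the string alone.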

kipcole9 commented 4 years ago

Great initiative and for me perfect timing given I've been implementing functions for CLDR over the last 2 years (not based upon ICU). Reflecting on that experience and the great comments from here from people who have vastly more experience than me, I offer the following thoughts on requirements:

TL;DR (summary of thoughts)

Focus on a standard message format that can be expressed in at least standardised string and HTML formats
Use a format that has at least a good chance that the UI designer, the developer and the translator can grasp
Include an interchange format as part of the spec, since it facilitates sharing and integration with tooling and will be an important part of driving adoption, but don't include a storage representation in the spec.

Problem domain

The WG is called "Message Format". Taking that spirit it would seem the shared domain of interest, irrespective of development language or deployment platform, is defining a canonical format for localisable messages. The API for such messages would, it seems to me, be an implementation detail outside the scope of the WG.
The purpose of messages is to express common intent between a UI designer, a developer, a translator and a user. So irrespective of the representation (or representations) chosen, to the extent possible, reading the message in the code should convey intent that is largely understandable by all stakeholders (ok, not the user).
It would also seem in scope to define a standard interchange format (see below). Development and runtime environments vary a lot but each benefits from sharing data and integrating into CAT and other tooling.

Format representations

There are at least three representations useful for messaging I can see:

Storage representation. Think .pot files which, despite being gettext oriented, do appear to vaguely recognise other messaging formats. But really, this is the typical resource bundle in some format appropriate to the development and runtime environments. I would propose this is not in scope for the WG since there will be a lot of variability. One representation I'm working on doesn't even have a static resource bundle but has updatable translations via websockets (server-side orientation)
Interchange representation. The canonical representation that can be shared amongst all implementations of whatever comes out of the WG. Arguably this is one of the reasons that gettext has strong adoption - a common file format that has a lot of tool support. XLIFF 2.1 would appear a strong candidate since it has a formal structure and specification and it is supported by CAT tools. But it isn't (by design) easy to consume for UI experts or translators.
Source code representation(s). I see comments here mostly around string-based and HTML-based representations which makes a lot of sense. In each and any case I would like to see a format that is not white-space sensitive. The reason being that eventually some tooling has to decide if message a is just a transformation of message b and a common approach is hashing the message. In this case a canonical format of the message is required so that hashing is consistent. And that's hard to do if the format is whitespace sensitive (as the current ICU message format is).
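
To illustrate the hashing argument: if whitespace is not significant, a canonical form can be as simple as collapsing whitespace runs before hashing. The canonicalization rule below is an assumption made for the example, not a proposal; it would be wrong for any format where whitespace carries meaning.

```javascript
const { createHash } = require("crypto");

// Hypothetical canonical form: trim the message and collapse every run of
// whitespace into a single space. Only valid if the format is NOT
// whitespace-sensitive, which is exactly the requirement argued for above.
function canonicalize(message) {
  return message.trim().replace(/\s+/g, " ");
}

function messageHash(message) {
  return createHash("sha256").update(canonicalize(message), "utf8").digest("hex");
}

// Same message, different layout in the source file: the hashes match.
console.log(messageHash("Hello,   {name}!\n") === messageHash(" Hello, {name}! "));
```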

Relationship to CLDR

I see several comments reflecting that message formatting in some areas would benefit from enhancing CLDR data. Formatting units of measure is a good example. Without building an unreasonable dependency, making recommendations to CLDR would be very useful in advancing the overall i18n, l10n world.

grhoten commented 4 years ago

There is one additional thing to mention. Good pronoun handling is hard to do. Arabic is by far the hardest to do. You morphologically attach a suffix to the given user vocabulary, which isn't trivial string concatenation. You need to know the gender of the pronoun subject, the grammatical number (singular, dual or plural), and you need to know the type of pronoun (e.g. possessive, reflexive and so on). Without this, Arabic speakers have to rewrite translations in less natural grammar.

Hebrew also has to know the gender of the people being referenced.

In German and Russian, it's more about the gender of the noun instead of the person being referenced.

You can also get into issues with how people want to be referred to. A person may not like a pronoun involving "him", "himself", "her" or "herself" for various reasons.

I have yet to see really good pronoun handling. It's hard to get correct.

srl295 commented 4 years ago

This is a use case but has requirement implications in terms of specification:

Enable round trip through XLIFF (and possibly other localization formats). In other words, there should be a well-defined way to convert between such a message format and XLIFF.

Example: http://docs.oasis-open.org/xliff/xliff-core/v2.1/os/xliff-core-v2.1-os.html#dataref

This example from the XLIFF spec shows some kind of message format Error in {0}. converted to XLIFF:

<unit id="1">
  <originalData>
    <data id="d1">{0}</data>
  </originalData>
  <segment>
    <source>Error in '<ph id="1" dataRef="d1"/>'.</source>
    <target>Erreur dans '<ph id="1" dataRef="d1"/>'.</target>
  </segment>
</unit>
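
A sketch of what one direction of such a mapping could look like for a simple pattern. It assumes `{...}` placeholders with no nesting, and the function names are hypothetical; it emits only the `<source>` side and does not escape the unit id.

```javascript
// Hypothetical sketch: turn a pattern with {0}-style placeholders into an
// XLIFF 2.x <unit>, keeping each original placeholder code in <originalData>
// so the message can be reconstructed on the way back.
function escapeXml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function toXliffUnit(unitId, pattern) {
  const codes = [];
  const source = pattern
    .split(/(\{[^{}]*\})/) // the capture group keeps placeholders as pieces
    .map((piece) => {
      if (!/^\{[^{}]*\}$/.test(piece)) return escapeXml(piece);
      codes.push(piece);
      return `<ph id="${codes.length}" dataRef="d${codes.length}"/>`;
    })
    .join("");
  const lines = [`<unit id="${unitId}">`];
  if (codes.length > 0) {
    lines.push("  <originalData>");
    codes.forEach((code, i) => {
      lines.push(`    <data id="d${i + 1}">${escapeXml(code)}</data>`);
    });
    lines.push("  </originalData>");
  }
  lines.push("  <segment>", `    <source>${source}</source>`, "  </segment>", "</unit>");
  return lines.join("\n");
}

console.log(toXliffUnit("1", "Error in {0}."));
```

Selectors such as plural or select have no equally obvious mapping, which is where a well-defined conversion would need the most care.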

MickMonaghan commented 4 years ago

Can we keep the language used to retrieve the UI strings separate from the language/locale used to format variables/placeholders within a string? This would be consistent with how some OSs and some string formatting libs already separate UI language from locale formats.

While we definitely experienced a very vocal community of users of Firefox who want to use different translation from locale formats, this has also been a trap for regular users because date/time formats often contain translations.

For example, Japanese 2020年1月13日星期一下午12:03:10 or 星期一下午12時 (for { weekday: "long", hour: "numeric" }) would be very confusing if placed in a sentence with different locale.

There are even extreme cases. If the user had a German translation with a date formatted in en-US, there's a chance of flipping the MM/DD and DD/MM order. If the sentence is in German, the user has the right to interpret "05/08" using the German "DD/MM" pattern, and be very surprised if they later learn that it was actually en-US "MM/DD" taken from their OS locale formatting preferences.

My initial position is that we generally should, by default, format placeables (numbers, dates etc.) using the same locale as the translation is in, and allow the developer to provide an alternative language negotiation for formatters in order to handle exceptions like you mentioned.

This is also important once we start talking about the error handling UX. Fluent has been designed to fall back using a locale chain, so if there's an error or missing string in the primary language, we'll fall back on the second-best choice rather than display an error and break the app. It's an important resilience measure for us. What's interesting is that this means the locale chain used for formatters is per-bundle, so that in the locale context ["fr-CA", "fr", "en"] we first try to localize a message in fr-CA using fr-CA formatters, but if there are errors and we end up localizing the message using en resources, we'll format the date/times using the en locale.

Thanks @zbraniecki. Yes - I agree that:

this scenario can - and often does - lead to mixed language strings. - or possibly even confusing formats within a string
that we should, by default, format placeholders (when was the term 'placeables' introduced to the g11n lexicon?) according to the user's chosen language - unless they choose something different

My main reasons for advocating the separation of UI language from placeholder formatting are:

architecturally, retrieving translated strings and formatting placeholders are 2 completely separate operations. Translations have a cost - they have to be paid for. Placeholder formatting is totally free - it comes from a lib. (Note that applications can always choose to simply format the placeholders according to the user's chosen language - and disallow the choosing of a separate formatting locale)
If my products are used in markets whose languages I do not support, at least customers will be able to get properly formatted dates/times/numbers/calendars etc.

duerst commented 4 years ago

For example, Japanese 2020年1月13日星期一下午12:03:10 or 星期一下午12時 (for { weekday: "long", hour: "numeric" }) would be very confusing if placed in a sentence with different locale.

It would be very confusing in a Japanese text, too. It may not be confusing in a Chinese text. In Japanese, it would look much better (actually readable) as '2020年1月13日月曜日午後12:03:10'.

In general, I agree that formats for formatting instructions and formats for text translation are two different things and should be kept independent of each other.

echeran commented 4 years ago

I want to add to this list my desire that when we approach the problem of [providing input to] message formatting, we tease apart the data model of the input from the specifics of syntax / format for each implementation. (After talking with @mihnita, I think we should also ensure that we have a way to represent this MF data model in XLIFF, as well as implement a JS-friendly API for ECMA-402 that supports the model.)

The importance of focusing on the data model is that if we get the structure of the data correct (ex: represent each piece of info only once, put it in the right place), then each implementation (ex: Fluent, FBT, ICU MessageFormat, etc.) is free to implement support for that input data according to their needs & tastes, and/or even support a superset of the input data if they have extra custom functionality.

I'll give a different, slightly more concrete example of that. Before doing so, let me say that there is a small number of essential constructs for representing data, but they come under many synonyms, so I will list out categories of synonyms for the ideas I will talk about:

structure of data: data model / schema / object model / data dictionary
associative data: map / record / struct / object / message / dictionary
sequential data: list / vector / array / sequence

Instead of coming up with a new language to encode an example scenario of a data model, I'll use an existing IDL (ex: Protobuf, Thrift, Avro). To show that the choice of IDL doesn't matter and is just for example's sake, I'll choose an IDL here that I've personally never used, including not used at my work -- Thrift.

So a made-up example of what the data model might be would look like:

struct MessageFormatInput {
  1: string id,
  2: MessageFormatPattern template,
  3: list<Arg> args,
  4: string locale,
  5: list<Placeholder> phs,
}

struct MessageFormatPattern {
  1: list<Part> parts,
}

union Part {
  1: string text,
  2: Placeholder ph,
}

struct Placeholder {
  1: string id,
  2: PlaceholderType phType,
  3: map<string,string> options,
  4: ...
}

enum PlaceholderType {
  UNDEFINED = 1,
  OTHER = 2,
  PLURAL = 3,
  GENDER = 4,
}

struct Arg {
  1: string phId,
  2: ArgVal value,
}

enum Gender {
  OTHER = 1,
  MALE = 2,
  FEMALE = 3,
}

enum PluralType {
  // ZERO, ONE, TWO, FEW, MANY, OTHER, EXACT
}

That is just an example -- the real thing would need discussion and iteration. (Example: Is string sufficient to represent locale? Shouldn't we refactor MessageFormatPattern into a MessageFormatTemplate that contains either a single pattern or a list of patterns depending on whether we have cases for selection-style placeholders?)

The main point is that the implementations -- JS, Python, Fluent, FBT, ICU MessageFormat, etc. -- can implement APIs to support this model accordingly. A C++ client might find the ICU string-oriented pattern syntax expedient, whereas a JS client might feel it more intuitive to use an API that represents the associative data (Thrift structs) as JS objects and sequential data (Thrift lists) as JS arrays. And as we formulate the data model, we can make sure the data can still be represented as XLIFF, so that it integrates with localization use cases.
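A minimal sketch of how one implementation might mirror the made-up model above in its own idiom, here Python dataclasses. The simplifications are assumptions for illustration only: the pattern is flattened into a list of parts, `args` is a plain dict of already-formatted strings, and `format_message` is a hypothetical helper, not part of any proposal.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Union


class PlaceholderType(Enum):
    UNDEFINED = 1
    OTHER = 2
    PLURAL = 3
    GENDER = 4


@dataclass
class Placeholder:
    id: str
    ph_type: PlaceholderType = PlaceholderType.UNDEFINED
    options: Dict[str, str] = field(default_factory=dict)


# A pattern part is either literal text or a placeholder (the `union Part` idea).
Part = Union[str, Placeholder]


@dataclass
class MessageFormatInput:
    id: str
    locale: str
    parts: List[Part]
    # Assumption: placeholder id -> value that is already formatted as text.
    args: Dict[str, str]


def format_message(msg: MessageFormatInput) -> str:
    """Concatenate literal parts, substituting placeholder values by id."""
    out = []
    for part in msg.parts:
        if isinstance(part, Placeholder):
            out.append(msg.args[part.id])
        else:
            out.append(part)
    return "".join(out)


msg = MessageFormatInput(
    id="greeting",
    locale="en",
    parts=["Hello, ", Placeholder("userName"), "!"],
    args={"userName": "Mary"},
)
print(format_message(msg))  # Hello, Mary!
```

The same structure could be serialized to a string syntax, to JSON, or to XLIFF without changing what information is carried, which is the point of separating the model from the syntax.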

Iterating on the data model for the input to message formatting is a non-trivial task, but I think that discussion goes hand-in-hand with what user-visible functionality we support. So I think that if we put the focus there in the beginning, then we can decouple questions on the specifics of syntax / format / serialization (at least for now) and leave it to the individual implementers' discretions. The end result should still end up in several implementations that "feel" the same b/c they would all be based on a consensus of what a sufficiently complete set of formatting input data looks like.

jrwats commented 4 years ago

This thread is already branching into a few separate discussions and getting a little hard to follow (might just be this reader of course). I realize we "just got going", but we probably need to start thinking about breaking it up.

Separation of concerns

I'd like to consider separate formats and proposals for runtime (number formats, currency formats, dates), build/parse-time formats (interpolating variables, inner strings, enumerations, plurals, dealing with markup), and translation formats (perhaps translation format is out-of-scope, and we should just say "XLIFF" - I don't know).

NOTE: what I'm calling runtime and build-time formats here could be part of what @kipcole9 is calling source formats, i.e. they are both written by the developer. These could be a part of the same proposal. I can go into more detail here, but I suspect we'll want to break up this thread, so I'll save my breath.

In that same vein, I am 100% in agreement with @echeran's most recent comment

Markup

IMHO, in an ideal world, translators would see no markup and wouldn't be required to author markup either. To support that, we'd still need to support the designer/engineer's ability to wrap arbitrary text in arbitrarily nested markup elements. Engineers/designers shouldn't have to think about how to "connect" their inner/outer (or child/parent) strings for translator context.

FBT supports this at parse time by

Both connecting the description (metadata) for a string with its parent string and
Also providing a "link" between parent and child strings at runtime.

Thoughts on Source format

FBT uses a build-time transpiler, built on Babel, to accomplish its parsing and transpilation to runtime source. This has become somewhat standard practice. Type systems, for instance, necessitate some kind of transformation (TypeScript and Flow, to name names) when authoring JS canonically.

I'm mentioning it here to note that the source (AKA designer/engineer-written) format that we attempt to define need not be vanilla JavaScript despite it eventually making its way into the JavaScript runtime.

nbouvrette commented 4 years ago

Just caught up to the thread - very interesting discussion and information shared. I have a few questions/comments:

@zbraniecki

Fluent comes with a CAT tool - https://github.com/mozilla/pontoon / https://pontoon.mozilla.org/ A lot of effort in Pontoon at the moment goes into better WYSIWYG for Fluent selectors.

Actually, isn't Pontoon closer to a TMS, since it does project management and, I presume, also takes care of translation memory? It does seem to include its own CAT tool as well.

On a related topic, not sure if you have the answer but I started reading a lot of Fluent this week and I was curious about what "brand names" in Polish would look like in a TMS or even authoring (end to end) flow? If I understand correctly, Multi-variant Messages (https://github.com/projectfluent/fluent/wiki/Fluent-and-ICU-MessageFormat#multi-variant-messages) are basically similar to a function with a parameter?

This seems super powerful but I would be curious how easy it is to implement using which tools and process?

@zbraniecki

Features like compound messages are important only when you look at the problem in context of UI widgets. The drive to be agnostic may lead to a syntax that is not really optimized for anything.

Could you provide some examples of what kind of optimization should be considered part of the syntax?

I'm having a hard time picturing it because if I think of HTML for example, I can see a limited amount of markup (bold, italic, etc) that would be used in the syntax if we consider a message to be a single sentence.

I also thought that the focus around the syntax is being able to easily write sentences that look human written while using dynamic data.

Markup to me seems to be more on the integration side when you consume the output of the syntax. Most TMSes already seem to be handling common markup adequately.

@mihnita

About editors for developers/translators: I would rather have a standard mapping to XLIFF for translators. It would work better with the existing tools, instead of forcing translators to "get out" of their existing tools, edit somewhere else, then bring the string back in (usually with copy/paste) And to that every time one needs to fix something.

Are you sure XLIFF is used by all translators? Popular CAT tools probably can support it but what about all the new online TMSes that have varying levels of support for XLIFF?

I still believe that finding a way to keep the syntax independent from a file format would be ideal in terms of flexibility & adoption, but after listening to @grhoten's presentation I am wondering if this is even possible if one of the goals is to offer better inflection support.

@mihnita @zbraniecki

I am quite reluctant about extended plurals.

From my experience, the example that @jamuhl gave was linguist-friendly because it used full sentences.

Without extended syntax like this, you would need to split the "range strings" and add logic in the code to deal with different messages. By splitting this into different strings, there would be a risk that different linguists do the translations, which could also cause consistency issues. But of course, a more powerful syntax can be used the wrong way.

Maybe having some sort of syntax linter could help prevent i18n issues while unlocking those use cases?

I don't know Fluent a lot but it seems like this is already an issue they have to deal with? I was impressed with how powerful the syntax is.

@mihnita

Plural / select / ordinal (more?) should apply to the full messages, not fragments (which is usually bad i18n)

How would this work in a sentence that has multiple variables with different plurals?

@mihnita

There should be a "resource manager" that loads messages and deals with language negotiation, fallback, etc.

Regarding fallback, I don't know if Android can do this but it would be perfectly fine for example to have a fallback between fr-FR and fr-CA (before fr and en). If I'm not mistaken, Java only does "locale -> language" fallbacks with ResourceBundles.

For example, if the linguists or authors are from France, I think it makes sense to keep the strings in fr-FR to track its origin. Depending on the source of the text, you can end up with expressions that are only known in certain regions which should match the source locale rather than use a generic language code.

jamuhl commented 4 years ago

@nbouvrette @mihnita

There should be a "resource manager" that loads messages and deals with language negotiation, fallback, etc.

Could be part of the i18n lib -> language negotiation implementation depends on the environment. Fallback in eg. i18next is handled like locale (eg. en-US) -> language (en) -> fallback (one, multiple, different per language,...). In i18next we enable a lot of options to fallback to something better than "nothing" -> language fallbacks, file fallbacks, ... but this can be rather well handled in the i18n implementation itself
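A minimal sketch of the "locale -> language -> fallback" lookup order described above. The function names and the resource layout are assumptions for illustration; this is not i18next's API.

```python
from typing import Dict, List, Optional


def fallback_chain(locale: str, fallbacks: List[str]) -> List[str]:
    """Build the lookup order: exact locale, then bare language, then fallbacks."""
    chain = [locale]
    if "-" in locale:
        chain.append(locale.split("-")[0])
    for fb in fallbacks:
        if fb not in chain:
            chain.append(fb)
    return chain


def lookup(
    key: str,
    locale: str,
    resources: Dict[str, Dict[str, str]],
    fallbacks: List[str],
) -> Optional[str]:
    """Return the first translation found along the fallback chain."""
    for candidate in fallback_chain(locale, fallbacks):
        value = resources.get(candidate, {}).get(key)
        if value is not None:
            return value
    return None  # nothing found anywhere; the caller decides what to show


# Hypothetical resources, keyed by locale.
resources = {
    "en": {"save": "Save", "cancel": "Cancel"},
    "en-US": {"save": "Save"},
    "fr": {"save": "Enregistrer"},
}

print(fallback_chain("fr-CA", ["en"]))             # ['fr-CA', 'fr', 'en']
print(lookup("save", "fr-CA", resources, ["en"]))    # Enregistrer
print(lookup("cancel", "fr-CA", resources, ["en"]))  # Cancel
```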

regarding XLIFF:

Very painful format - it's so powerful and flexible that it is de facto no longer a standard format, as everyone has their own custom set of supported features...take 5 TM tools and you get 5 different xliff files that are hardly compatible with each other (even if using the same xliff version)...if going XLIFF keep it simple...

@nbouvrette

I still believe that finding a way to keep the syntax independent from a file format would be ideal in terms of flexibility & adoption, but after listening to @grhoten's presentation I am wondering if this is even possible if one of the goals is to offer better inflection support.

Fully agree - would be the best option - but like you (not that I can judge this really - I'm not a linguist) I guess that is not possible without losing some features (but on the other hand - how often are those features needed...build a syntax for 9x% of the cases or 100%?!?)

zbraniecki commented 4 years ago

@nbouvrette

On a related topic, not sure if you have the answer but I started reading a lot of Fluent this week and I was curious about what "brand names" in Polish would look like in a TMS or even authoring (end to end) flow?

What we do is sth like this:

-brand-name = { $case ->
   *[nominative] Firefox
    [genitive] Firefoksa
    ...
}

open-browser = Otwórz { -brand-name(case: "genitive") }

You can see live example for Czech - https://hg.mozilla.org/l10n-central/cs/file/tip/browser/branding/official/brand.ftl#l16 - it also adds a gender that is later used in combination:

genitive case + gender - https://hg.mozilla.org/l10n-central/cs/file/tip/browser/browser/browser.ftl#l178
nominative (default) case + gender - https://hg.mozilla.org/l10n-central/cs/file/tip/browser/browser/browser.ftl#l187

The only thing I'd change is try to use the selector values as full sentences rather than just the differential part.
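To the earlier question of whether this is "basically similar to a function with a parameter": a rough analogy in Python, assuming only the two cases shown above. The function names are hypothetical and this is not how Fluent resolves terms internally; it only illustrates a term with a default variant and a caller that asks for a specific case.

```python
from typing import Dict

# Variants of the brand name, keyed by grammatical case (from the example above).
BRAND_NAME: Dict[str, str] = {
    "nominative": "Firefox",
    "genitive": "Firefoksa",
}
DEFAULT_CASE = "nominative"


def brand_name(case: str = DEFAULT_CASE) -> str:
    """Return the requested variant, or the default one if the case is unknown."""
    return BRAND_NAME.get(case, BRAND_NAME[DEFAULT_CASE])


def open_browser() -> str:
    """The Polish message asks for the genitive form of the brand name."""
    return f"Otwórz {brand_name('genitive')}"


print(open_browser())          # Otwórz Firefoksa
print(brand_name())            # Firefox
print(brand_name("locative"))  # Firefox (falls back to the default variant)
```

The set of cases lives with the localized term, so an English file can keep a plain `Firefox` while the Polish file defines as many variants as it needs.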

This seems super powerful but I would be curious how easy it is to implement using which tools and process?

Not sure what the question is exactly. We're working on improving the UX for selectors, but atm it looks like this:

Could you provide some examples of what kind of optimization should be considered part of the syntax?

Well, for example if we aim to use HTML/Web as a test against our future standards, we will have a strong reason to ensure that we are capable of providing a human readable/writable format because of the Principle of Least Power. If we don't, then JSON, being a good "container", may be enough. If we want to target the Web, then Fluent's focus on being lenient in what we accept, and strict in what we produce, becomes part of the guiding philosophy. Fluent syntax and parser have strong recovery logic that is motivated by this principle. If we don't aim for the Web, we can question that principle and reject the requirement for this model.

I'm having a hard time picturing it because if I think of HTML for example, I can see a limited amount of markup (bold, italic, etc) that would be used in the syntax if we consider a message to be a single sentence.

A common problem when translating markup like HTML is the semantic elements that are embedded in text. For example, a paragraph may contain several images, links etc. that should be localized (with attributes, for accessibility reasons for example!), and distributed within the paragraph. Without a good solution, historically, to allow localizers to place the element anywhere in the text, the solution was to create three messages:

my-message-pre = Welcome to
my-message-inside = Mozilla
my-message-post = website.
my-message-title = Link to mozilla.org.

<p>
  &my-message-pre;
  <a href="https://www.mozilla.org" title="&my-message-title;">&my-message-inside;</a>
  &my-message-post;
</p>

Now, that, quite obviously is very fragile, and doesn't scale - try to imagine this with two links in one sentence. Here's Fluent:

my-message =
    Welcome to
    <a data-l10n-name="link1" title="Link to Mozilla.org">Mozilla</a>
    website.

<p data-l10n-id="my-message">
  <a href="https://www.mozilla.org" data-l10n-name="link1"></a>
</p>

Here, the paragraph will be translated to a single language (consistency), localizers can adjust the position of the link and translate its attribute, and it scales (we also are discussing removing the need for data-l10n-name). Several elements of the system were designed around it - error recovery + multiline messages, DOM Overlays etc.
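A simplified sketch of the overlay idea: the translation supplies the text and the localizable attribute (`title`), while the source markup supplies the functional attribute (`href`), matched by `data-l10n-name`. This is an illustration under stated assumptions, not how Fluent's DOM Overlays are implemented: it uses Python's XML parser instead of a DOM, `apply_overlay` is a hypothetical name, and a real implementation would restrict which elements and attributes a translation may set.

```python
import xml.etree.ElementTree as ET


def apply_overlay(source_xml: str, translation: str) -> ET.Element:
    """Merge a translated fragment into the developer-provided element."""
    source = ET.fromstring(source_xml)
    translated = ET.fromstring(f"<root>{translation}</root>")

    # Index the developer-provided child elements by their data-l10n-name.
    named = {
        child.get("data-l10n-name"): child
        for child in source
        if child.get("data-l10n-name")
    }

    # Copy functional attributes (e.g. href) onto the matching translated
    # elements, without overriding what the translation already set.
    for elem in translated.iter():
        name = elem.get("data-l10n-name")
        if name in named:
            for attr, value in named[name].attrib.items():
                elem.attrib.setdefault(attr, value)

    # Rebuild the outer element with the translated content inside.
    result = ET.Element(source.tag, dict(source.attrib))
    result.text = translated.text
    result.extend(list(translated))
    return result


source = (
    '<p data-l10n-id="my-message">'
    '<a href="https://www.mozilla.org" data-l10n-name="link1"/>'
    "</p>"
)
translation = (
    'Welcome to <a data-l10n-name="link1" title="Link to Mozilla.org">'
    "Mozilla</a> website."
)

merged = apply_overlay(source, translation)
link = merged.find("a")
print(link.get("href"), "|", link.get("title"))
print("".join(merged.itertext()))
```

Because the link is identified by name rather than by position, a localizer can move it anywhere in the sentence without touching the URL.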

If you recognize HTML as an important target (or React!) you may be inclined to "free" certain characters for the embedded syntax (>, <, etc.)

I'm not advocating for defining HTML/JS/CSS as the sole target of the syntax, but I think it doesn't hurt to recognize it as something we'd like to make sure our future syntax works well with.

nbouvrette commented 4 years ago

@zbraniecki

This seems super powerful but I would be curious how easy it is to implement using which tools and process?

Not sure what the question is exactly. We're working on improving the UX for selectors, but atm it looks like this:

So I guess my question is more on the translation/authoring process... let's say your input is:

-brand-name = Firefox
open-browser = Open { -brand-name }

My first question is how does the author know he cannot simply do this:

open-browser = Open Firefox

I'm guessing with Fluent and Pontoon, having different keys in different languages might be well supported but most TMSes would expect that all languages contain the same keys. This is one big reason MessageFormat works so well in TMSes because the keys remain the same across languages.

And now, let's presume all authors are familiar with the rules in a certain language for brand names and create both entries, how will the translator come up with the correct cases? Unless I'm mistaken:

If there are no language-specific predefined cases, this also means that each variable could use a different keyword to represent the same case?
For the brand name example, unless doing some sort of white-label, it might be simpler to not use variables at all?
If doing white-labeling, I'm not familiar enough with Fluent yet, but you would need some sort of nested variable? Is this supported?
I'm thinking this sort of mechanism might be more useful when using large datasets (e.g. a list of cities) that would be shared inside an app, but would also be easier to use across tools by predefining cases which might also require a lot of research/work

Are you aware of TMSes (other than Pontoon) that can help linguists support this? Or were you thinking of having external tools or even raw syntax?

I did a test and showed the syntax to someone with no engineering background and it did seem quite scary to them.

A common problem when translating markup like HTML is the semantic elements that are embedded in text.

I'm quite familiar with HTML but less with integrations of MessageFormat or Fluent with popular frameworks. On the other hand, I am experienced with various popular TMSes and I know that HTML is quite well supported. For example, you could easily use:

my-message = Welcome to <a data-l10n-name="link1" title="Link to Mozilla.org">Mozilla</a> website.

And most TMSes would be smart enough to produce the following output to linguists:

String1: Welcome to $0Mozilla$1 website.
String2: Link to Mozilla.org

You will note that HTML tags are replaced by placeholders. All linguists are used to this and can place the placeholders correctly in the translated output. Most popular TMSes can also parse HTML for attributes and include them as separate translatable strings.
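A rough sketch of the extraction step described above: tags become numbered placeholders, the `title` attribute becomes its own translatable string, and the tags are restored after translation. The regular expressions and function names are assumptions for illustration; real TMS filters are more careful (nested markup, escaping, other attributes).

```python
import re
from typing import List, Tuple

TAG_RE = re.compile(r"<[^>]+>")
TITLE_RE = re.compile(r'title="([^"]*)"')


def extract_for_translation(message: str) -> Tuple[str, List[str], List[str]]:
    """Return (text with $N placeholders, original tags, translatable attributes)."""
    tags: List[str] = []

    def to_placeholder(match):
        tags.append(match.group(0))
        return f"${len(tags) - 1}"

    text = TAG_RE.sub(to_placeholder, message)
    attributes = [value for tag in tags for value in TITLE_RE.findall(tag)]
    return text, tags, attributes


def restore(translated: str, tags: List[str]) -> str:
    """Put the original tags back where the linguist left the placeholders."""
    return re.sub(r"\$(\d+)", lambda m: tags[int(m.group(1))], translated)


message = (
    'Welcome to <a data-l10n-name="link1" title="Link to Mozilla.org">'
    "Mozilla</a> website."
)
text, tags, attributes = extract_for_translation(message)
print(text)        # Welcome to $0Mozilla$1 website.
print(attributes)  # ['Link to Mozilla.org']

# Round trip: restoring the untranslated text gives back the original message.
print(restore(text, tags) == message)  # True
```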

If our goal is to create a syntax that can be used by most companies, using normal tools I think that we should also use the TMSes for what they are good:

Most TMSes are focused on key/values (translation memory is heavily based on this)
TMSes love simple, existing, filetypes (e.g. .properties file, but probably not XLIFF)
TMSes are good with popular markup like HTML

I think if we agree that we can let TMSes deal with markup, it will be easier to focus on linguistic problems.

longlho commented 4 years ago

I think the final format being consumed by a lot of platforms should just be simple key/value, like @nbouvrette said (more examples include Android strings.xml & iOS .strings format), but the intermediate representation can be something else that caters more to build systems. This doesn't prevent interop between the consumed format & the intermediate format (which can be as complex as an AST with external references and such).

So basically what @kipcole9 said on Storage vs Interchange vs Src Representations 😄

mihnita commented 4 years ago

the solution was to create three messages:

my-message-pre = Welcome to
my-message-inside = Mozilla
my-message-post = website.
my-message-title = Link to mozilla.org.

That is horrible! :-) In this day and age most CAT (Computer-Aided Translation) tools support HTML out of the box.

zbraniecki commented 4 years ago

In this day and age most CAT (Computer-Aided Translation) tools support HTML out of the box.

The limitation wasn't CAT. It was the l10n system we used (DTD! :)).

@nbouvrette - is there a place where we could continue this thread without taking over the requirements one?

romulocintra commented 4 years ago

@nbouvrette - is there a place where we could continue this thread without taking over the requirements one?

Please open a new issue. This new thread can work as knowledge sharing about MF and related topics.

mihnita commented 4 years ago

regarding XLIFF:

Very painful format - it's so powerful and flexible that it is de facto no longer a standard format, as everyone has their own custom set of supported features...take 5 TM tools and you get 5 different xliff files that are hardly compatible with each other (even if using the same xliff version)...if going XLIFF keep it simple...

I know pretty well how bad the support is, unfortunately. The power and flexibility are factors that contributed to the lack of adoption (that, and some attempt (intentional or not) from some vendors to keep customers locked-in)

But there is a "core" of functionality that most tools support. I've used XLIFF in several places before, for major translation (and for all kinds of content: software resources, html documentation, databases). And I know of several big companies doing that. If you know what is supported, it works.

For me this is an argument FOR XLIFF:

if we don't specify how our format maps to XLIFF we only contribute to the mess, as everyone implementing this format will produce different XLIFF files
if XLIFF 1.2 (final spec in Feb 2008) is still not properly supported, how long will it take for CAT tools to support a new format?

And this is also an argument for developing the format while considering at all times how that will interact with existing CAT tools. That includes not only how things are presented to the translators, but how leveraging works (or not). Most CAT / TMs assume a 1:1 model "you give me a source message, I give you back a translated message". When the input is 2 messages (singular / plural) and the output is 4 messages (for example because Russian has 4 plural forms), then we run into problems.
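A small sketch of why the 1:1 assumption breaks. The category rules below are the CLDR cardinal rules for English and Russian restricted to integers (for Russian, `other` is only reached by fractional numbers, so it never comes out of this integer-only function). The sample strings are illustrative.

```python
def plural_category_en(n: int) -> str:
    """English cardinal plural category for an integer."""
    return "one" if n == 1 else "other"


def plural_category_ru(n: int) -> str:
    """Russian cardinal plural category for an integer."""
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


# The source has two messages; the Russian target needs four.
source_en = {"one": "# file", "other": "# files"}
target_ru = {
    "one": "# файл",
    "few": "# файла",
    "many": "# файлов",
    "other": "# файла",  # used for fractions such as 1.5
}

for n in (1, 3, 5, 11, 21):
    en = source_en[plural_category_en(n)].replace("#", str(n))
    ru = target_ru[plural_category_ru(n)].replace("#", str(n))
    print(f"{n}: {en} -> {ru}")

# A tool that pairs each source string with exactly one target string has
# nowhere to put the extra Russian forms.
print(len(source_en), "source forms vs", len(target_ru), "target forms")
```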

So the more powerful our format is, the lower our chances of being supported by tools. It is not only about writing an import / export filter for a new format, it is about fundamentally changing the internals of all the localization chains. And that will not happen.

I have to admit that I think some of the big software companies also share some fault... (without naming names) These companies decided to create their localization systems, instead of forcing the vendors to support XLIFF properly. Customers have power... And even when they produced XLIFF, they produced horrible "variants". I can show you the difference between an XLIFF produced from Soy messages (Closure Templates) and the same message done properly.

That's why I think it is important to specify HOW to produce proper XLIFFs. See some examples here:

mihnita commented 4 years ago

The limitation wasn't CAT. It was the l10n system we used (DTD! :)).

I know, I had the "honor" to work with that :-)

M.

mihnita commented 4 years ago

XLIFF 2.1 would appear a strong candidate since it has a formal structure and specification and is supported by CAT tools. But it isn't (by design) easy to consume for UI experts or translators.

100% agree. But XLIFF is not intended for UI experts, or even for translators. It is for translation tools. And can be presented in a translator-friendly way.

mihnita commented 4 years ago

Are you sure XLIFF is used by all translators? Popular CAT tools probably can support it but what about all the new online TMSes that have varying levels of support for XLIFF?

From what I've seen, the newer (online) TMSes support XLIFF, often better than the more established ones. (I call them CAT tools if they include more than TM (Translation Memory).) They are newcomers in the market, and supporting a standard that does not lock in the customer is one of the selling points. And I know of at least 3 or 4 of them that use Okapi (http://okapiframework.org/), which supports XLIFF pretty well.

I still believe that finding a way to keep the syntax independent from a file format would be ideal in terms of flexibility & adoption, but after listening to @grhoten's presentation I am wondering if this is even possible if one of the goals is to offer better inflection support.

I think it is possible. Starting with a "Data Model" as described by Elango (@echeran), which has the benefit that it is language independent. Then specify standard ways to map that model to a JS friendly format and to XLIFF.

It would also mean that one can develop C++ friendly, or Java friendly, or XYZ friendly formats that follow the same model, and they will be easy to map between each other. That would support (for example) a way to have the same message used server side from Java / C++ / Go / C# or front-end from JS / Dart / TypeScript, with a simple, automatic conversion, no re-translation.

So I think I would define the goal of this effort as:

design a "Data Model" in a language-independent format
standard way to map that model to / from a JS friendly format (for ECMAScript, because that's how this whole thing started :-)
standard way to map that model to / from XLIFF for localization

===

The Data Model idea would also help make things more syntax neutral. It would be less about the "personal baggage" (MessageFormat vs Fluent vs FBT) and more about what can really be done.

I've tried to play with Fluent and FBT for the last few days, and I think that they have more in common than different.

mihnita commented 4 years ago

I would like to see a format that is not white-space sensitive

There are pros and cons for this.

There are languages that don't use spaces between words (Chinese, Japanese, Thai, etc.) These are very clunky to use in a format that is not white-space sensitive. Everybody else is "free to wrap things", but these languages are forced to keep everything in one line.

The other disadvantage of a format that is not white-space sensitive is that at times one would really want to preserve spaces / new-lines. Then one would need some mechanism to specify that.

So I would not necessarily "carve in stone" this one (yet)

mihnita commented 4 years ago

I'd like to consider separate formats and proposals for runtime (number formats, currency formats, dates), build/parse-time formats (interpolating variables, inner strings, enumerations, plurals, dealing with markup), and translation formats (perhaps translation format is out-of-scope, and we should just say "XLIFF" - I don't know).

+100

This is the "lingo" I used in my "random thoughts" document:

A developer format
A runtime format
A localization format

The concepts don't overlap 100%, but I think it would be good to separate things and make them very modular. "Divide et impera" :-)

mihnita commented 4 years ago

Plural / select / ordinal (more?) should apply to the full messages, not fragments (which is usually bad i18n)

How would this work in a sentence that has multiple variables with different plurals?

Idea (ignore the syntax, keep the idea :-)

{[photoCount userGender], [plural select],
  [   =1   male] {{userName} added a new photo to his stream.}
  [   =1 female] {{userName} added a new photo to her stream.}
  [   =1  other] {{userName} added a new photo to their stream.}
  [other   male] {{userName} added {photoCount} new photos to his stream.}
  [other female] {{userName} added {photoCount} new photos to her stream.}
  [other  other] {{userName} added {photoCount} new photos to their stream.}
}

See https://docs.google.com/document/d/1oiKRfkuCuatT9k459nYwYw3neQ2Vm3rJ4toOu9wNwr4/edit#heading=h.puneskg1pg5z or https://github.com/projectfluent/fluent/issues/4
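A minimal sketch of the idea as a lookup keyed on both selectors at once, so every variant is a full sentence. The English-only plural key and the function names are assumptions for illustration, not a proposed API.

```python
from typing import Dict, Tuple

# One full sentence per (plural key, gender) combination, as in the idea above.
VARIANTS: Dict[Tuple[str, str], str] = {
    ("=1", "male"): "{userName} added a new photo to his stream.",
    ("=1", "female"): "{userName} added a new photo to her stream.",
    ("=1", "other"): "{userName} added a new photo to their stream.",
    ("other", "male"): "{userName} added {photoCount} new photos to his stream.",
    ("other", "female"): "{userName} added {photoCount} new photos to her stream.",
    ("other", "other"): "{userName} added {photoCount} new photos to their stream.",
}


def plural_key(count: int) -> str:
    """Only the two keys used in the example: an exact match for 1, else 'other'."""
    return "=1" if count == 1 else "other"


def gender_key(gender: str) -> str:
    """Anything other than the listed genders selects the 'other' variant."""
    return gender if gender in ("male", "female") else "other"


def format_message(user_name: str, photo_count: int, user_gender: str) -> str:
    pattern = VARIANTS[(plural_key(photo_count), gender_key(user_gender))]
    return pattern.format(userName=user_name, photoCount=photo_count)


print(format_message("Mary", 1, "female"))
# Mary added a new photo to her stream.
print(format_message("Alex", 3, "unknown"))
# Alex added 3 new photos to their stream.
```

The number of variants is the product of the selector sizes, which is the cost of keeping each translatable unit a complete sentence.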

mihnita commented 4 years ago

Regarding fallback, I don't know if Android can do this but it would be perfectly fine for example to have a fallback between fr-FR and fr-CA (before fr and en). If I'm not mistaken, Java only does "locale -> language" fallbacks with ResourceBundles.

Since Android N (3 years ago) Android does (a lot better) fallback. Some examples:

fr-CH => fr => fr-FR => fr-* => root
es-AR => es-419 => es-MX => es-US => es => es-ES => es-* (any Spanish) => root (MX and US are there because people used them before Android N to mean "Latin American Spanish")
zh-TW => zh-Hant-TW => zh-Hant => zh-Hant-* => zh-* (any region, but restricted to the ones using Hant script) => root. At no time would this fallback to zh or zh-CN (which use Hans)
en-IN => en-GB => en-001 (international) => en => en-US => en-* => root

There are also aliases (for legacy) between iw / he, id / in, fil / tl, no / nb.

Basically once you understand that there is no fallback across scripts (so no zh-TW / zh-CN, or sr-Cyrl / sr-Latn), things "just work", no need to think about it.
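A simplified sketch of the "no fallback across scripts" rule only. It is not Android's algorithm: the likely-script table is a tiny illustrative subset, `pick_resource` is a hypothetical helper, and the ordering among same-script candidates (the chains listed above) is not modeled.

```python
from typing import List, Optional

# Illustrative subset of default scripts for tags that do not spell one out.
LIKELY_SCRIPT = {
    "zh": "Hans",
    "zh-CN": "Hans",
    "zh-TW": "Hant",
    "zh-HK": "Hant",
    "sr": "Cyrl",
}


def script_of(tag: str) -> str:
    """Return the explicit script subtag, else the assumed default for the tag."""
    parts = tag.split("-")
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            return part.title()
    return LIKELY_SCRIPT.get(tag, LIKELY_SCRIPT.get(parts[0], ""))


def pick_resource(requested: str, available: List[str]) -> Optional[str]:
    """Pick a resource in the same language and script; None means 'use root'."""
    language = requested.split("-")[0]
    wanted = script_of(requested)
    if requested in available:
        return requested
    for candidate in available:
        if candidate.split("-")[0] == language and script_of(candidate) == wanted:
            return candidate
    return None


print(pick_resource("zh-TW", ["zh-CN", "zh"]))          # None
print(pick_resource("zh-TW", ["zh-CN", "zh-Hant-HK"]))  # zh-Hant-HK
print(pick_resource("sr-Latn", ["sr-Cyrl", "sr"]))      # None
```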

unicode-org / message-format-wg