unicode-org / message-format-wg

Developing a standard for localizable message strings

Create and Collect Use Cases #2

Closed · romulocintra closed this issue 3 years ago

romulocintra commented 4 years ago

We need to define "Scope" and "Pipelines" to clarify whether we are designing for developers, for translators, or for runtime efficiency.

romulocintra commented 4 years ago

IMHO the future MF API should focus on providing a low-level set of APIs that extend the built-in Intl with reusable and pluggable formatters, etc. The focus, or the target, should be: 1 - Developers/Translators, 2 - Tooling/Efficiency.

In other words, MF should be designed with developers in mind, while making it possible for i18n library authors to converge their work on the future MF and keep their tools on top of it in a smooth way. As listed above, developers/translators should be taken as the main stakeholders...

jamuhl commented 4 years ago

Agree... Regarding tooling, having a parser that parses a message to an "AST" would already be awesome, but I guess the community will come up with this anyway. So from my perspective, a message syntax, an API, and (if we don't stick to single messages) a file format supporting referencing would be a good start.

longlho commented 4 years ago

A strong vote for defining an AST from me, because:

  1. A lot of tooling (linters, debuggers, string collectors) relies on an AST.
  2. The distribution pipeline can ship the AST instead of raw strings, saving on runtime parsing (and, right now, parser code weight).

Having the community come up with the AST creates a lot of inconsistency in parsing, especially when it comes to escaping syntax characters and placeholder enforcement.
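
To make this concrete, here is a minimal sketch of what such an AST might look like, written in TypeScript. The node names and shapes are illustrative assumptions only, not the types of any existing library:

```ts
// Hypothetical AST node shapes for a localizable message.
// Names and structure are illustrative assumptions, not the
// types of any existing library.
type MessageNode =
  | { type: 'literal'; value: string } // plain text
  | { type: 'argument'; name: string } // a {name} placeholder
  | { type: 'plural'; name: string; cases: Record<string, MessageNode[]> };

// A message is a sequence of nodes. A distribution pipeline could
// ship this array as JSON, so the runtime never parses source syntax.
type Message = MessageNode[];

const greeting: Message = [
  { type: 'literal', value: 'Hello, ' },
  { type: 'argument', name: 'userName' },
  { type: 'literal', value: '!' },
];
```

Shipping nodes like these as JSON is what makes point 2 above work: the runtime only walks the tree and never parses the source syntax.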

zbraniecki commented 4 years ago

I like the AST threshold. Maybe EBNF could be a good starting point?

For reference, Fluent:

eemeli commented 4 years ago

For reference, here's the EBNF-ish PEG.js parser that messageformat uses when compiling messages into functions.

Its output is an array of AST nodes.
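
For illustration only, parsing an ICU-style plural message into an array of nodes could produce something like the sketch below. This reuses the hypothetical node shapes from earlier in the thread and is not the actual messageformat output format:

```ts
// Input (ICU MessageFormat syntax):
//   'You have {count, plural, one {# book} other {# books}}'
//
// Hypothetical parse output, as an array of AST nodes:
const parsed = [
  { type: 'literal', value: 'You have ' },
  {
    type: 'plural',
    name: 'count',
    cases: {
      one: [{ type: 'literal', value: '# book' }],
      other: [{ type: 'literal', value: '# books' }],
    },
  },
];
```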

longlho commented 4 years ago

Similarly, formatjs also uses a PEG.js parser to generate our TypeScript AST. That powers our linter and allows us to deal with translation vendors that have extra limitations on ICU.

MarcusJohnson91 commented 4 years ago

IMHO the future MF API should focus on providing a low-level set of APIs that extend the built-in Intl with reusable and pluggable formatters, etc. The focus, or the target, should be: 1 - Developers/Translators, 2 - Tooling/Efficiency.

In other words, MF should be designed with developers in mind, while making it possible for i18n library authors to converge their work on the future MF and keep their tools on top of it in a smooth way. As listed above, developers/translators should be taken as the main stakeholders...

As someone who has implemented their own Unicode API from scratch: don't couple this format-specifier syntax too closely with ICU. You'll just make it harder to implement and therefore less likely to be used.

Which brings me to my main point.

Why not just extend POSIX's positional format specifiers? E.g. `printf("%1$Gs", "Male");`
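
For context, POSIX positional specifiers let a translation reorder arguments independently of the call site. Below is a minimal TypeScript sketch of the idea; the `posixFormat` helper is a toy re-implementation made up for illustration, not a real printf, and it only handles the `%n$s` string case:

```ts
// Toy formatter supporting POSIX-style positional specifiers
// such as "%1$s" and "%2$s" (string substitution only).
function posixFormat(template: string, ...args: string[]): string {
  return template.replace(/%(\d+)\$s/g, (_, n: string) => {
    const value = args[Number(n) - 1];
    return value !== undefined ? value : '';
  });
}

// A translation can reorder arguments without touching the code:
posixFormat('%1$s sent %2$s a message', 'Ana', 'Bob');
// "Ana sent Bob a message"
posixFormat('%2$s received a message from %1$s', 'Ana', 'Bob');
// "Bob received a message from Ana"
```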

echeran commented 4 years ago

If I understand this thread correctly, the discussion of defining the AST produced from parsing a file (ex: Fluent, ICU MessageFormat) is pretty similar to my comment in the other thread about defining a data model. If so, that's good. In some cases, the parser's output AST looks a lot like the input data structure from an earlier proof-of-concept I wrote to exemplify what the data-oriented approach might look like in a dynamic language like JS.

Of course, the difference in terminology comes from the fact that ASTs are generated by parsers of a file/string syntax, while a data model is just a specification of data without regard to its syntax or source.

For parsing files with a specific syntax, the pros mentioned include the ability to reuse the grammar definition for both parsing and validation. The cons are the relative difficulty of getting the syntax right (ex: ICU MessageFormat), and the possibility that the syntax allows problems that we must guard against.

For starting from a data model, the pros are that we effectively "define the AST" (the data model) while allowing alternate syntaxes (ex: Fluent, ICU MF) to coexist. We also allow certain target-language implementations (ex: JS?) to idiomatically accept data literals as input instead of string-only input. The cons are that we need to write code for data validation, and possibly that there is no standard serialization format.
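
As a rough illustration of that data-literal point, a JS/TypeScript API might accept either form. The `formatMessage` function and node shapes below are assumptions made up for this example:

```ts
// Illustrative node shapes, as sketched earlier in the thread.
type MessageNode =
  | { type: 'literal'; value: string }
  | { type: 'argument'; name: string };

// Hypothetical API accepting either a syntax string or a data literal.
function formatMessage(
  msg: string | MessageNode[],
  params: Record<string, string>
): string {
  if (typeof msg === 'string') {
    // String input would need a runtime parser (not shown in this sketch).
    throw new Error('string parsing not implemented');
  }
  return msg
    .map((node) =>
      node.type === 'literal' ? node.value : params[node.name] ?? ''
    )
    .join('');
}

// Data-literal input: idiomatic in JS/TS and skips parsing entirely,
// but the structure must be validated by code rather than by a grammar.
formatMessage(
  [
    { type: 'literal', value: 'Hello, ' },
    { type: 'argument', name: 'userName' },
    { type: 'literal', value: '!' },
  ],
  { userName: 'Ana' }
); // → "Hello, Ana!"
```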

And it sounds like there is support for decoupling the concepts of authoring format / syntax from the runtime format / structure of input data.

Feel free to correct me on the above summary.

nbouvrette commented 4 years ago

This is my first post on this issue, but I'm having a hard time figuring out the difference between this thread and the wish list (issue #3). They both seem to have overlapping conversations, but I think what would help is to define the common use case, in terms of integration, that a syntax would need to support.

I will try to include this in my presentation next Monday, because I saw a lot of different uses of acronyms such as TMS, CAT, and AST that could represent different concepts for different audiences.

Maybe having a definition for all the terms would help?

For example, there is a lot of discussion around file types, but there could be different file types at different stages of the pipeline (one for developers and one for the TMS/linguists).

@echeran

And it sounds like there is support for decoupling the concepts of authoring format / syntax from the runtime format / structure of input data.

I'm not 100% confident about this yet, because I've been able to pilot MessageFormat using different file formats and exposing the raw syntax to linguists at a large scale. This makes for a much simpler solution, if you provide the right tools along with the syntax.

If we think there is a need for decoupling, then we should clearly highlight why, keeping in mind that most TMSs expect symmetric file input (the same number of keys in the input file and the output file). On top of this, you can create translation projects from one source language to multiple target languages, which means they would all need to keep the same number of keys. That is, unless you want to start breaking up projects into language pairs, but this can have an impact on existing processes, costing, and reporting when doing enterprise-scale localization...

Of course, you could probably work around this by creating a new file format or by leveraging existing ones that offer more flexibility (e.g. XLIFF). But we also know from experience that most TMSs provide different levels of XLIFF support, which might impact adoption negatively. The same would probably apply if a new file type were created: getting broad TMS support can take quite a while.

romulocintra commented 3 years ago

Closing in favour of #119