Closed zbraniecki closed 1 year ago
+100
What I've seen being useful was marking a placeholder / area as RTL / LTR. But that is not very often.
A lot more often was "smart detection", basically inspect the value of the parameter and "guess" what the best direction would be. See for example: https://developer.android.com/reference/android/support/v4/text/BidiFormatter
The Android solution is not ideal: the developer has to explicitly "wrap" the parameter.
Pseudocode:
...loadString(id).format( bidiWrapper(userName))
I think it should be more like
...loadString(id).format(arg)
And the string would be something like ...{userName}...
It would be ALWAYS wrapped, by default, unless explicitly disabled:
...{userName, bidi, rtl}...
...{userName, bidi, ltr}...
...{userName, bidi, none}...
...{userName, bidi, auto}...
=> same as ...{userName}...
The wrapper is smart, does not add bidi control characters if not needed.
So you are not going to see Hello {LRM}John{PDF}!
in an English string :-)
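A minimal sketch of the "smart" wrapper idea, assuming a simplified heuristic: a value is wrapped only when it contains a letter whose direction is opposite to the message direction. The name wrapIfNeeded, the regex approximation of RTL scripts, and the choice of FSI/PDI are illustrative assumptions, not what Android's BidiFormatter or any MessageFormat implementation actually does.

```javascript
// Unicode isolate controls: FIRST STRONG ISOLATE and POP DIRECTIONAL ISOLATE.
const FSI = "\u2068";
const PDI = "\u2069";

// Assumption: a coarse approximation of right-to-left letters (Hebrew, Arabic,
// Syriac, Thaana, NKo and presentation forms). Real code would use bidi classes.
const RTL_LETTER = /[\u0590-\u08FF\uFB1D-\uFDFF\uFE70-\uFEFF]/;
const LETTER = /\p{L}/u;

// True if `text` contains a letter whose direction differs from `messageDir`.
function hasOppositeLetter(text, messageDir) {
  for (const ch of text) {
    if (!LETTER.test(ch)) continue; // skip digits, spaces, punctuation
    const dir = RTL_LETTER.test(ch) ? "rtl" : "ltr";
    if (dir !== messageDir) return true;
  }
  return false;
}

// Hypothetical helper: wrap only when the value could disturb the message.
function wrapIfNeeded(value, messageDir) {
  return hasOppositeLetter(value, messageDir) ? FSI + value + PDI : value;
}

const hello = (name) => "Hello " + wrapIfNeeded(name, "ltr") + "!";
console.log(JSON.stringify(hello("John")));
console.log(JSON.stringify(hello("\u05D9\u05D5\u05D7\u05E0\u05DF")));
```

The heuristic ignores digits and neutral characters, which can also reorder in some contexts, so it is a starting point rather than a complete rule.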
I don't know what you think but to me, this type of feature seems related to text transformation (see related thread).
If I'm not mistaken Fluent handles this with function-like wrappers. I'm trying to picture a scenario where you might want to capitalize and change text direction. If we could have a standard way to transform text, it could keep this simple:
# This software is made by {brand}
يتكون هذا البرنامج بواسطة {brand, transform, {rtl, titlecase}}
If I'm not mistaken Fluent handles this with function-like wrappers.
No, Fluent implicitly wraps all placeables of type String in FSI/PDI to reset directionality.
+100 doing it by default.
Not 100% sure FSI/PDI is the right thing, I would have to spend some time experimenting. But yes to doing the right thing by default, with the ability to turn it off for false positives.
+1 to providing this by default. Note that when direction metadata is available the FSI should be replaced with the appropriate base-direction isolating control.
Note too that @zbraniecki only mentions placeables of type string, but non-string placeables can have spillover effects (for example currency values).
@mihnita yes! I imagine that we would be able to do sth like:
bundle.formatPattern(pattern, {
  userName: FluentString(user.name, { dir: "rtl" }),
});
as an option, and then, if that's provided, we can specify the directionality if it differs from the direction of the translation. If it matches, then we can skip directionality signs. If it is unknown, then we can use FSI/PDI.
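The three cases described here (direction differs, direction matches, direction unknown) can be sketched as follows. This is only an illustration under assumptions: directedString stands in for the FluentString idea above, and isolate is a hypothetical helper, not part of Fluent or MessageFormat.

```javascript
// Unicode isolate controls.
const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const RLI = "\u2067"; // RIGHT-TO-LEFT ISOLATE
const FSI = "\u2068"; // FIRST STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Hypothetical stand-in for a typed string argument with optional direction.
function directedString(value, options = {}) {
  return { value, dir: options.dir }; // dir is "ltr", "rtl" or undefined
}

// Decide what to inject around an argument placed into a message.
function isolate(arg, messageDir) {
  if (arg.dir === undefined) {
    return FSI + arg.value + PDI; // unknown direction: let the renderer guess
  }
  if (arg.dir === messageDir) {
    return arg.value; // same direction as the translation: nothing to add
  }
  return (arg.dir === "rtl" ? RLI : LRI) + arg.value + PDI; // known and different
}

const known = isolate(directedString("John", { dir: "ltr" }), "rtl");
const matching = isolate(directedString("John", { dir: "ltr" }), "ltr");
const unknown = isolate(directedString("John"), "ltr");
console.log([known, matching, unknown].map((s) => JSON.stringify(s)));
```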
@aphillips yes! In particular, we can know whether the formatter produced its result in the same language as the translation, and wrap in marks or not accordingly. In the most common scenario, where the currency formatter produces text in the same directionality as the translation, we could skip wrapping; but if we had to fall back and the currency is formatted in a different directionality, we would wrap.
A few comments:
Safari and iOS support FSI/PDI. Edge supports it as well. Windows modern APIs do support it, win32 does not.
We can also take a look at what Android does.
The BidiFormatter::unicodeWrap method takes a TextDirectionHeuristic, with several supported out of the box: android.text.TextDirectionHeuristics
At runtime that method looks at the value of the parameter, and adds the proper BiDi control characters, a bit smarter than just FSI / PDI, or first strong, or any "fixed" approach.
Dart has two methods, one "wraps" using BiDi control characters, the other one uses HTML tags: https://api.flutter.dev/flutter/intl/BidiFormatter-class.html
I'm not advocating for any of these "as is", just submitting them as "prior art" and source of inspiration.
But if we go with this direction then I would call the wrappers inside of MessageFormat, not force the developers to wrap parameters by explicitly calling these kind of helper methods.
Can we revisit it now?
It seems like we still didn't add it to the spec. I suggest that we by default wrap any placeable in FSI/PDI marks, just like Fluent does it, in line with W3C recommendation for placeables - https://www.w3.org/International/articles/inline-bidi-markup/
We can introduce evasion logic that allows us to explicitly turn off FSI/PDI for a given message format as an option to communicate request to format a message without inserting FSI/PDI.
Finally, we could start building evading logic for scenarios where the directionality of the surrounding text and the placeable is known to match. For example number/date inserted in the same locale as a surrounding message does not need FSI/PDI. Similarly, a string inserted could be marked with explicit directionality:
let mf = new MessageFormat("en");
mf.format("Hello, { $user }", { user: MFString("John", { dir: "ltr" }) });
or as matching:
let mf = new MessageFormat("en");
mf.format("Hello, { $user }", { user: MFString("John", { dir: "matching" }) });
In the former case the algorithm will detect the directionality of "en", and if the directionality of the MFString matches it, it'll evade FSI/PDI. In the latter it will evade it automatically.
@mihnita @stasm @eemeli
I'd like to suggest making a decision on it very soon. In my experience a lot of API users are not familiar with the problem space of directionality and the body of code starts growing where people expect to be able to match the output to a particular string and are surprised when FSI/PDI shows up in the output.
With Fluent we had to do quite a bit of evangelism - it was always well received, but definitely a paper cut.
I'm concerned that if we wait too long the argument of "too late" will pop up.
I tend to agree with @zbraniecki in general: to the degree possible this wants to be hidden in the "magick I18N stuff" and not be something regular developers have to think about all the time. Educating on bidi handling is hard and doesn't appear to add value until a company decides to do an RTL language.
However, I don't agree that inserting FSI/PDI is what W3C recommends. In markup contexts, we prefer that markup be used and include both language and direction metadata (i.e. both lang and dir attributes). We also prefer that the actual direction (e.g. LRI or RLI) be used whenever it is available. This both prevents spillover (due to isolation) and avoids problems with strings that have misleading strong directional characters at/near the start. We are spending significant effort in the W3C stack and possibly with ECMA-262 to try to get "localizable strings" to be first class citizens so that metadata can be scraped automagically for placeable values.
For formatted values (that is, where the placeable is a number, date, time, percent, currency value, etc. that is generated by the message formatter) the base direction can be known from the locale. For unknown values (mainly strings), provision of metadata is required and FSI/PDI can be a fallback.
Note that some users may want to tailor the behavior because of their runtime environment, such as a few frameworks that don't yet support the isolating controls and show them as tofu. In these cases, RLM/LRM and embedding controls can be inserted as a shim. Others may want to turn off control generation because they are using a templating language or system that does the work for them.
Ah, good point on lang+dir, rather than just dir.
I think you're bringing two separate dimensions, which I'd categorize as:
1) What information we provide about placeables 2) How we annotate
I'll use the following example: "On January 15th 2022 at 5:45pm, Addison added 5 photos" which in MF2 will look something like this:
let $dateTime = {$timestamp :datetime date=medium time=medium}
let $personName = {$person :person firstName=long}
let $count = {$photoCount :number}
match {$count}
when 1 {On {$dateTime}, {$personName} added { $count } photo.}
when 0 {On {$dateTime}, {$personName} added { $count } photos.}
There are three placeables in this message and we may know the locale of the message itself (or not - is it possible for the lang/dir of the message to be undetermined via new MessageFormat("und")?).
If dateTime is resolved into the same dir/lang as surrounding message we don't want to annotate, but if the message is in Arabic, but DateTimeFormat doesn't have Arabic data and resolves to English, we should annotate at least with directionality:
On {\uLRI}January 15 2022 at 5:45pm{\uPDI}, Addison added 5 photos.
(we use LRI because we know that datetime is in English, and we either know that the whole message is in Arabic or it is unknown)
For the user name, we may have an API that informs in what lang/dir is the name provided and then compare it to the message lang/dir, or we may not know. If we do, and it differs, we can do the same as with date - LRI/RLI and PDI to pop. If we don't we can use FSI/PDI. If it doesn't differ we don't inject any.
For $count we repeat the same logic as we did for datetime.
Now, as mentioned in my previous message, the tricky question is how the developer annotates lang/dir of the variable. I suggested that MF2 provide typed variable types much like Fluent does with FluentDateTime, FluentNumber etc. This would allow for MF2String("Addison", {lang: "en"}) as optional (if omitted we'll use FSI/PDI).
Second question is how to control what we inject. My initial proposal is something like this:
let mf = new Intl.MessageFormat("en", {
  isolates: {
    lri: "\u2066", // LRI, or MF2MarkupElement("bdo", {dir: "ltr"})
    rli: "\u2067", // RLI, or MF2MarkupElement("bdo", {dir: "rtl"})
    fsi: "\u2068", // FSI, or MF2MarkupElement("bdo", {dir: "auto"})
    pdi: "\u2069", // PDI, or MF2MarkupElementClose("bdo")
  }
});
This way HTML bindings can provide MarkupElements for the same feature, and plain text can use the Unicode isolate characters. If LRI/RLI is set to null then FSI is used. If FSI/PDI is set to null, then nothing is ever injected.
This means that by default (if isolates is not explicitly provided) the API will inject unicode marks and frameworks can override them.
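The fallback rules for the proposed isolates option could look roughly like this. It is a sketch only: makeIsolator and DEFAULT_ISOLATES are hypothetical names, the option shape mirrors the proposal above, and only plain strings are handled (markup elements are left out).

```javascript
// Default to the Unicode isolate controls (LRI, RLI, FSI, PDI).
const DEFAULT_ISOLATES = {
  lri: "\u2066",
  rli: "\u2067",
  fsi: "\u2068",
  pdi: "\u2069",
};

// Build a wrapper function from an isolates configuration.
function makeIsolator(isolates = DEFAULT_ISOLATES) {
  return function wrap(value, dir) {
    // FSI/PDI set to null: nothing is ever injected.
    if (isolates.fsi == null || isolates.pdi == null) return value;
    const open =
      dir === "ltr" ? isolates.lri :
      dir === "rtl" ? isolates.rli :
      isolates.fsi;
    // LRI/RLI set to null: fall back to FSI.
    return (open ?? isolates.fsi) + value + isolates.pdi;
  };
}

const wrapDefault = makeIsolator();
const wrapFsiOnly = makeIsolator({ ...DEFAULT_ISOLATES, lri: null, rli: null });
const wrapNothing = makeIsolator({ ...DEFAULT_ISOLATES, fsi: null, pdi: null });
console.log(JSON.stringify(wrapDefault("Addison", "ltr")));
```

A framework could pass its own opening and closing strings in the same shape; whether those should be strings or structured markup elements is the open question raised above.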
What this doesn't resolve is that in ideal world a message like: Hello {strong}{$name}{/strong} would resolve to Hello <strong dir="auto">Addison</strong> rather than to Hello <strong><bdo dir="auto">Addison</bdo></strong>.
We may later evolve the logic to allow for population of attributes in cases where markup element is perfectly surrounding a placeable and we want to set dir/lang.
Same as before, +100 :-)
But now, with a lot more things already "settled", I think we can dig deeper on what can / can't be done.
I've been thinking about it, and we probably need to answer some sub-questions.
What to add, exactly?
What can a low level library use to wrap placeholders? The result might be used as plain text, or html, or something completely different.
Unicode control characters? HTML recommends using tags, not control characters.
HTML tags? We don't know if the consumer of the result understands HTML.
And we don't even know what kind of tags to insert.
A block kind of tag (div), or inline one (span)?
And should it be <span dir="...">, or a <bdi>, or something else?
Even if HTML, should these be "events" (open tag, content, end tag) or DOM subtree (tag + content as child)
So I think the only thing that the spec can really say is put this info (somehow) in the "format to parts" (this chunk from here to there is RTL). And leave it to a different layer to adapt the result for final consumption (control chars, html tags, something else).
And what part of the "chain" can do it.
Is it the custom function?
Or is it the engine?
Or a post-processing step, after .format (or .formatToParts) is invoked?
If the engine does it, all it can do when it sees ... pre ... {$ph :func} ... post ... is something like this:
... pre ...
:func
... post ...
I don't think that is a good model. It still leaves some "guessing", and only deals with "the outside" of things.
I think we want to allow for functions that in fact generate multiple components.
Let's think HTML...
And have a matrix formatter, that produces a table. Or a list formatter that produces a drop-box. Or even a regular <p> with <span> in it.
The elements inside the result should also be wrapped.
"You have emails from {$people :listformat}..."
would probably have to result in
<p dir="ltr">You have emails from <bdi>person 1</bdi>, <bdi>person 2</bdi>, and <bdi>person 3</bdi>...</p>
Maybe <bd>, maybe <span>; that's not the issue here. The issue is, each item needs to be wrapped.
Which the engine can't really do reliably.
So I think this can only be done properly by the functions.
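To illustrate why the function is the natural place for this, here is a rough sketch of a list-formatting function that isolates each item itself, something the engine cannot do from the outside. It relies on the standard Intl.ListFormat.formatToParts API; the helper name formatListIsolated and the use of FSI/PDI for plain text are assumptions (an HTML binding might emit <bdi> per item instead).

```javascript
const FSI = "\u2068"; // FIRST STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Hypothetical formatting function: isolates every list element, but leaves
// the locale-provided separators ("literal" parts) untouched.
function formatListIsolated(locale, items) {
  const lf = new Intl.ListFormat(locale, { style: "long", type: "conjunction" });
  return lf
    .formatToParts(items)
    .map((part) => (part.type === "element" ? FSI + part.value + PDI : part.value))
    .join("");
}

const people = ["person 1", "person 2", "person 3"];
const message = "You have emails from " + formatListIsolated("en", people) + "...";
console.log(JSON.stringify(message));
```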
Do translators need to be able to change this, or not?
I would argue that yes, they need to.
If I have image tags in a string "To register with <img src="company_logo.jpg"> see <img src="next.jpg">"
You need a human to say "it's OK to flip the second image (next), but not the first one (company logo)".
The developer might know "ok, don't mirror the company logo", but you need the translator to tell you about the second one.
My proposals after this round of thinking:
src, alt), but there are some universal ones (dir, lang, etc).
<strong><bdo dir="auto"> => <strong dir="auto">? (Addison's point)
Of course, if an implementation is not in a generic library like ICU, but very specific to produce HTML (in a browser), then some of the steps might be short-circuited (produce HTML tags / DOM directly, without format to parts + post-process).
Each string/substring should have a language and direction attribute (note that this is what W3C I18N is asking TC39 for with the maybe-terribly-named Localizable proposal). A formatToParts can produce a sequence of Localizable that the consumer can use to generate controls or HTML markup as needed.
I suspect that MF's format (i.e. formatToString rather than parts) should probably have a couple of modes, one of which is "do nothing" (just make a string and do not generate controls) and one of which is "plain-text" (i.e. generate isolate controls as needed).
Note that dir only has three potential values: ltr, rtl, and auto (first-strong/don't-know). Isolation should be the default vs. embedding. I'll have more detailed thoughts in a bit.
three potential values: ltr, rtl, and auto
Ack, thanks.
Localizable
After a very-very superficial scan (can't call it read) of the Localizable proposal, and with the disclaimer that I don't "grok" the relation between W3C, WebIDL, and ECMAScript, or what WebIDL is really trying to do :-)
These are my quick impressions:
Can't direction be derived from locale?
WebIDL seems to be (mostly) "Unicode unaware / agnostic"
DOMString => "commonly interpreted as UTF-16 encoded strings ... although this is not required."
ByteString => "might be interpreted as UTF-8 encoded strings ... although this is not required."
USVString => The only one that seems to be guaranteed Unicode, but its use seems to be discouraged (see the Warning)
Which makes these strings kind of useless for l10n / i18n.
Should the Localizable be explicit that it uses some kind of Unicode encoding? Which is even more important than the locale and direction (maybe it is saying that and I've missed it)
If there is resistance to Localizable, would it be an option to use Annotated types to express locale and direction. And (even more important in my opinion) the fact that the string annotated is Unicode?
Let me know if you think these points help in any way, and where should I cut / paste them (because it is clear they don't belong here :-)
If I correctly interpret what @mihnita @aphillips wrote below my last response we agree on the value and considerations.
The only item I'd like to clarify is if @mihnita believes that formatToString should return the isolation marks or not (you say that the bidi/lang system should annotate parts, but I don't see your position on the string output).
The question is - what are the next steps? As I mentioned above, I'm concerned about Tech Preview being released without this and I'd like to make sure we don't have any more releases (even if they remain TP) that make testers work with MF output without this feature.
Can't direction be derived from locale?
Not entirely. Language information can be used as a fallback when no direction information is available, but we don't think it is a good general solution.
WebIDL seems to be (mostly) "Unicode unaware / agnostic"
It seems that way because of JavaScript's historical (and misguided) ambivalence about saying that strings consist of Unicode code points. In reality, the three types @mihnita cites have a clear relationship to their respective representations.
The point of Localizable would be to create a type, class, or commonly shared data structure (via a "dictionary" definition) that specifications could just use. The "value" portion of a Localizable would be the text bearing string and each string would also have a lang and dir attribute. That way one could write:
<!-- for some variable value "myVar" -->
<p lang="$myVar.lang" dir="$myVar.dir">$myVar.value</p>
There already exist mappings for RDF and as-a-string serialization schemes in JSON-LD and a number of specifications use what amounts to Localizable as a JSON representation. A proposed definition for Localizable exists in our document String-Meta at this location
If there is resistance to Localizable, would it be an option to use Annotated types to express locale and direction.
Yes! This is entirely an option that is on the table. We would need some group to publish a normative spec (in W3C terms, a "Recommendation" or REC-track document) with the "dictionary" in it which specs could refer to normatively. This is what we asked WebIDL to do, but they "only model things that exist in JavaScript", hence my detour to ask TC39 to make a Localizable type. If we think that a Localizable or "natural language string" type in JavaScript proper would be useful for I18N generally (and it certainly would make it easy for developers to use it vs. writing a data structure), then we should push for it. I suspect, though, that the headwinds are going to be strong.
@zbraniecki noted:
The only item I'd like to clarify is if @mihnita believes that formatToString should return the isolation marks or not (you say that the bidi/lang system should annotate parts, but I don't see your position on the string output).
As I mentioned, it could be optional and I suspect it should be optional. Control characters insertion could also be added later, since most consumers probably don't introspect inside strings to find directional boundaries. That is, it might not be a blocker for the preview, but would be Very Nice To Have (compare to current MF, which does nothing). Current formatters, such as NumberFormat, only handle bidi issues internal to the formatted string value (cf. the thread with Peter Edberg about currency formats which various Amazon folk have commented on), but bidi isolation of placeables, including in MessageFormat, is up to the pattern string and implementer. (For an example, look at Amazon's internal I18N utilities library for BidiFormat and friends)
I think another interesting question is: does formatToParts provide controls or does it provide metadata (and you insert your own controls or markup)? Notice that if formatToString provides controls and formatToParts does not, then that would mean that the two do not produce equivalent code point sequences when concatenating the parts together.
I think that formatToParts would produce a (standardized) meta that can be converted by a processing step to controls, html tags, something else, or nothing.
And formatToString can be implemented by just iterating the parts from formatToParts and appending to a string buffer, ignoring some parts.
So if there is a part saying "from here to there we have a bidi isolate", formatToString can choose to ignore that info, or produce control characters.
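As a sketch of that idea, the string output can be derived from the parts, with an option deciding whether direction metadata is ignored or turned into controls. The part shape ({ type, value, dir }) and the option values "none" and "plain-text" are assumptions made for this illustration; nothing here is a specified API.

```javascript
// Opening isolate control per direction: LRI, RLI, FSI.
const ISOLATE_OPEN = { ltr: "\u2066", rtl: "\u2067", auto: "\u2068" };
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Hypothetical conversion from parts to a string.
// bidi: "none" ignores the metadata, "plain-text" emits isolate controls.
function partsToString(parts, { bidi = "plain-text" } = {}) {
  let out = "";
  for (const part of parts) {
    if (bidi === "plain-text" && part.type === "placeholder") {
      out += ISOLATE_OPEN[part.dir ?? "auto"] + part.value + PDI;
    } else {
      out += part.value; // literal text, or metadata ignored
    }
  }
  return out;
}

const parts = [
  { type: "text", value: "Hello, " },
  { type: "placeholder", value: "John", dir: "ltr" },
  { type: "text", value: "!" },
];
console.log(JSON.stringify(partsToString(parts, { bidi: "none" })));
console.log(JSON.stringify(partsToString(parts)));
```

With this layering the two outputs differ only by the injected controls, which is the equivalence concern raised earlier in the thread.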
For a low level library like ICU that should probably be an option and decided by the developer calling it (or the layers built on top of it). Probably would be good to do the same for ICU4X.
In recent years it looks like ICU is going in that direction.
For example LocalizedNumberFormatter.format returns a FormattedNumber, and there is no API that returns a string directly (similar to formatToString). You need to explicitly call toString on the result to get a string result.
And FormattedNumber has methods like getGender(), getNounClass(), getOutputUnit() and ways to iterate the "parts" (nextPosition(ConstrainedFieldPosition) and AttributedCharacterIterator toCharacterIterator()).
It looks very much like an "unpolished form" of formatToParts.
I hope we can improve things a bit with MF2.
And I think that defining the result of formatToParts is some other issue we need to revive :-)
My take on this, less verbose, and maybe more clear:
does formatToParts provide controls or does it provide metadata
I think my answer would be metadata.
then that would mean that the two do not produce equivalent code point sequences when concatenating the parts together
I think we should not concatenate parts and strings. Ideally each formatter function would return parts. MessageFormat would concatenate plain text (wrapped in a part) & parts returned by formatters. And the final conversion from parts result to string would iterate the parts to generate string result.
The question is: how do we invoke older formatters which already return strings with controls? Without thinking too much about it (so it might not be a good idea), I'd say we need to wrap those functions in something that looks like the MF2 function signature, so that it returns parts. That "wrapper" would take the legacy string and convert it to parts, with meta for bidi info.
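A rough sketch of such a wrapper, using the standard Intl.NumberFormat as the "legacy" formatter. The function signature, the part shape, and the way the direction is supplied by the caller are all assumptions for illustration; the MF2 function signature is not defined here. Controls that the legacy formatter places inside its output are left untouched in the value.

```javascript
// Hypothetical adapter: turns a string-returning formatter into one that
// returns parts carrying direction metadata for the exterior isolation.
function wrapLegacyFormatter(legacyFormat, dir) {
  return function formatToParts(value, options) {
    return [
      {
        type: "placeholder",
        value: legacyFormat(value, options), // interior controls stay as they are
        dir, // "ltr", "rtl" or "auto"; supplied by the caller in this sketch
      },
    ];
  };
}

// A "legacy" formatter that only knows how to produce a string.
const legacyCurrency = (amount, { locale, ...rest }) =>
  new Intl.NumberFormat(locale, rest).format(amount);

const currencyParts = wrapLegacyFormatter(legacyCurrency, "ltr");
const result = currencyParts(12.99, { locale: "en", style: "currency", currency: "USD" });
console.log(result);
```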
A few quick comments.
I'm in agreement that the data model should carry enough information to add extra info for directional formatting where necessary. When formatting to a plaintext string, that would be using the Unicode directional characters, but when formatting into other formats (such as HTML) those mechanisms can come into play. The choice of mechanism would depend on the formats.
On Fri, Oct 28, 2022 at 7:43 PM Zibi Braniecki @.***> wrote:
Ah, good point on lang+dir, rather than just dir.
I think you're bringing two separate dimensions, which I'd categorize as:
- What information we provide about placeables
- How we annotate
I'll use the following example: "On January 15th 2022 at 5:45pm, Addison added 5 photos" which in MF2 will look something like this:
let $dateTime = {$timestamp :datetime date=medium time=medium}
let $personName = {$person :person firstName=long}
let $count = {$photoCount :number}
match {$count}
when 1 {On {$dateTime}, {$personName} added { $count } photo.}
when 0 {On {$dateTime}, {$personName} added { $count } photos.}
There are three placeables in this message and we may know the locale of the message itself (or not - is it possible for the lang/dir of the message to be undetermined via new MessageFormat("und") ?).
While that is theoretically possible, in practice there should always be a specific locale (at least to the lang code) for any message if there are any placeholders that require formatting. We can't, however, know the base direction of the message, because that would depend on the context in which it is being used.
If dateTime is resolved into the same dir/lang as surrounding message we don't want to annotate, but if the message is in Arabic, but DateTimeFormat doesn't have Arabic data and resolves to English, we should annotate at least with directionality:
On {\uLRI}January 15 2022 at 5:45pm{\uPDI}, Addison added 5 photos.
(we use LRI because we know that datetime is in English, and we either know that the whole message is in Arabic or it is unknown)
(Minor) I don't think that is a realistic scenario. If a system is supporting a language like Arabic in messages, then it would surely support the basic i18n functionality for Arabic. That would be a terrible UI for users.
On the other hand, having to shift scripts/directions for names would be a realistic example, so I think it would be better to focus on that in your scenario.
For the user name, we may have an API that informs in what lang/dir is the name provided and then compare it to the message lang/dir, or we may not know. If we do, and it differs, we can do the same as with date - LRI/RLI and PDI to pop. If we don't we can use FSI/PDI. If it doesn't differ we don't inject any.
For $count we repeat the same logic as we did for datetime.
Again, numbers are so basic that this (in a well designed system) shouldn't occur.
Now, as mentioned in my previous message, the tricky question is how the developer annotates lang/dir of the variable. I suggested that MF2 provide typed variable types much like Fluent does with FluentDateTime, FluentNumber etc. This would allow for MF2String("Addison", {lang: "en"}) as optional (if omitted we'll use FSI/PDI).
Note, now that https://www.unicode.org/reports/tr35/tr35-67/tr35-personNames.html#Contents is out, I'd recommend using examples from that (and it would be great to get comments on it). The algorithm for formatting does depend on either receiving the explicit locale of the name to be formatted, or imputing it. Not sure it would be a good idea to carry an imputed locale into the MF2 data model.
Second question is how to control what we inject. My initial proposal is
something like this:
let mf = new Intl.MessageFormat("en", {
  isolates: {
    lri: "\u2066", // LRI, or MF2MarkupElement("bdo", {dir: "ltr"})
    rli: "\u2067", // RLI, or MF2MarkupElement("bdo", {dir: "rtl"})
    fsi: "\u2068", // FSI, or MF2MarkupElement("bdo", {dir: "auto"})
    pdi: "\u2069", // PDI, or MF2MarkupElementClose("bdo")
  }
});
This way HTML bindings can provide MarkupElements for the same feature, and plain text can use the Unicode isolate characters. If LRI/RLI is set to null then FSI is used. If FSI/PDI is set to null, then nothing is ever injected.
This means that by default (if isolates is not explicitly provided) the API will inject unicode marks and frameworks can override them.
As I think Mihai noted, exactly how the injection would work would depend a great deal on the end environment. It might be better to consider that the Process that manipulates the data model to produce something other than plaintext (eg to produce HTML) needs to have enough information about the placeholders to determine whether they need directional structure (eg markup) or not.
Attributes
What this doesn't resolve is that in ideal world a message like: Hello {strong}{$name}{/strong} would resolve to Hello <strong dir="auto">Addison</strong> rather than to Hello <strong><bdo dir="auto">Addison</bdo></strong>.
We may later evolve the logic to allow for population of attributes in cases where markup element is perfectly surrounding a placeable and we want to set dir/lang.
I'm concerned about Tech Preview being released without this
I'm confused. The Tech Preview was already released, about 10 days ago. Do you mean before the production version is released?
@mihnita The older formatters return controls to elicit proper ordering of ambiguous sequences within a formatted string, such as a number (especially currency values) or date. The formatters do not provide exterior wrapping/isolation to prevent spillover effects (which is what we're talking about here).
@macchiati I don't agree that:
We can't, however, know the base direction of the message, because that would depend on the context in which it is being used.
We need to know the base direction of the string, since the string itself is a placeable into its rendering context. When messages don't have a base direction, they are subject to spillover effects or wrong base direction detection, particularly if they start with a misleading strong character. Worst-case, we can use first-strong. I suppose that this might be the realm of a higher-level protocol, such as a resource language. But if strings don't have a base direction, we won't know how to decorate them automagically to get the right results. Inferring the base from the language is possible if that's all we have.
The following examples can be test driven on this demo page. The Arabic pattern means roughly "price {x} + {y} shipping!"
First, placeables need isolation to avoid string-internal spillover effects. If you paste this string into the text box (this is also one of the examples in the list box at the top of the page):
<span style='color:blue'>\u0627\u0644\u0633\u0639\u0631</span> <span class=magenta>1,234.56 AED</span> + 12.99 USD \u0627\u0644\u0634\u062d\u0646!
You get:
Adding a dir attribute to the price values (the placeables that message format might generate) produces the proper isolation (you can use Unicode controls instead of a span with a dir attribute):
<span style='color:blue'>\u0627\u0644\u0633\u0639\u0631</span> <span class=magenta dir=auto>1,234.56 AED</span> + 12.99 USD \u0627\u0644\u0634\u062d\u0646!
If we don't know the base direction of the whole string, though, then when we insert it into a page we can get spillover effects that are unwanted. Let's simulate that by putting an opposite direction (English) wrapper around the string:
We promised: "<span style='color:blue'>\u0627\u0644\u0633\u0639\u0631</span> <span class=magenta>1,234.56 AED</span> + 12.99 USD \u0627\u0644\u0634\u062d\u0646!"
... which produces the thoroughly broken:
Fixing the interior placeables helps:
We promised: "<span style='color:blue'>\u0627\u0644\u0633\u0639\u0631</span> <span class=magenta dir=auto>1,234.56 AED</span> + <span dir=auto>12.99 USD</span> \u0627\u0644\u0634\u062d\u0646!"
... but still leaves the exclamation point on the wrong side (other effects can be produced with other strings):
@mihnita The older formatters return controls to elicit proper ordering of ambiguous sequences within a formatted string, such as a number (especially currency values) or date. The formatters do not provide exterior wrapping/isolation to prevent spillover effects (which is what we're talking about here).
I know. But the old formatters can't be just "plugged" as is in the MF2.
There has to be a wrapper function implementing the com.ibm.icu.message2.Formatter interface (or com.ibm.icu.message2.Selector for selectors).
That is needed because it has to take the "bag of options" from the placeholders and map them to the proper settings in the old ICU formatters.
That's what I meant when I said
we need to wrap those functions in something that looks like the MF2 function signature. ... And that "wrapper" would take the legacy string, and convert to parts, with meta for bidi info
So the wrappers would have to add the "exterior wrapping/isolation". There is still the question of how to convert the interior bidi chars to some kind of meta on "parts".
@mihnita Can you clarify:
There is still the question of how to convert the interior bidi chars to some kind of meta on "parts"
If each "part" has a base direction property, that's enough to implement wrapping the "part" with either isolating controls or isolating markup. Interior bidi characters, including controls, will still be needed for ambiguous situations. For example, ar
short dates include RLMs to ensure proper sequencing:
\u0661\u200f/\u0661\u0661\u200f/\u0662\u0660\u0662\u0662, \u0661\u0660:\u0662\u0663 \u0635
vs. with the U+200F's removed:
(because /
is weakly directional and numbers are weakly LTR)
The whole date formatter output string has a base direction of RTL, so a message "formatted to parts" with an Arabic locale formatted date and an (untranslated??) English message might look like this as pseudo-JSON:
"messageResult": [
{ // part 0
"lang": "und",
"dir": "ltr",
"value": "You will receive your shipment on "
},
{ // part 1
"lang": "ar",
"dir": "rtl",
// value has interior controls but not exterior ones
"value": " \u0661\u200f/\u0661\u0661\u200f/\u0662\u0660\u0662\u0662, \u0661\u0660:\u0662\u0663 \u0635"
},
{ // part 2
"lang": "und",
"dir": "ltr",
"value": "."
}
]
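To make the "isolating markup" option concrete, here is a minimal Python sketch that turns a parts list like the one above into HTML spans carrying `lang` and `dir`. The function name `parts_to_html` and the dict shape are assumptions for illustration only, not a proposed API; it relies on the `dir` attribute giving each span its own isolated bidi context in HTML.

```python
from html import escape


def parts_to_html(parts):
    """Render each part as a <span> carrying its own lang/dir metadata.

    Hypothetical helper: `parts` is a list of dicts with "lang", "dir"
    and "value" keys, mirroring the pseudo-JSON above.
    """
    spans = []
    for part in parts:
        spans.append(
            '<span lang="{}" dir="{}">{}</span>'.format(
                escape(part["lang"], quote=True),
                escape(part["dir"], quote=True),
                # Interior controls such as U+200F stay inside the value.
                escape(part["value"]),
            )
        )
    return "".join(spans)


message_result = [
    {"lang": "und", "dir": "ltr", "value": "You will receive your shipment on "},
    {"lang": "ar", "dir": "rtl", "value": "\u0661\u200f/\u0661\u0661\u200f/\u0662\u0660\u0662\u0662"},
    {"lang": "und", "dir": "ltr", "value": "."},
]
html_output = parts_to_html(message_result)
```

The interior RLMs are left untouched; only the exterior wrapping is added by the caller.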
I don't think I was clear about what I meant. I agree that placeholders often need wrapping — that isn't in question. But let's take an example of a message that has embedded placeholder components, where each of those components can also have embedded components:
John Smith purchased stock in NYSE:F for $3.21M on Tuesday, March 3.
In this example there can actually be a reasonably deep structure of embedded components:
{{John} {H.} {Smith}} purchased stock in {NYSE:F} for {${{3.21}M}} on {{{Tuesday}, {March} {3}} at {11:57}}.
The question is for a given embedded component, should the component wrap itself, or should the embedder wrap the component? The component doesn't necessarily know anything about characters that will be surrounding it when embedded, so it can't necessarily know whether it needs wrapping or not. The embedding structure can, however, easily determine what the characters in any subcomponent are, when embedding it.
I think that a given component should take care of its interior needs and then expose its own base paragraph direction. That way, if all the directions align you don't get extra characters providing unnecessary levels of isolation and the component doesn't need to know or be told its context--it just needs to report the base paragraph direction of its output (which it already knows).
Your example doesn't make that much sense to me: a date format or compact decimal format in a given locale will be assembling a string with a single base direction and the tokens it emits will be in a specific language. There can be local considerations (my example with RLM on the date 1/11/2022
above), which the formatter should take care of by emitting a string that is "display ready" in its base direction. Isolation is not a panacea here: isolating the subformat tokens (month, day, year, etc.) in ١/١١/٢٠٢٢, ١٠:٢٣ ص
does not result in a correct bidi string unless the whole thing is also wrapped and the RLMs are more effective.
If you take your example and turn it into an MF pattern string:
{name,name,full} purchased stock in {stock} for {price,compact-short, currency} on {date,date,::EEEEMMMMd} and {date,date,::jm}.
... and its (Google-translated) Arabic friend:
اشترى {name} مخزونًا في {stock} مقابل {price} في {date} و {date}.
If each formatter function reports the base direction of its output string (e.g. John H. Smith
is ltr
) then the parent formatter (message format in this case) can use that to decide to wrap the string with controls (or markup). PersonalNameFormat
takes care of the insides of "John H. Smith", NumberFormat
takes care of "$3.21M", and DateFormat
takes care of "Tuesday, March 3" and "11:57 am". This makes implementation fairly simple: you only have to worry about whether/how to isolate whole strings that you are given.
It might be useful to approach this by figuring out what the to-string formatted output of MF2 should be.
We may have some parts of the output for which we can know the directionality (e.g. literal text in the parent locale or {$count :number}
) and others for which we might not be sure (e.g. {$name}
). Should the inclusion of isolating marks between such parts be something required by default? And if not, what about cases where we know that the directionality of adjacent parts is different?
Should the inclusion of isolating marks between such parts be something required by default?
Yes. And I believe we're converging on this consensus among all stakeholders in this thread.
MF2 should make it an extra step to produce multi-directional string output without isolation marks. By default it should use the information it has about placeholder positions to isolate at boundaries.
@zbraniecki How about cases where we know the directionality matches?
For example, in an en-US
context, we can presume that both literal text and the string representation of a placeholder like {$count :number}
are both LTR. Should we require isolation even in this case, or could we allow for an implementation to leave it out?
How about cases where we know the directionality matches?
Those should be exempted from marks.
For example, in an
en-US
context, we can presume that both literal text and the string representation of a placeholder like {$count :number} are both LTR.
It's a bit more tricky actually. We should evaluate whether the number formatter used to format $count
has the same directionality as the main text. If so, we can skip.
Also, as Addison pointed out, we may want to evaluate language information alongside direction. I'm a bit less clear on what exactly this meta information should look like, but I imagine that we could have an en-CA
text with a relative time format placeholder using en-US
and may want to mark it as lang=en-US
. @aphillips - is that something you'd like to suggest, or just that if the placeholder is a variable from the developer (say, user name, or proper name) and is marked as lang=fr
we should mark lang of that placeholder to be fr
, but if it's about I18n formatter, we don't need to separate out lang information?
Could we first figure out the absolute minimum that's required in the MF2 spec for formatted string output? That we're all agreed on as being a part of the base layer, while e.g. the shape of the formatted parts might well end up getting defined by specifications building on top of it.
Maybe something like this?
Where appropriate, the formatted string representation of a message MUST isolate message parts that may have different directionality than the message as a whole. Such a part MUST be prefixed with an explicit isolate character:
- LEFT-TO-RIGHT ISOLATE U+2066 if the part is known to have LTR directionality,
- RIGHT-TO-LEFT ISOLATE U+2067 if the part is known to have RTL directionality, or
- FIRST STRONG ISOLATE U+2068 if the part's directionality is not certain.
In all cases, the part MUST be postfixed with a corresponding POP DIRECTIONAL ISOLATE U+2069 character.
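As a rough illustration of the wording above, here is a small Python sketch. The `isolate` helper and its `direction` argument ("ltr", "rtl", or `None` for "not certain") are made up for this example; only the four code points come from the proposal.

```python
# Unicode bidi isolate controls named in the proposed wording.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap `text` in the isolate pair matching its known direction.

    `direction` is "ltr", "rtl", or None when the direction is not
    certain (a convention assumed for this sketch only).
    """
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    # Every opener is closed by the same POP DIRECTIONAL ISOLATE.
    return opener + text + PDI


# A user name of unknown direction gets FSI ... PDI.
greeting = "Hello " + isolate("John") + "!"
```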
Such wording would require a part sequence like LTR/RTL/RTL to include an unnecessary PDI + RLI character pair between the RTL parts if the message as a whole is LTR. Should that be optimised out?
Such wording would require a part sequence like LTR/RTL/RTL to include an unnecessary PDI + RLI character pair between the RTL parts if the message as a whole is LTR. Should that be optimised out?
It doesn't work that way. If you have a base paragraph direction string that is LTR and you have two consecutive RTL insertions, you want isolation in between them to prevent spillover effects. Consider this example:
السعر 1,234.56 AED 12.99 USD الشحن
This has two placeable strings ("1,234.56 AED" and "12.99 USD") with only a space between them. Without isolation they draw like:
With isolating controls they draw correctly without spillover effects:
The only time that isolating markup or controls can be omitted safely is when:
(i) the placeable and the host string have the same base direction (ii) and either all characters in the placeable have the same base direction or the first and last characters are strong "same direction" as the "base direction".
This is why unknown strings need FSI/PDI around them.
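The two conditions above could be checked roughly like this in Python. This is only a sketch: `can_omit_isolation` is a hypothetical name, direction values are assumed to be "ltr", "rtl", or `None` for unknown, and treating the empty string as safe is my own assumption. It uses the bidi classes reported by the standard `unicodedata` module.

```python
import unicodedata

# Bidi classes that count as "strong" for each base direction.
STRONG = {"ltr": {"L"}, "rtl": {"R", "AL"}}


def can_omit_isolation(placeable, placeable_dir, host_dir):
    """Return True only when conditions (i) and (ii) both hold."""
    if not placeable:
        return True  # assumption: nothing to isolate
    # (i) the placeable and the host string share a known base direction.
    if placeable_dir is None or placeable_dir != host_dir:
        return False
    strong = STRONG[host_dir]
    classes = [unicodedata.bidirectional(ch) for ch in placeable]
    # (ii) all characters are strong in the base direction, or at least
    # the first and last characters are.
    all_same = all(c in strong for c in classes)
    edges_strong = classes[0] in strong and classes[-1] in strong
    return all_same or edges_strong
```

With this check, "12.99 USD" in an RTL host still needs isolation even when its own base direction is RTL, because it starts with a digit rather than a strong RTL character. A string of unknown direction never passes condition (i), which is the FSI/PDI case.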
@aphillips: If you have a base paragraph direction string that is LTR and you have two consecutive RTL insertions, you want isolation in between them to prevent spillover effects. Consider this example:
Ah, had not played around with that example; thank you, that was useful. I wasn't able to observe spillover when omitting inner isolates between parts with the same directionality, but their overall order is indeed affected. So if we're in an LTR context, and the logical order of our message is L1, R1, R2, L2
then if we isolate each part, the displayed order is as expected: L1, R1, R2, L2
. However, if we leave out the inner isolation between the RTL parts, then we'd observe L1, R2, R1, L2
.
The only time that isolating markup or controls can be omitted safely is when:
(i) the placeable and the host string have the same base direction (ii) and either all characters in the placeable have the same base direction or the first and last characters are strong "same direction" as the "base direction".
This is why unknown strings need FSI/PDI around them.
Is it FSI
specifically that we should be using, or should we use LRI
and/or RLI
if we do know the directionality of the inner part?
@eemeli
Is it FSI specifically that we should be using, or should we use LRI and/or RLI if we do know the directionality of the inner part?
It is FSI if the direction of the inserted string is unknown. It is LRI or RLI if the direction is known (matching the direction of the string).
I also think that a big part of the discussion is about who is responsible for adding those control characters, or special-bidi-control parts when we format to parts.
It is pretty clear that the "function" should do it, because of situations like this:
Expires on {exp :date}...
Stuff to buy {lst :listformat}...
Where the formatted date needs internal directional control characters. And in the list case you probably want each item in the list isolated.
But should the result of the whole placeholder be wrapped? And if yes, who should do it, the function, or the "engine"?
Here is what I mean: Expires on {exp :date}...
And let's say we want the format to parts result to be:
parts = [
"Expires on ",
ISOLATE_START,
"Nov 11, 2022",
ISOLATE_END,
"..."
]
Should that be done by "the engine" (the part of the MessageFormat implementation that is function agnostic; it only invokes functions and "glues" the results together)? Or is that again the responsibility of the function?
The engine:
for (each part in ast.parts) {
if (part is text) {
result.append(part)
} else if (part is placeholder) {
result.append(ISOLATE_START)
result.append(invoke placeholder.function with options and whatever else we need)
result.append(ISOLATE_END)
}
}
or the function:
for (each part in ast.parts) {
if (part is text) {
result.append(part)
} else if (part is placeholder) {
result.append(invoke placeholder.function with options and whatever else we need)
}
}
I am inclined to say the function is also responsible for that part. The function would know best if its own result needs wrapping or not.
I think there's an alternative to:
parts = [
"Expires on ",
ISOLATE_START,
"Nov 11, 2022",
ISOLATE_END,
"..."
]
We could do:
parts = [
{type: LITERAL, value: "Expires on ", dir: LTR},
{type: DATE, value: 293131221, dir: RTL},
{type: LITERAL, value: ".", dir: LTR},
"..."
]
and allow the consumer to decide on injecting marks.
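A consumer that decides on injecting marks could look something like the Python sketch below. The part shape (lowercase `type`/`dir` values, an already-formatted string in `value`) and the name `parts_to_string` are assumptions for this example, not an agreed data model.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def parts_to_string(parts):
    """Concatenate parts, isolating every non-literal part.

    Hypothetical consumer: literals come from the pattern itself and are
    appended as-is; placeables are wrapped according to their `dir`.
    """
    out = []
    for part in parts:
        if part["type"] == "literal":
            out.append(part["value"])
        else:
            # Known direction picks LRI/RLI; missing or unknown picks FSI.
            opener = {"ltr": LRI, "rtl": RLI}.get(part.get("dir"), FSI)
            out.append(opener + part["value"] + PDI)
    return "".join(out)


parts = [
    {"type": "literal", "value": "Expires on ", "dir": "ltr"},
    {"type": "date", "value": "Nov 11, 2022", "dir": "ltr"},
    {"type": "literal", "value": "...", "dir": "ltr"},
]
formatted = parts_to_string(parts)
```

Because the marks are added only at this step, a consumer that renders to markup can ignore this reducer and use the `dir` values directly.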
@mihnita @eemeli @stasm @aphillips @echeran - thoughts?
I agree that the isolates want to be included in specific parts, not separate elements in the "parts" array. For cases where the direction and language are the same all the way through, it allows implementations to omit isolating controls (or markup or such). For cases where the parts are separately rendered, it allows the caller to extract language and direction metadata for a given span.
If we had an LString
type, the representation would be more like:
parts = [
{type: LITERAL, value: { value: "Expires on ", lang: "en-US", dir: "LTR" }},
{type: DATE, value: someDateValue},
{type: LITERAL, value: { value: ".", lang: "en-US", dir: "LTR" }}
]
The DATE
object would get language and base paragraph direction information from the formatter. The default for lang
would be und
and the default for dir
would be auto
(first-strong).
To @eemeli's point earlier, we could resolve this separately (and potentially later), provided we can agree on the "format-to-string" output. I agree that the code point sequences don't have to be identical to the concatenated toString
output of "format-to-parts", but it would be good if they were at least somewhat consistent :-).
Finally, note that parts
needs to have language and base paragraph direction metadata of its own. The language presumably is the locale of the formatter. The base direction might be provided by the resource provider. (In the case of ICU, we provide a guess at the base direction from the locale, although this is not as holistically provisioned as it might be.)
So, in fact you argue that parts should be:
parts = {
elements: [
{type: LITERAL, value: { value: "Expires on ", lang: "en-US", dir: "LTR" }},
{type: DATE, value: someDateValue},
{type: LITERAL, value: { value: ".", lang: "en-US", dir: "LTR" }}
],
lang: "en-US",
dir: "LTR",
}
right? That's a pretty challenging alteration and incompatible with ECMA-402 FormatToParts, but maybe necessary?
Or we could assume that people can derive lang/dir
from resolvedOptions()
the way they would for getting lang/dir out of DateTimeFormat::formatToParts
?
Coming from resolvedOptions()
sounds right.
A few comments on the following. (Also, I'm assuming that this corresponds to the information relayed back to the client when the caller asks for the 'deep' model, not just toString call.)
parts = {
elements: [
{type: LITERAL, value: { value: "Expires on ", lang: "en-US", dir: "LTR" }},
{type: DATE, value: someDateValue},
{type: LITERAL, value: { value: ".", lang: "en-US", dir: "LTR" }}
],
lang: "en-US",
dir: "LTR",
}
First, I'm not sure you need the deep structure; flatter is usually simpler.
parts = {
elements: [
{type: LITERAL, value: "Expires on ", lang: "en-US", dir: "LTR"},
{type: DATE, value: someDateValue},
{type: LITERAL, value: ".", lang: "en-US", dir: "LTR"}
],
lang: "en-US",
dir: "LTR",
}
Secondly, language tagging can indeed give better results for a block of text being Chinese vs Japanese.
However, I think fine-grained tagging for language in constructed messages is usually unnecessary, and often counter-productive. In practice you really don't want a message to a Japanese person to contain a placeholder-substitution that is in a Chinese font. Nor do you typically want a constructed message for a user's language to have a segment that line-breaks or hyphenates differently than that user would expect for their language.
Do you want some German Zuk- ker?
When a message gets constructed, you really want all the pieces of the message to be in the same language wherever possible. I don't want a Czech date in the parts above, but one that is really for en-US.
There are exceptions. If I'm getting voice directions to Zug, I'd like to hear /teɪk ðə nɛkst raɪt təˈwɔrdz tsuk/. But only in the case that the system knows that I speak both English and German; otherwise /zʌg/ is probably best. So only in exceptional cases do you need the lang value to be different than the overall language of the message, and only in exceptional cases do you want the language of the message to be different than the language that you ask the message to be constructed for. So a typical case would be that the language can be omitted from the enclosing parts.
parts = {
elements: [
{type: LITERAL, value: "Expires on ", dir: "LTR"},
{type: DATE, value: someDateValue},
{type: LITERAL, value: ".", dir: "LTR"}
],
lang: "en-US",
dir: "LTR",
}
For BIDI as well, it is only necessary to convey the status of a piece that differs from the enclosing parts; so those can also be optional in the cited case.
parts = {
elements: [
{type: LITERAL, value: "Expires on "},
{type: DATE, value: someDateValue},
{type: LITERAL, value: "."}
],
lang: "en-US",
dir: "LTR",
}
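One way to read the inheritance idea is sketched below in Python. `resolve_metadata` is a hypothetical helper, and the fallbacks of `und` and `auto` are taken from the defaults suggested earlier in the thread. An element's own `lang`/`dir` wins, otherwise the value comes from the enclosing parts.

```python
def resolve_metadata(parts):
    """Return elements with `lang` and `dir` filled in by inheritance.

    Hypothetical helper: `parts` is a dict with an "elements" list and
    optional top-level "lang" and "dir" keys.
    """
    default_lang = parts.get("lang", "und")
    default_dir = parts.get("dir", "auto")
    resolved = []
    for element in parts["elements"]:
        resolved.append({
            **element,
            # An explicit value on the element overrides the inherited one.
            "lang": element.get("lang", default_lang),
            "dir": element.get("dir", default_dir),
        })
    return resolved


parts = {
    "elements": [
        {"type": "literal", "value": "Expires on "},
        {"type": "date", "value": "Nov 11, 2022", "lang": "ar", "dir": "rtl"},
        {"type": "literal", "value": "."},
    ],
    "lang": "en-US",
    "dir": "ltr",
}
resolved = resolve_metadata(parts)
```

Omitting the attributes on an element only saves repetition; the resolved direction is still available to whatever decides about isolation.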
Now, I do think it would be useful to have examples of:
When a message gets constructed, you really want all the pieces of the message to be in the same language wherever possible. I don't want a Czech date in the parts above, but one that is really for en-US.
In general, I agree. However:
You purchased the book "HTML و CSS: تصميم و إنشاء مواقع الويب"
My example was somewhat pedantic about lang/dir metadata because I'm thinking in terms of "attributed strings" or "attributed values". There can be (and should be) an inheritance model so that data does not need to be replicated on every level. But we need the ability to tag data as appropriate.
The implementation we made when I was at Amazon tied the resource format and formatter together. The template structure used selectors (just as we've moved to selectors in MFv2) which resolved to a pattern string (by evaluating plurals, selects, and such) and the resulting pattern string was in a single language and had a single base direction. When we look at parts
(as above in this thread), the language and direction on literals that come from the pattern string itself are entirely redundant--the parts only exist because placeables appear inside the template (inside can be at either end, please note), causing us to have "parts" of the template expressed as separate literals. It's the placeables that need bidi isolation and language markup, not the literals (which can only ever be in one language with one base paragraph direction unless one is being stoopidly cute).
Does that make sense?
For BIDI as well, it is only necessary to convey the status of a piece that differs from the enclosing parts; so those can also be optional in the cited case.
This is not correct. Even if the base direction is the same, there are cases where isolation of placeables is desirable to prevent spillover effects. Consider the example The price is ${price} + ${shipping} in shipping
in Arabic:
السعر 1,234.56 AED + 12.99 USD الشحن
This should render:
السعر 1,234.56 AED + 12.99 USD الشحن
Note that the second string has RLI/PDI around the placeables--but all of the "parts" are RTL!! The presence of LTR characters and numbers in the currency values does not mean that their locale is not ar-AE
or that their base direction is not RTL. Also enclosing and ending punctuation positioning depends on direction.
This is not correct. Even if the base direction is the same, there are cases where isolation of placeables is desirable to prevent spillover effects.
I'm not saying that. What I was saying is that you don't need to carry the info in the element explicitly; you can inherit it from the parts. That doesn't mean that the information dir: "LTR" isn't there, nor that it can't be used to avoid spillover effects. It just means that you can get dir: "LTR" from the parts.
My example was somewhat pedantic about lang/dir metadata because I'm thinking in terms of "attributed strings" or "attributed values". There can be (and should be) an inheritance model so that data does not need to be replicated on every level. But we need the ability to tag data as appropriate.
I'm not saying that we don't need the ability to tag BIDI; if the dir on an element isn't equal to a dir on the parts, it needs to be present. That is, I agree with your statement "There can be (and should be) an inheritance model so that data does not need to be replicated on every level. But we need the ability to tag data as appropriate."
data inserted into a string can be in a different language, e.g. You purchased the book "HTML و CSS: تصميم و إنشاء مواقع الويب"
On the other hand, I still doubt that the lang attribute is particularly useful. I'm not against having it be an optional attribute. I just have yet to see a convincing case where it is required (as I noted earlier). And in the case you give here, I don't see that it is. An example would help: especially given that the data sources will often not have that information, what would the process do in the presence of the lang attribute that it wouldn't do otherwise?
It's the placeables that need bidi isolation and language markup, not the literals
I'm a bit confused. In the examples you had, the someDateValue didn't have the extra attributes while the literals did. I'm guessing that the someDateValue was a stand-in for a tuple that did have the attributes. Is that the case?
I just have yet to see a convincing case where it is required (as I noted earlier).
We know, and you listed it yourself, that we'll want it for TTS.
I'm ok with it being optional, as it won't be used by toString reducer.
I'm fine with optional.
Since placeables can be of mixed directionality, I'd like to suggest that Fluent's FSI/PDI insertion for string placeholders is added to requirements.
This allows a variable like
userName
to be inserted in a string with different directionality and inform the layout of the possible direction change.
W3C backlog: https://www.w3.org/International/articles/inline-bidi-markup/
Fluent wiki: https://github.com/projectfluent/fluent/wiki/BiDi-in-Fluent