projectfluent / fluent

Fluent — planning, spec and documentation
https://projectfluent.org
Apache License 2.0

Handling bidi text? #274

Open stasm opened 5 years ago

stasm commented 5 years ago

Should all placeables be wrapped in bidi isolates? Perhaps just VariableReferences?

zbraniecki commented 5 years ago

Probably all Intl formatters as well, since they may end up using a different locale in fallback scenarios, and may require isolation even in a single-locale scenario. See https://github.com/tc39/ecma402/pull/290

stasm commented 5 years ago

Sounds reasonable, thanks.

Given our EBNF:

InlineExpression    ::= StringLiteral
                      | NumberLiteral
                      | FunctionReference
                      | MessageReference
                      | TermReference
                      | VariableReference
                      | inline_placeable

it would appear that we need to wrap VariableReference and FunctionReference, while other expressions are probably safe to leave as they are?
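As a rough illustration of the proposal at this point in the thread (not Fluent's actual implementation), the rule could be sketched as below. The function name `wrap_placeable` and the use of plain strings for expression kinds are assumptions made for the example; the kind names come from the EBNF above.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Expression kinds that this proposal would isolate.
# All other kinds in the EBNF are left unwrapped.
ISOLATED_KINDS = {"VariableReference", "FunctionReference"}


def wrap_placeable(kind: str, value: str) -> str:
    """Wrap an already-formatted placeable in FSI/PDI if its kind calls for it.

    Hypothetical helper: `kind` is the EBNF production name of the expression.
    """
    if kind in ISOLATED_KINDS:
        return f"{FSI}{value}{PDI}"
    return value
```

Under this sketch, `wrap_placeable("VariableReference", "Anna")` gains the two control characters, while `wrap_placeable("TermReference", "Firefox")` is returned unchanged.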

zbraniecki commented 5 years ago

Yeah, that sounds right. @srl295 - do you have any pointers or experience with interpolation and bidi? If we localize the NumberLiteral using Intl.NumberFormat to the same locale as the main string is in, is there any reason to put FSI/PDI around it?

aphillips commented 5 years ago

Note that FSI/PDI (and other isolating controls) have weak support in browsers currently (see https://w3c.github.io/i18n-tests/results/bidi-algorithm#rli_etc). I get very poor results with isolating controls in Chrome, for example. Here's Chrome (left, incorrect) and FF (right):

[image: screenshot of the HTML below rendered in Chrome (left) and Firefox (right)]

HTML:

<p dir=rtl>&#x644;&#x2067;-1234.56&#x2069;&#x645;</p> <!-- RLI/PDI -->
<p dir=rtl>&#x644;&#x2066;-1234.56&#x2069;&#x645;</p> <!-- LRI/PDI -->
<p dir=rtl>&#x644;&#x2068;-1234.56&#x2069;&#x645;</p> <!-- FSI/PDI -->
<p dir=rtl>&#x644;-1234.56&#x645;</p>                 <!-- no controls -->

@zbraniecki Note that number strings often include leading/trailing punctuation (neutrals) and that digits are often left-to-right. The point of using isolating controls is that they establish a separate base direction linked to the locale of the inserted string, so that the inserted string doesn't affect the containing string's layout (this eliminates the "spillover effects" that can occur with the non-isolating controls).

You probably don't want to use first-strong heuristics (that is, FSI) when you know the direction of the placeable (i.e. it was made by a formatter) but instead want to use the direction of the formatter's locale (so RLI or LRI--which don't work any better in several major browsers). When you don't know the direction of the placeable you can use FSI in the absence of direction metadata (but it is better to have direction metadata or infer it from the language of the data if that's available). See String-Meta for more details.
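The choice described above could be sketched roughly as follows. This is an illustration only: `isolate`, `direction_for_locale`, and the small `RTL_LANGUAGES` table are hypothetical names invented for the example, and the table is deliberately incomplete (it also ignores script subtags).

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Hypothetical, incomplete table of right-to-left language subtags.
# A real implementation would take this from locale data instead.
RTL_LANGUAGES = {"ar", "fa", "he", "ur"}


def direction_for_locale(locale):
    """Infer "ltr" or "rtl" from a locale tag, or None if no tag is given."""
    if not locale:
        return None
    language = locale.replace("_", "-").split("-")[0].lower()
    return "rtl" if language in RTL_LANGUAGES else "ltr"


def isolate(text, direction=None):
    """Use LRI/RLI when the direction is known; fall back to FSI otherwise."""
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return f"{opener}{text}{PDI}"
```

For a number produced by a formatter with a known locale, one would call `isolate(formatted, direction_for_locale(formatter_locale))`; for a string of unknown origin, `isolate(value)` falls back to FSI.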

Assuming we can get implementations fixed, then yes all placeables should be wrapped in isolating controls.

Pike commented 5 years ago

@spookylukey wrote a couple of interesting comments around isolation and attributes in https://github.com/django-ftl/python-fluent/blob/implement_escapers/fluent.runtime/docs/escaping.rst.

I've looked at that branch in particular because I think there's some conceptual overlap between the needs of bidi isolation and HTML escaping. I keep thinking that we might want to extract both algorithms to a post-format step, if format returned an iterable that provided enough metadata for these algorithms to do their respective jobs.

Which also provides all my thoughts on #273.
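One way to picture the post-format idea is the sketch below. It is not how python-fluent or any Fluent implementation works today: the `Part` type and both generator functions are hypothetical, and the only metadata carried here is whether a part came from a placeable.

```python
import html
from dataclasses import dataclass
from typing import Iterable, Iterator

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


@dataclass
class Part:
    """One piece of a formatted message (hypothetical structure)."""
    text: str
    is_placeable: bool


def escape_parts(parts: Iterable[Part]) -> Iterator[Part]:
    """HTML-escape only the text that came from placeables."""
    for part in parts:
        if part.is_placeable:
            yield Part(html.escape(part.text), True)
        else:
            yield part


def isolate_parts(parts: Iterable[Part]) -> Iterator[Part]:
    """Wrap only the text that came from placeables in FSI/PDI."""
    for part in parts:
        if part.is_placeable:
            yield Part(f"{FSI}{part.text}{PDI}", True)
        else:
            yield part


# Stand-in for what a format step might return for "Hello, { $user }!".
parts = [Part("Hello, ", False), Part("<b>Ann</b>", True), Part("!", False)]
result = "".join(p.text for p in isolate_parts(escape_parts(parts)))
```

Because both steps consume and produce the same iterable of parts, they can be composed, reordered, or skipped independently of the formatter.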

zbraniecki commented 5 years ago

Thanks @aphillips !

> Assuming we can get implementations fixed, then yes all placeables should be wrapped in isolating controls.

Do you mean all placeables, or just the ones we listed?

Currently, we wrap all placeables in FSI/PDI, because we assume that directionality within the placeable may be different than the surrounding text.

For example, if my string is in ar, and I use Ecma402 Intl.NumberFormat to format the number, I still may end up with a different directionality (for example, if ar data is absent) for the number than for the surrounding text.

My current thinking is that we can skip the isolation for StringLiteral, MessageReference and TermReference. Those 3 are reliably guaranteed to match the directionality of the pattern.

For variables, functions and numbers (which are functions behind the scenes) I'd prefer to keep the FSI/PDI.

aphillips commented 5 years ago

@zbraniecki Actually, I do mean all placeables--and especially the ones that involve placing strings inside of other strings--so precisely StringLiteral, MessageReference, and maybe TermReference. We try to illustrate the problems in String-Meta here.

> Those 3 are reliably guaranteed to match the directionality of the pattern.

Why do you believe this to be the case?

It would be even better, of course, to replace FSI with LRI or RLI if the placeable's base direction is known (which in most cases it should be). This helps with placeables that have opposite direction initial sequences (the HTML و CSS example in String-Meta was chosen to help illustrate this).

zbraniecki commented 5 years ago

> Why do you believe this to be the case?

Because they should be in the same locale.

MessageReference in Fluent happens in such a case:

close-window = Close Window
close-window-command = Click { close-window } to close the window.

In such a case, I'd expect there to be a soft guarantee that both messages are in the same script and share directionality. A similar situation happens with terms.

As for StringLiteral, an example would be:

padded-text = { "     " } This phrase is padded with 6 spaces.

Some time ago, the decision was made to use string literals for start-padding of strings (otherwise Fluent will cut out the pre-padding). Since the translation and the literal come in the same locale and are bound together, I see a pretty good chance that they share directionality as well.

Formatting them to "\u2068 \u2069 This phrase is padded with 6 spaces." feels odd.

As for RLI/LRI: the cases where I want to use isolation are exactly the ones where we don't know what directionality the placeable will take:

hello-world = Hello, { $user }!

Since the $user comes from the code (and maybe from the user), it may have any directionality. I want to wrap it in FSI/PDI to instruct the layout engine to recalculate the directionality of this fragment. Result: "Hello, \u2068فارص\u2069!"

Does it make sense?
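To make the hello-world case concrete, here is a small sketch. `format_hello` simply hard-codes the pattern above, and `first_strong_direction` is a simplified stand-in for the first-strong heuristic that FSI asks the layout engine to apply (it ignores nested isolates and embeddings); both names are made up for this example.

```python
import unicodedata

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def first_strong_direction(text):
    """Return "ltr"/"rtl" from the first strongly directional character.

    Returns None when the text has no strong character (for example a
    plain number); in that case FSI behaves like LRI.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def format_hello(user):
    """Format `hello-world = Hello, { $user }!` with the variable isolated."""
    return f"Hello, {FSI}{user}{PDI}!"
```

With this sketch, `format_hello("فارص")` yields the result quoted above, and `first_strong_direction` shows which direction the isolated fragment would resolve to for a given `$user` value.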