unicode-org / message-format-wg

Developing a standard for localizable message strings

Simplify source bidi isolation rules #781

Closed eemeli closed 4 months ago

eemeli commented 4 months ago

Drop the bidi rule, and allow name to be isolated with LRI, RLI, or FSI.

Allow an LRI immediately after a non-content newline.

Relax expression & markup isolation to not require pairing on a syntactic level, as the LRI can also be terminated by a newline.

aphillips commented 4 months ago

I wish you'd added this as a separate alternative.

I don't like that the isolates are part of the name rule. I worked hard to keep the isolates outside the rules for important constructs (like name).

You removed unquoted literals from being amenable to bidi isolation, but they should still be isolatable, no?

eemeli commented 4 months ago

> I don't like that the isolates are part of the name rule. I worked hard to keep the isolates outside the rules for important constructs (like name).

Including the isolates in name doesn't change its parsed meaning, much like the | delimiters aren't part of the parsed meaning of a quoted literal. It's the same situation as with isolated expressions, markup and patterns.

> You removed unquoted literals from being amenable to bidi isolation, but they should still be isolatable, no?

They are, covered by the change to name:

unquoted       = name / number-literal

number-literal doesn't need isolation, because we've limited its valid values, so isolating name is enough.

aphillips commented 4 months ago

The problem with allowing isolates into name is that it makes name comparison harder. Shouldn't the following two names be equal?

\u2066name\u2069
name

> number-literal doesn't need isolation, because we've limited its valid values, so isolating name is enough.

Actually, numbers are complicated in bidi because digits are weakly directional. The minus sign can swing around onto the "wrong" side visually.
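A minimal sketch of what "weakly directional" means here, using the Bidi_Class values exposed by Python's standard `unicodedata` module. The helper names `bidi_classes` and `has_strong` are hypothetical and only serve the illustration; the grouping of classes into "strong" follows UAX #9.

```python
import unicodedata

# Strong bidi classes per UAX #9: L (left-to-right), R (right-to-left),
# AL (Arabic letter). Digits and sign characters fall outside this set.
STRONG_CLASSES = {"L", "R", "AL"}


def bidi_classes(text: str) -> list[str]:
    """Return the Bidi_Class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def has_strong(text: str) -> bool:
    """True if any character in text has a strong direction of its own."""
    return any(cls in STRONG_CLASSES for cls in bidi_classes(text))


# HYPHEN-MINUS is ES (European separator) and "1" is EN (European number),
# so "-1" has no strong character and takes its direction from context.
print(bidi_classes("-1"))
print(has_strong("-1"))
```

Because a string like `-1` carries no strong character, its visual ordering depends on the surrounding text, which is why the sign can end up on the unexpected side in an RTL context.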

The other reason I had unquoted and quoted together is that it simplifies what tools have to do. A tool can blindly isolate any literal separate from the decision to quote it and can blindly remove isolates from literals without looking at the contents.

eemeli commented 4 months ago

> The problem with allowing isolates into name is that it makes name comparison harder. Shouldn't the following two names be equal?
>
> \u2066name\u2069
> name

As proposed, both of those strings would match the name rule, but as \u2066 and \u2069 are not valid name-char characters, they would be parsed according to the open-isolate and close-isolate rules, with name-body matching the four-character "name" string in both cases.

So the parsed value of the name would be "name" for both of the above, and they would be considered equal.
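A rough sketch of that parsing behaviour, not the actual grammar: the character class used for the name body is a simplified stand-in for the real name-start and name-char rules, the function `parse_name` is hypothetical, and requiring the open and close isolates to be paired is an assumption made for this illustration.

```python
import re

# LRI (U+2066), RLI (U+2067) and FSI (U+2068) open an isolate; PDI (U+2069)
# closes it. The body pattern is a simplified ASCII-only approximation.
_NAME = re.compile(
    r"(?P<open>[\u2066\u2067\u2068])?"
    r"(?P<body>[A-Za-z_][A-Za-z0-9_.-]*)"
    r"(?P<close>\u2069)?"
)


def parse_name(source: str) -> str:
    """Return the parsed value of a name, excluding any isolate characters."""
    match = _NAME.fullmatch(source)
    # Assumption: an opening isolate must be matched by a closing PDI.
    if match is None or bool(match["open"]) != bool(match["close"]):
        raise ValueError(f"not a valid name: {source!r}")
    return match["body"]


# Both spellings parse to the same four-character value.
print(parse_name("\u2066name\u2069") == parse_name("name"))
```

The isolates are consumed by the surrounding groups and never reach the returned body, so comparison of parsed names needs no extra normalization step.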

> number-literal doesn't need isolation, because we've limited its valid values, so isolating name is enough.

> Actually, numbers are complicated in bidi because digits are weakly directional. The minus sign can swing around onto the "wrong" side visually.

But number-literal only shows up in "code", which is always LTR, yes?

> The other reason I had unquoted and quoted together is that it simplifies what tools have to do. A tool can blindly isolate any literal separate from the decision to quote it and can blindly remove isolates from literals without looking at the contents.

The proposed change doesn't change the number of constructs for which this can be done; it replaces "unquoted literals" with "names". Doing so removes the need to separately pick out LRM/RLM/ALM in the productions that include name.
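For illustration, the kind of "blind" tooling operation discussed here could look roughly like the sketch below. The function names `isolate` and `strip_isolates` are hypothetical, and defaulting to LRI is an assumption; a tool could equally choose RLI or FSI.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"
OPENERS = (LRI, RLI, FSI)


def isolate(text: str, opener: str = LRI) -> str:
    """Wrap text in an isolate pair without inspecting its contents."""
    if opener not in OPENERS:
        raise ValueError("opener must be LRI, RLI or FSI")
    return f"{opener}{text}{PDI}"


def strip_isolates(text: str) -> str:
    """Remove one enclosing isolate pair, if present; otherwise return text as is."""
    if len(text) >= 2 and text[0] in OPENERS and text[-1] == PDI:
        return text[1:-1]
    return text


# Round trip: wrapping and then stripping returns the original string.
print(strip_isolates(isolate("name")) == "name")
```

Neither function looks at what is between the isolates, which is the property both sides of this thread want tools to be able to rely on.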

eemeli commented 4 months ago

As requested, refactored as an alternative to the proposed solution. Also addressed the concerns identified in #787 and #788, and added an example showing how name isolation avoids a spillover the current proposal cannot.

I have also validated this solution by implementing it in my parser.