Pike commented 6 years ago

Now that we have more than the basic Unicode plane in the Fluent syntax, we should also support them in the Unicode escapes.

I suggest to use 4 or 6 digits, based on earlier conversations.

I wonder if we should exclude surrogate pairs at the same time, to prevent \uD83D\uDE02 in favor of \u01F602? The UTF-16 encoding these imply feels very implementation dependent to me.
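
For reference, the arithmetic behind the surrogate pair mentioned above can be sketched in Python. The function name `combine_surrogates` is made up for this illustration; the formula itself is the standard UTF-16 decoding rule.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE02)
print(f"U+{code_point:04X}")  # U+1F602
```

So the pair \uD83D\uDE02 and the single code point U+1F602 name the same character; the pair is only meaningful to an implementation that thinks in UTF-16.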

stasm commented 6 years ago

Are there characters which are better written as escapes rather than actual glyphs?

> I suggest to use 4 or 6 digits, based on earlier conversations.

I'd like to throw the \u{…} proposal into the mix, too. The number of hexdigits between the braces can be between 1 and 6. Examples: \u{9}, \u{A0}, \u{1F602}.

stasm commented 6 years ago

> I suggest to use 4 or 6 digits, based on earlier conversations.

Do you mean this as two alternatives of the proposal, or a single proposal which accepts both 4- and 6-digit-long sequences?

Pike commented 6 years ago

logical operators, pff. Support 4, 6. Not support 5.

stasm commented 6 years ago

Would "\u00a0ff" be interpreted as <nbsp>ff or as ꃿ?

Pike commented 6 years ago

> Would "\u00a0ff" be interpreted as <nbsp>ff or as ꃿ?

Yeah, that's a problem. Maybe just \U00a0ff ?

unicode_escape      ::= "\\u" /[0-9a-fA-F]{4}/
        | "\\U" /[0-9a-fA-F]{6}/
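
A minimal Python sketch of how the two alternatives in this grammar resolve the "\u00a0ff" ambiguity. The names `UNICODE_ESCAPE` and `unescape` are hypothetical, and the sketch handles only the two Unicode escapes, not the other escape sequences or any validation of the resulting code point.

```python
import re

# \uHHHH (exactly 4 hex digits) or \UHHHHHH (exactly 6 hex digits),
# mirroring the two alternatives of the grammar above.
UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{6})")


def unescape(literal: str) -> str:
    """Replace Unicode escapes in the contents of a string literal."""

    def replace(match: re.Match) -> str:
        digits = match.group(1) or match.group(2)
        # chr() raises ValueError above 0x10FFFF; this sketch lets it propagate.
        return chr(int(digits, 16))

    return UNICODE_ESCAPE.sub(replace, literal)


# Lowercase \u always consumes 4 digits, so "ff" stays literal text.
assert unescape(r"\u00a0ff") == "\u00a0" + "ff"
# Uppercase \U always consumes 6 digits, giving the single code point U+A0FF.
assert unescape(r"\U00a0ff") == "\ua0ff"
```

Because each prefix has a fixed digit count, the reader of the escape never has to guess where it ends.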

I'm not a fan of \u{}, for one because it gives {} a different meaning in that context. I'm also concerned about the amount of work we'd have to throw at it.

stasm commented 6 years ago

I like the \U idea! That's how Python does it, right? Although in case of Python, it expects 8 hex digits after the \U, for UTF-32 I suppose? I had to refresh my memory on how the different Unicode encodings worked (this SO answer was very helpful). IIUC, U+10FFFF is the highest code point which the Unicode standard defines, due to compatibility reasons with UTF-16. If that's the case, expecting 6 digits after \U would make sense to me.

I agree with the point about imbuing more meaning into {}, especially if we go ahead with #123.

zbraniecki commented 6 years ago

@manishearth - do you have any thoughts on this from Rust? In particular, should we go for 6 digits, or 8?

Manishearth commented 6 years ago

Overall languages seem to be moving towards \u{...} because it's unambiguous and less confusing -- \u vs \U is something you have to remember, and the precise variant of this changes across languages.

I'd avoid UTF16 if possible (though I guess it's okay as long as you validate that there aren't any lone surrogates -- and users coming from JS may expect this).

I would go with 6 digits if you pick \U though.
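
As an illustration of the validation mentioned above, here is a small Python sketch of a check that rejects surrogates and out-of-range values. The name `is_scalar_value` is hypothetical, and whether Fluent should reject such escapes is not decided in this thread; the ranges themselves come from the Unicode standard.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point defined by Unicode


def is_scalar_value(code_point: int) -> bool:
    """True if the code point is in range and is not a UTF-16 surrogate."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        return False
    return not 0xD800 <= code_point <= 0xDFFF


assert is_scalar_value(0x1F602)       # a valid astral code point
assert not is_scalar_value(0xD83D)    # lone high surrogate
assert not is_scalar_value(0xFFFFFF)  # fits in 6 hex digits but is out of range
```

Note that 6 hex digits can still express values above U+10FFFF, so a fixed-width \U escape would need a range check of this kind as well.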

stasm commented 6 years ago

> Overall languages seem to be moving towards \u{...} because it's unambiguous and less confusing -- \u vs \U is something you have to remember, and the precise variant of this changes across languages.

I've noticed this too and I like this trend. The \u{...} syntax is explicit and easier to remember than \u vs \U.

In case of Fluent, however, the braces {...} already have another meaning in the syntax; they stand for interpolation. And because we're designing the Fluent syntax with non-technical localizers in mind, we're trying to be careful to not reuse tokens and sigils in different contexts with different meanings.

Fluent also allows astral Unicode characters in its source files, so I expect there will be little need to use escape sequences for codepoints requiring more than 4 hex digits. Their addition has been proposed for completeness' sake and to make it possible to encode them without resorting to surrogate pairs.

I think we should go ahead with \UXXXXXX.

Manishearth commented 6 years ago

Rust allows for all code points in source files too; the reason escapes exist is to let people specify them explicitly, especially in cases where there are invisible code points.

You can also do something like \u[..], pick a brace syntax

stasm commented 6 years ago

To summarize: We could either have two syntaxes:

terms-u = Terms{"\u00A0"}and{"\u00A0"}Conditions
terms-U = Terms{"\U0000A0"}and{"\U0000A0"}Conditions

Or a single one using some kind of delimiters:

terms-brace = Terms{"\u{A0}"}and{"\u{A0}"}Conditions
terms-bracket = Terms{"\u[A0]"}and{"\u[A0]"}Conditions
terms-paren = Terms{"\u(A0)"}and{"\u(A0)"}Conditions
terms-angle = Terms{"\u<A0>"}and{"\u<A0>"}Conditions

Or perhaps just one always requiring 6 hex digits:

terms-one-u = Terms{"\u0000A0"}and{"\u0000A0"}Conditions

In the last case, we could even consider dropping the u prefix. The only other escape sequences which are currently supported are \\ and \". This would effectively reserve prefixes 0-9 and a-f.

terms-drop-u = Terms{"\0000A0"}and{"\0000A0"}Conditions

(The above could also be considered for the variants with delimiters.)

Taking a step back: the primary use-case of Unicode escape sequences is to be able to use invisible or whitespace characters in translations such that they are clearly visible to reviewers and other translators. For all visible characters or combinations of characters, localizers and developers should be encouraged to use the actual Unicode graphemes.

Given the above use-case, the syntax of escape sequences in Fluent doesn't have to be succinct, but it should be easily recognizable as something special. Localizers familiar with the concept of escape sequence will benefit from the syntax being similar to syntaxes they know from other languages. Other localizers will edit the translations around the escapes or copy them from other places.

Pike commented 6 years ago

We need the ability to have composed unicode escapes and regular text for call arguments, and possibly variant names in the future, right?

stasm commented 6 years ago

Yes, but with a note that Unicode escapes are primarily intended to represent whitespace and invisible characters. In other words, using made-up examples of call arguments: JOIN($list, separator: "\u00A0") but: DECORATE($text, with: "✨").

jfkthame commented 6 years ago

> Yes, but with a note that Unicode escapes are primarily intended to represent whitespace and invisible characters.

Or for clarity when using characters whose glyphs may be visually ambiguous. If I see "–" in the source, I may be unsure exactly which dash it is; whereas "\u2013" is unquestionably an en-dash.

Of the options above, I would favor either the "\uXXXX" and "\UXXXXXX" pair, or "\u{...}" with up to 6 digits. These are widely familiar from other contexts, which helps a lot with recognition. (Don't force the use of 6 digits in all cases; that would make familiar codepoints like the 20xx block look quite unfamiliar.)

stasm commented 6 years ago

Thanks, everyone, for your input. It looks like everyone agrees that we should base the syntax of the Unicode escapes on existing solutions to maximize the chance that localizers are familiar with them.

The choice between the \uXXXX and \UXXXXXX pair, and the \u{…} syntax is a hard one for me. I see benefits to using both. Re. the \u{…} syntax, I was initially worried that re-using the braces here would be confusing because they already have another meaning in Fluent, but now I could argue that it's just another special use for them. They're still special, which is OK to me.

I wanted to see both approaches in action, and I prepared two PRs.

I opened #201 which adds the \UXXXXXX syntax to the existing \uXXXX one.

character-A = {"\u0041"}
face-with-tears-of-joy = {"\U01F602"}
terms = Terms{"\u00A0"}and{"\u00A0"}Conditions
copy = © 1998{"\u2013"}2018

I also opened #202 which changes the syntax to \u{…}.

character-A = {"\u{41}"}
face-with-tears-of-joy = {"\u{1F602}"}
terms = Terms{"\u{00A0}"}and{"\u{00A0}"}Conditions
copy = © 1998{"\u{2013}"}2018

In case of \u{…}, I think we should encourage serializers to left-pad codepoints below 4 digits with zeros. This looks like a common practice, used even in the charts published by Unicode.

# Both are valid but `padded` is preferred.
short = {"\u{41}"}
padded = {"\u{0041}"}
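
The left-padding suggested above could look roughly like this in a serializer. This is only a sketch; `serialize_braced_escape` is a made-up name, and the choice of uppercase hex digits is an assumption.

```python
def serialize_braced_escape(code_point: int) -> str:
    """Format a code point as \\u{...}, left-padded with zeros to 4 hex digits."""
    # The 04X format pads short values and leaves longer ones untouched.
    return f"\\u{{{code_point:04X}}}"


print(serialize_braced_escape(0x41))     # \u{0041}
print(serialize_braced_escape(0x1F602))  # \u{1F602}
```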

The benefits of the \u{…} are obvious when more characters are included in the StringLiteral. This might happen in function arguments, or in variant keys (#90), although there aren't currently many use-cases for it. Consequently, the examples below are contrived.

# A contrived example. This should use a numeric offset or an abbreviation.
now1 = It is {DATETIME($time, timezone: "Hawaii\u2013Aleutian Time Zone")} right now.
now2 = It is {DATETIME($time, timezone: "Hawaii\u{2013}Aleutian Time Zone")} right now.

# Another contrived example. A country code would be a better choice for the selector.
historic-countries1 = { $name ->
    ["Austria\u2013Hungary"] ...
}
historic-countries2 = { $name ->
    ["Austria\u{2013}Hungary"] ...
}

zbraniecki commented 6 years ago

I'm in favor of \u{XXXX} mainly because otherwise I'm afraid of \u2a2attention - trying to guess where the \u ends.

Pike commented 6 years ago

Looking at the tests of mishaps in #202, I extended them by an actual hex example:

num = \u{41}
msg = \u{a0}

yields num to be a NumberLiteral and msg to be a MessageReference. All of that parses fine, just creates runtime situations.

To me those fall out from the ambiguous use of {} if we use them as unicode escape delimiters.

For that, I prefer {"\u1324"} and {"\U123456"}.

Manishearth commented 6 years ago

Wait, why does it parse as a MessageReference?

stasm commented 6 years ago

Because Unicode escape sequences are not valid in text (they are only in quoted StringLiterals, #123) and because a0 is a valid identifier. msg = \u{a0} parses as a Pattern of two elements: TextElement {value: "\\u"} and Placeable {expression: {MessageReference {id: "a0"}}}.
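
To make the parse described above concrete, here is a toy Python sketch that splits a pattern into text and placeables. It is not the Fluent parser: `toy_parse_pattern` and its regexes are simplified assumptions for illustration, and it ignores quoted StringLiterals, nesting, and every other expression type.

```python
import re

# Simplified stand-ins for a placeable, a number literal and an identifier.
PLACEABLE = re.compile(r"\{([^{}]*)\}")
NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?")
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9_-]*")


def toy_parse_pattern(value: str) -> list:
    """Split a pattern into (kind, content) pairs, treating {...} as placeables."""
    elements = []
    position = 0
    for match in PLACEABLE.finditer(value):
        if match.start() > position:
            # Backslashes are ordinary characters in text, so "\u" stays as-is.
            elements.append(("TextElement", value[position:match.start()]))
        inner = match.group(1).strip()
        if NUMBER.fullmatch(inner):
            kind = "NumberLiteral"
        elif IDENTIFIER.fullmatch(inner):
            kind = "MessageReference"
        else:
            kind = "Other"
        elements.append((kind, inner))
        position = match.end()
    if position < len(value):
        elements.append(("TextElement", value[position:]))
    return elements


print(toy_parse_pattern(r"\u{41}"))  # [('TextElement', '\\u'), ('NumberLiteral', '41')]
print(toy_parse_pattern(r"\u{a0}"))  # [('TextElement', '\\u'), ('MessageReference', 'a0')]
```

Both inputs are accepted without a syntax error, which is why the mistake only surfaces at runtime.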

zbraniecki commented 5 years ago

I think we should error on both.

stasm commented 5 years ago

Would you want to make the backslash illegal in TextElements? Or something else?

Pike commented 5 years ago

One more data point: we already have strings with {"\u00a0"}, so keeping that logic and just adding \U will be easier to implement from a data compatibility point of view.

zbraniecki commented 5 years ago

> Would you want to make the backslash illegal in TextElements? Or something else?

I would make \u illegal in TextElements I think.

stasm commented 5 years ago

> I would make \u illegal in TextElements I think.

The big win of #123 is that the only special characters in TextElements are now the curly braces. I prefer to keep it that way and not introduce exceptions, like \u, which steepen the learning curve and make the syntax harder to discover.

I'd like to go ahead with \uHHHH and \UHHHHHH. I see how the \u{...} syntax can help in some cases, but I predict that these cases will be very rare. In most cases where a Unicode escape is needed, it's to encode a single character for visibility purposes. Using a placeable is a great tool to achieve visibility: copy = © 1998{"\u2013"}2018 makes the escape sequence stand out. Adding two more characters to this syntax ({"\u{2013}"}) adds visual clutter for no significant benefit.

stasm commented 5 years ago

#201 is the PR adding support for the `\UHHHHHH` escape sequence. I'll wait until Friday before merging it.

projectfluent / fluent

Unicode Escapes should cover more unicode planes #194
