Closed Pike closed 5 years ago
Are there characters which are better written as escapes rather than actual glyphs?
I suggest to use 4 or 6 digits, based on earlier conversations.
I'd like to throw the \u{…} proposal into the mix, too. The number of hex digits between the braces can be between 1 and 6. Examples: \u{9}, \u{A0}, \u{1F602}.
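To make the proposal concrete, here is a minimal sketch of how a braced escape with 1 to 6 hex digits could be decoded. The names `BRACE_ESCAPE` and `unescape_braced` are made up for illustration and are not taken from any Fluent implementation; the range check against U+10FFFF is an assumption about how out-of-range values would be handled.

```python
import re

# Hypothetical pattern: backslash, "u", then 1 to 6 hex digits inside braces.
BRACE_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]{1,6})\}")


def unescape_braced(text):
    """Replace every \\u{H...} escape (1 to 6 hex digits) with its character."""

    def replace(match):
        code_point = int(match.group(1), 16)
        # Assumption: values above the Unicode maximum are rejected.
        if code_point > 0x10FFFF:
            raise ValueError(f"code point out of range: {match.group(0)}")
        return chr(code_point)

    return BRACE_ESCAPE.sub(replace, text)


# The three examples from the comment: a tab, a no-break space, and an emoji.
print(repr(unescape_braced(r"\u{9}")))
print(repr(unescape_braced(r"\u{A0}")))
print(repr(unescape_braced(r"\u{1F602}")))
```

Because the closing brace marks the end of the escape, any hex-looking text that follows it is left alone.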
I suggest to use 4 or 6 digits, based on earlier conversations.
Do you mean this as two alternatives of the proposal, or a single proposal which accepts both 4- and 6-digit-long sequences?
logical operators, pff. Support 4, 6. Not support 5.
Would "\u00a0ff" be interpreted as <nbsp>ff or as ꃿ?
Would "\u00a0ff" be interpreted as <nbsp>ff or as ꃿ?
Yeah, that's a problem. Maybe just \U00a0ff?
unicode_escape ::= "\\u" /[0-9a-fA-F]{4}/
| "\\U" /[0-9a-fA-F]{6}/
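As a rough illustration of how the two-production grammar above settles the \u00a0ff question, the sketch below decodes both forms with a single regular expression. The helper name `unescape_fixed` is hypothetical, and rejecting values above U+10FFFF is an assumption, since six hex digits can spell larger numbers.

```python
import re

# \u takes exactly 4 hex digits, \U exactly 6, mirroring the grammar above.
UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{6})")


def unescape_fixed(text):
    """Decode fixed-width \\uHHHH and \\UHHHHHH escapes in text."""

    def replace(match):
        digits = match.group(1) or match.group(2)
        code_point = int(digits, 16)
        # Assumption: anything above the Unicode maximum is an error.
        if code_point > 0x10FFFF:
            raise ValueError(f"code point out of range: {match.group(0)}")
        return chr(code_point)

    return UNICODE_ESCAPE.sub(replace, text)


# Lowercase \u stops after four digits, so "ff" stays literal text.
print(repr(unescape_fixed(r"\u00a0ff")))
# Uppercase \U consumes all six digits and yields the single character U+A0FF.
print(repr(unescape_fixed(r"\U00a0ff")))
```

With this reading, the case of the prefix alone decides how many digits are consumed, which is what removes the ambiguity for the parser, though a human reader still has to count digits.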
I'm not a fan of \u{}, for one because it gives {} a different meaning in that context. I'm also concerned about the amount of work we'd have to throw at it.
I like the \U idea! That's how Python does it, right? Although in case of Python, it expects 8 hex digits after the \U, for UTF-32 I suppose? I had to refresh my memory on how the different Unicode encodings worked (this SO answer was very helpful). IIUC, U+10FFFF is the highest code point which the Unicode standard defines, due to compatibility reasons with UTF-16. If that's the case, expecting 6 digits after \U would make sense to me.
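The digit-count argument can be checked with a few lines of arithmetic. This is only a sanity check; the helper names are invented for the example.

```python
import sys

# Highest code point defined by the Unicode standard.
MAX_CODE_POINT = 0x10FFFF

# Python strings use the same upper bound.
assert sys.maxunicode == MAX_CODE_POINT


def hex_digits_needed(code_point):
    """Number of hex digits needed to write code_point without padding."""
    return len(f"{code_point:X}")


def is_valid_code_point(value):
    """Six hex digits can spell values up to 0xFFFFFF, so a range check is still needed."""
    return 0 <= value <= MAX_CODE_POINT


print(hex_digits_needed(MAX_CODE_POINT))   # 6
print(is_valid_code_point(0x10FFFF))       # True
print(is_valid_code_point(0x110000))       # False
```

So six digits are sufficient for every code point, while eight digits, as in Python's \U escape, leave two leading digits that are always zero.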
I agree about the point about imbuing more meaning into {}, especially if we go ahead with #123.
@manishearth - do you have any thoughts on this from Rust? In particular, should we go for 6 digits, or 8?
Overall languages seem to be moving towards \u{...} because it's unambiguous and less confusing -- \u vs \U is something you have to remember, and the precise variant of this changes across languages.
I'd avoid UTF16 if possible (though I guess it's okay as long as you validate that there aren't any lone surrogates -- and users coming from JS may expect this).
I would go with 6 digits if you pick \U though.
Overall languages seem to be moving towards \u{...} because it's unambiguous and less confusing -- \u vs \U is something you have to remember, and the precise variant of this changes across languages.
I've noticed this too and I like this trend. The \u{...} syntax is explicit and easier to remember than \u vs \U.
In case of Fluent, however, the braces {...} already have another meaning in the syntax; they stand for interpolation. And because we're designing the Fluent syntax with non-technical localizers in mind, we're trying to be careful to not reuse tokens and sigils in different contexts with different meanings.
Fluent also allows astral Unicode characters in its source files, so I expect there will be little need to use escape sequences for code points requiring more than 4 hex digits. Their addition has been proposed for completeness' sake and to make it possible to encode them without resorting to surrogate pairs.
I think we should go ahead with \UXXXXXX.
Rust allows for all code points in source files too; the reason escapes exist is to let people specify them explicitly, especially in cases where there are invisible code points.
You can also do something like \u[..], pick a brace syntax.
To summarize: We could either have two syntaxes:
terms-u = Terms{"\u00A0"}and{"\u00A0"}Conditions
terms-U = Terms{"\U0000A0"}and{"\U0000A0"}Conditions
Or a single one using some kind of delimiters:
terms-brace = Terms{"\u{A0}"}and{"\u{A0}"}Conditions
terms-bracket = Terms{"\u[A0]"}and{"\u[A0]"}Conditions
terms-paren = Terms{"\u(A0)"}and{"\u(A0)"}Conditions
terms-angle = Terms{"\u<A0>"}and{"\u<A0>"}Conditions
Or perhaps just one always requiring 6 hex digits:
terms-one-u = Terms{"\u0000A0"}and{"\u0000A0"}Conditions
In the last case, we could even consider dropping the u prefix. The only other escape sequences which are currently supported are \\ and \". This would effectively reserve prefixes 0-9 and a-f.
terms-drop-u = Terms{"\0000A0"}and{"\0000A0"}Conditions
(The above could also be considered for the variants with delimiters.)
Taking a step back: the primary use-case of Unicode escape sequences is to be able to use invisible or whitespace characters in translations such that they are clearly visible to reviewers and other translators. For all visible characters or combinations of characters, localizers and developers should be encouraged to use the actual Unicode graphemes.
Given the above use-case, the syntax of escape sequences in Fluent doesn't have to be succinct, but it should be easily recognizable as something special. Localizers familiar with the concept of escape sequences will benefit from the syntax being similar to syntaxes they know from other languages. Other localizers will edit the translations around the escapes or copy them from other places.
We need the ability to have composed unicode escapes and regular text for call arguments, and possibly variant names in the future, right?
Yes, but with a note that Unicode escapes are primarily intended to represent whitespace and invisible characters. In other words, using made-up examples of call arguments: JOIN($list, separator: "\u00A0") but: DECORATE($text, with: "✨").
Yes, but with a note that Unicode escapes are primarily intended to represent whitespace and invisible characters.
Or for clarity when using characters whose glyphs may be visually ambiguous. If I see "–" in the source, I may be unsure exactly which dash it is; whereas "\u2013" is unquestionably an en-dash.
Of the options above, I would favor either the "\uXXXX" and "\UXXXXXX" pair, or "\u{...}" with up to 6 digits. These are widely familiar from other contexts, which helps a lot with recognition. (Don't force the use of 6 digits in all cases; that would make familiar codepoints like the 20xx block look quite unfamiliar.)
Thanks, everyone, for your input. It looks like everyone agrees that we should base the syntax of the Unicode escapes on existing solutions to maximize the chance that localizers are familiar with them.
The choice between the \uXXXX and \UXXXXXX pair, and the \u{…} syntax is a hard one for me. I see benefits to using both. Re. the \u{…} syntax, I was initially worried that re-using the braces here would be confusing because they already have another meaning in Fluent, but now I could argue that it's just another special use for them. They're still special, which is OK to me.
I wanted to see both approaches in action, and I prepared two PRs.
I opened #201 which adds the \UXXXXXX syntax to the existing \uXXXX one.
character-A = {"\u0041"}
face-with-tears-of-joy = {"\U01F602"}
terms = Terms{"\u00A0"}and{"\u00A0"}Conditions
copy = © 1998{"\u2013"}2018
I also opened #202 which changes the syntax to \u{…}.
character-A = {"\u{41}"}
face-with-tears-of-joy = {"\u{1F602}"}
terms = Terms{"\u{00A0}"}and{"\u{00A0}"}Conditions
copy = © 1998{"\u{2013}"}2018
In case of \u{…}, I think we should encourage serializers to left-pad codepoints below 4 digits with zeros. This looks like a common practice, used even in the charts published by Unicode.
# Both are valid but `padded` is preferred.
short = {"\u{41}"}
padded = {"\u{0041}"}
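A serializer following this recommendation might look roughly like the sketch below. The function name `serialize_braced` is hypothetical, and the choice of uppercase hex digits is an assumption that simply matches the examples in this thread.

```python
def serialize_braced(char):
    """Format a single character as a \\u{...} escape, zero-padded to at least 4 hex digits."""
    # "04X" pads short values with zeros and leaves longer ones untouched.
    return "\\u{" + format(ord(char), "04X") + "}"


print(serialize_braced("A"))           # \u{0041}
print(serialize_braced("\u00a0"))      # \u{00A0}
print(serialize_braced("\U0001F602"))  # \u{1F602}
```

Padding only affects how the escape is written; a parser that accepts 1 to 6 digits reads \u{41} and \u{0041} as the same character.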
The benefits of the \u{…} syntax are obvious when more characters are included in the StringLiteral. This might happen in function arguments, or in variant keys (#90), although there aren't currently many use-cases for it. Consequently, the examples below are contrived.
# A contrived example. This should use a numeric offset or an abbreviation.
now1 = It is {DATETIME($time, timezone: "Hawaii\u2013Aleutian Time Zone")} right now.
now2 = It is {DATETIME($time, timezone: "Hawaii\u{2013}Aleutian Time Zone")} right now.
# Another contrived example. A country code would be a better choice for the selector.
historic-countries1 = { $name ->
["Austria\u2013Hungary"] ...
}
historic-countries2 = { $name ->
["Austria\u{2013}Hungary"] ...
}
I'm in favor of \u{XXXX} mainly because otherwise I'm afraid of \u2a2attention - trying to guess where the \u ends.
Looking at the tests of mishaps in #202, I extended them by an actual hex example:
num = \u{41}
msg = \u{a0}
yields num to be a NumberLiteral and msg to be a MessageReference. All of that parses fine, just creates runtime situations.
To me those fall out from the ambiguous use of {} if we use them as unicode escape delimiters.
For that, I prefer {"\u1324"} and {"\U123456"}.
Wait, why does it parse as a MessageReference?
On Thu, Nov 8, 2018, 5:39 AM Axel Hecht <notifications@github.com> wrote:
Looking at the tests of mishaps in #202 https://github.com/projectfluent/fluent/pull/202, I extended them by an actual hex example:
num = \u{41}
msg = \u{a0}
yields num to be a NumberLiteral and msg to be a MessageReference. All of that parses fine, just creates runtime situations.
To me those fall out from the ambiguous use of {} if we use them as unicode escape delimiters.
For that, I prefer {"\u1324"} and {"\U123456"}.
Because Unicode escape sequences are not valid in text (they are only in quoted StringLiterals, #123) and because a0 is a valid identifier. msg = \u{a0} parses as a Pattern of two elements: TextElement {value: "\\u"} and Placeable {expression: {MessageReference {id: "a0"}}}.
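The behavior described here can be imitated with a deliberately tiny model of pattern parsing. This is not the Fluent parser: the regular expressions, the `classify` and `parse_pattern` helpers, and the tuple-based output are all assumptions made for illustration. The only rules it keeps are that braces are the sole special characters in text, that a run of digits is a number, and that a letter followed by letters, digits, underscores or hyphens is an identifier.

```python
import re

# Toy model: in a pattern, only "{...}" is special; a backslash is plain text.
PLACEABLE = re.compile(r"\{([^{}]*)\}")
NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?")
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9_-]*")


def classify(expression):
    """Guess the expression type the way the discussion describes it."""
    expression = expression.strip()
    if NUMBER.fullmatch(expression):
        return ("NumberLiteral", expression)
    if IDENTIFIER.fullmatch(expression):
        return ("MessageReference", expression)
    return ("Other", expression)


def parse_pattern(value):
    """Split a message value into TextElement and Placeable tuples."""
    elements = []
    position = 0
    for match in PLACEABLE.finditer(value):
        if match.start() > position:
            elements.append(("TextElement", value[position:match.start()]))
        elements.append(("Placeable", classify(match.group(1))))
        position = match.end()
    if position < len(value):
        elements.append(("TextElement", value[position:]))
    return elements


print(parse_pattern(r"\u{a0}"))
# [('TextElement', '\\u'), ('Placeable', ('MessageReference', 'a0'))]
print(parse_pattern(r"\u{41}"))
# [('TextElement', '\\u'), ('Placeable', ('NumberLiteral', '41'))]
```

Both inputs parse without error in this model, which is the point being made: an escape written outside a quoted string silently turns into text plus a placeable.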
I think we should error on both.
Would you want to make the backslash illegal in TextElements? Or something else?
One more data point: we already have strings with {"\u00a0"}, so keeping that logic and just adding \U will be easier to implement from a data compatibility point of view.
Would you want to make the backslash illegal in TextElements? Or something else?
I would make \u illegal in TextElements I think.
I would make \u illegal in TextElements I think.
The big win of #123 is that the only special characters in TextElements are now the curly braces. I prefer to keep it that way and not introduce exceptions, like \u, which steepen the learning curve and hurt the discoverability of the syntax.
I'd like to go ahead with \uHHHH and \UHHHHHH. I see how the \u{...} syntax can help in some cases, but I predict that these cases will be very rare. In most cases where a Unicode escape is needed, it's to encode a single character for visibility purposes. Using a placeable is a great tool to achieve visibility: copy = © 1998{"\u2013"}2018 makes the escape sequence stand out. Adding two more characters to this syntax ({"\u{2013}"}) adds visual clutter for no significant benefit.
\UHHHHHH escape sequence. I'll wait until Friday before merging it.
Now that we have more than the basic Unicode plane in the Fluent syntax, we should also support them in the Unicode escapes.
I suggest to use 4 or 6 digits, based on earlier conversations.
I wonder if we should exclude surrogate pairs at the same time, to prevent \uD83D\uDE02 in favor of \u01F602? The UTF-16 encoding these imply feels very implementation dependent to me.
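For readers unfamiliar with surrogates, the sketch below shows the arithmetic that ties the pair D83D DE02 to the single code point 1F602, together with one possible way to reject surrogate values when decoding an escape. The function names are hypothetical, and raising an error on surrogates is just one of the policies the question above leaves open.

```python
def is_surrogate(code_point):
    """True for the range U+D800..U+DFFF, which is reserved for UTF-16 surrogates."""
    return 0xD800 <= code_point <= 0xDFFF


def combine_surrogates(high, low):
    """Compute the code point that a UTF-16 high/low surrogate pair stands for."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def decode_escape(code_point):
    """Hypothetical strict policy: refuse surrogates instead of pairing them up."""
    if is_surrogate(code_point):
        raise ValueError(f"surrogate U+{code_point:04X} is not allowed in an escape")
    return chr(code_point)


# The pair from the example above maps to the single code point U+1F602.
print(hex(combine_surrogates(0xD83D, 0xDE02)))  # 0x1f602
print(decode_escape(0x1F602) == "\U0001F602")   # True
```

Under a strict policy like this, an author has to write the six-digit form, and the parser never needs to know about UTF-16 at all.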