Closed stasm closed 5 years ago
Note that StringExpressions
must still recognize at least \"
and \\
, and I wouldn't mind keeping \uXXXX
there either.
Regardless of the outcome of the discussion in #115, I'd like to go ahead with this issue and remove all escape sequences from TextElement
and rely on StringLiterals
for edge-cases.
StringLiterals
({"\uXXXX"}
} or as HTML entities. {
can be written as {"{"}
.This proposal leverages StringLiterals
to emphasize the use of any special characters, which is consistent with how Fluent encourages explicit whitespace with {" "}
.
If we introduce a new Unicode literal in #115, then it can be used inside of Placeable
just like StringLiterals
can right now.
I'm not fond of a proposal that keeps \
for something (quotedstring), but then doesn't allow it in places where you want to escape, too.
No HTML entities, please. {"\uXXXX"}
is totally weird to me. Probably because my brain doesn't come primed with "everything in a placeable is fine".
My expectation is to make writing text easy, not just possible.
Spelling out a Unicode escape is already not-easy and in fact, it's also harder to spot while reading. Fluent should encourage most Unicode characters to be written as-is, which also means that Unicode escapes are edge-cases.
I think we're in the make complex things possible territory.
A couple of revised thoughts:
To me it's straight-forward that there are things working in text that don't work in quoted-text. Simply by the virtue that quoted text is quoted means that quotes need special treatment.
The opposite direction, things that work in quoted text but not in normal text, is rubbing me backwards.
An additional aspect is that we're forgiving in unreadable content, and more so now in the end-game of Fluent 1.0. Invalidating a message just because the unicode escape isn't in a placeable is weird.
Also, in normal text, a unicode literal would be in a string literal, but in a string literal, it wouldn't be.
I do like the idea to encourage people to put unicode escapes into string literals if possible, and I think it'd be a good use of a compare-locales warning to encourage to use {"\u00a0"}
instead of just \u00a0
.
To me it's straight-forward that there are things working in text that don't work in quoted-text. Simply by the virtue that quoted text is quoted means that quotes need special treatment.
I agree with the second sentence. I feel differently about the first one. It's because string literals already require special handling of escaped quotes that I see an opportunity to streamline text elements.
I do like the idea to encourage people to put unicode escapes into string literals if possible, and I think it'd be a good use of a compare-locales warning to encourage to use {"\u00a0"} instead of just \u00a0.
I think it's generally agreed that string literals are easier to scan for and notice. Compare:
toc1 = Terms\u00a0and\u00a0Conditions
toc2 = Terms{"\u00a0"}and{"\u00a0"}Conditions
Furthermore, forbidding toc1
makes implementing tools for Fluent easier, because they don't have to run extra logic to check for escapes in TextElements
. In case of Pontoon, for instance, the nbsps in toc2
would get automatically highlighted by the virtue of being placeables.
Also, in normal text, a unicode literal would be in a string literal, but in a string literal, it wouldn't be.
Which is OK in my book, because string literals are not meant to hold too much translation content anyways. It should go into text elements. Localizers should prefer Terms{"\u00a0"}and{"\u00a0"}Conditions
to {"Terms\u00a0and\u00a0Conditions"}
. String literals are meant to solve the edge cases (like leading whitespace), and are also suitable everywhere where text is not part of the translation content (call expression arguments, maybe variant keys, etc.). The nice thing about them is that they also visually emphasize the fact that they're special.
I'd like to make a decision about this soon, ideally early next week. Right now I'm 90% in support of removing all escapes sequences from TextElements
.
Just dawned on me that if we were to make the grammar simpler,
toc1 = Terms\u00a0and\u00a0Conditions
would actually need to render as Terms\u00a0and\u00a0Conditions
, literally.
Otherwise we need to explicitly forbid the escape sequences in text that are allowed in string literals.
I'd be willing to look at a patch if this would actually improve the readability of the spec.
We also hear about string literals as one tool to extend Variant Keys, at which point there might not be a way to denote a readable nbsp as an individual string literal.
Also, I still think that syntax isn't the right hammer for this nail. With 0.7 we made our syntax accepting ugly Fluent, and I think unicode escapes etc are along those lines.
Yes, that was my intent from the beginning, sorry for not making this clear! :) I'll prepare a patch on Monday.
I opened #181 with the proposed change.
@Pike I'd like to move ahead with this proposal. Did you have time to look at the proposed PR?
I looked at the PR, and I learned stuff. That's what lead me to say "We don't have a spec for the resolver." yesterday.
I take it that the intent here is that the following code parses successfully:
msg = This is a <strong>Mozilla</strong>\u00A0<em>Monitor</em> product{"\u00A0"}statement.
and that it'd render as (the ' ' aside, can't get my markdown down):
This is a Mozilla\u00A0 Monitor product statement.
I'd see this behavior as a bug.
Big picture architecture view I developed this week:
What I took from the patch is that it doesn't demonstrate the intended change. So I tried to find what we're doing, and what I found doesn't make me happier :-/ . The way it's currently implemented is by unspecified runtime parser "bugs", i.e., escape_sequences.json
looks structurally different from syntax parser to runtime parser. That bothers me In a post-RC world, where we'd expect other folks to write good runtimes. That also makes my head turn a bit in the realm of https://github.com/projectfluent/fluent/issues/156. If escapes had dedicated AST nodes (in reference and syntax parsers), they would allow for a documentation of their runtime behavior, and could also allow for easier cross-implementation testing. For the specification, having a dedicated production for escape sequences might actually make the spec nicer to read, too?
/* The opening brace in text starts a placeable, \ an escape sequence. */
let text_char =
either(
and(
not(string("{")),
not(string("\\")),
regular_char),
escape_sequence,
string("\u0020"));
let escape_sequence =
either(
LiteralEscape,
UnicodeEscape);
let LiteralEscape =
either(
"\\\\",
"\\{",
"\\\"");
let UnicodeEscape =
sequence(
"\\u",
regex(/[0-9a-fA-F]{4}/));
w/out quoted_text and probably ordering/defer etc. And avoiding to write tests for the POC. And I haven't sprinkled the .abstract
in there, which probably need to happen. And yes, I think adding \"
in general is OK, I made that intentionally to not stab me in the back about consistent expecations between string literals and text.
This is a Mozilla\u00A0 Monitor product statement. I'd see this behavior as a bug.
Can you explain what's buggy about this? I see this as perfectly valid and consistent product of the translation in question. All text outside of { … }
is returned verbatim to the environment, including the HTML markup as well as the \u00A0
after Mozilla
.
The way it's currently implemented is by unspecified runtime parser "bugs", i.e., escape_sequences.json looks structurally different from syntax parser to runtime parser.
The runtime version of escape_sequences.json
represents an intermediate representation of the AST suitable for resolution: escape sequences have been replaced by the characters they represent. It's not a bug, it's an optimization: we only do it once during parse time.
I agree that it would help to have an official spec for the resolver. resolver.js
is as close to it as we can get right now. It's also well tested. But the fact that it consumes an IR rather than the full AST means that it's different from what an actual spec for the resolver would be. I think this is a discussion for a different issue. Can you file it please? Let's keep this one about removing the \
escapes from TextElements
. We still have escapes in StringLiterals
, and the question of how to define the logic of their resolution remains valid even when this issue is closed.
If escapes had dedicated AST nodes (in reference and syntax parsers), they would allow for a documentation of their runtime behavior, and could also allow for easier cross-implementation testing.
I like your suggestion to add the escape_sequence
production to the CST. If we were to have dedicated AST nodes for escape sequences, we'd need to completely re-do StringLiterals
which right now don't allow any descendant elements in their content. Is this what you're suggesting?
Yes, I don't think that StringLiteral
s are atomic. I look at them as text fragments with some additional constraints to TextElement
content. That's my canonical view. I also look at their use as call arguments, and their proposed use in #90, for example.
To use a name that shows up in the grammar, I feel quoted_text
to describe my expectations on them best.
Practically, I'd expect to be able to copy and paste content between quoted_text
and text
.
Yes, I don't think that
StringLiterals
are atomic.
Thanks for filing #195. I agree that Fluent would benefit from some way of defining the expected behavior of known escape sequences. I don't think it should be done with AST nodes, however. Let's continue this conversation in the issue you filed.
Practically, I'd expect to be able to copy and paste content between
quoted_text
andtext
.
Would you require escaping "
in text and escaping {
in string literals? (Given the current definition of special characters in both.)
I made a few changes to the PR implementing this proposal (#181) in https://github.com/projectfluent/fluent/pull/181/commits/5ea35db4a9444dcc1eb773417e7ac28571e1c9fc which are inspired by @Pike's suggestions to lexically define the escape sequences in the grammar from https://github.com/projectfluent/fluent/issues/123#issuecomment-431303286.
regular_char ::= [\\u{9}\\u{20}-\\u{D7FF}\\u{E000}-\\u{FFFD}]
| [\\u{10000}-\\u{10FFFF}]
/* The opening brace in text starts a placeable. */
special_text_char ::= "{"
/* Double quote and backslash need to be escaped in string literals. */
special_quoted_char ::= "\""
| "\\"
text_char ::= regular_char - special_text_char
/* Indented text may not start with characters which mark its end. */
indented_char ::= text_char - "}" - "[" - "*" - "."
literal_escape ::= "\\" special_quoted_char
unicode_escape ::= "\\u" /[0-9a-fA-F]{4}/
/* The literal opening brace { is allowed in string literals because they may
* not have placeables. */
quoted_char ::= (text_char - special_quoted_char)
| special_text_char
| literal_escape
| unicode_escape
@Pike asked me to clarify the impact of this proposal, with examples. I'll start with the role of TextElement
and StringLiterals
.
TextElements
are meant as the good default storage type for translations. Most translations should be possible to author using just TextElements
. Because the Fluent Syntax is a transport format for text content, early on we decided to give text the best syntax there is: no syntax at all. That's why TextElements
don't require any delimiters.
This proposal together with #186 are about making the {}
pair the only special characters inside of TextElements
. Seeing a {
in text should instantly make users think Aha! This is special!.
StringLiterals
are intended for short strings where TextElements
are not suitable for the job, mostly because of the lack of delimiters (e.g. as arguments to functions, maybe as variant keys). They can also be used to emphasize something about the translation which would otherwise be lost to the reader.
{" "}
can be used to denote leading and trailing whitespace, if necessary.{"\u00A0"}
can be used to make the non-breaking space visible in the translation (and automatically highlighted be tools, because it's a placeable).{"{"}
can be used to insert the literal opening brace: {
.{"."}
can be used to insert the literal dot at the beginning of the line, which would otherwise be interpreted as an attribute start.{""}
is an idiom for defining an empty value, which usually means there's something special going on in the translation which needs it.It's the Simplicity principle applied.
Simple things should be simple. Complex things can be possible. The syntax used for describing translations should be easy to read and understand. At the same time it should allow, when necessary, to represent complex concepts from natural languages like gender, plurals, conjugations, and others.
I'd like StringLiterals
to serve two goals. When reading Fluent files, I want localizer to instantly think Aha, there's something special going on here! when they see them. When writing Fluent, I want localizers to think This doesn't work in TextElements
; I need to use a StringLiteral
.
Recognizing backslash escapes in TextElements
goes counter to both of these goals because it creates another way of achieving the intended result in some cases, but not all.
OK, time for some examples. The ↓
points to the result of FluentBundle.format()
.
# █ denotes a non-breaking space.
terms = Terms{"\u00A0"}and{"\u00A0"}Privacy
↓
Terms█and█Privacy
terms = Terms\u00A0and\u00A0Privacy
↓
Terms\u00A0and\u00A0Privacy
open-placeable = Use '{"{"}' to open a placeable.
↓
Use '{' to open a placeable.
# · denotes a leading space.
leading-whitespace = {" "}There are 5 spaces in front of this.
↓
·····There are 5 spaces in front of this.
# █ denotes a non-breaking space.
markup = <em>Terms</em>{"\u00A0}"and{"\u00A0"}<em>Privacy</em>
↓
<em>Terms</em>█and█<em>Privacy</em>
In the browser, the above would render as:
Terms and Privacy
markup = <em>Terms</em>\u00A0and\u00A0<em>Privacy</em> ↓ <em>Terms</em>\u00A0and\u00A0<em>Privacy</em>
In the browser, the above would render as:
Terms\u00A0and\u00A0Privacy
markup = <em>Terms</em> and <em>Privacy</em> ↓ <em>Terms</em> and <em>Privacy</em>
In the browser, the above would render as:
Terms and Privacy
As you can see from these example, only
{…}
has any impact on the result offormat()
. All other text is treated literally and returned as-is. It's then up the bindings or the environment to perform more logic on this literal value.
Thanks for that post. For a hypothetical variant name, we'd have?
city-name = { $city ->
[ "New\u00a0York" ] Nueva{"\u00a0"}York
*[unknown] { $city }
}
↓
Nueva█York
Similar for arguments, though I couldn't make up a use-case there :-/
Yes, that's right.
For the arguments, I could imagine a far far future with JOIN($list, separator: "\u00A0")
.
We currently allow a few backspace-prefixed escape sequences inside of
TextElements
. I think there's an opportunity to simplify the grammar and make the syntax for special characters more consistent and user-friendly by leveragingStringExpressions
.\uXXXX
could be recognized inStringExpressions
or replaced by a new Unicode literal syntax (#115).\{
could be written as{"{"}
, providedStringExpressions
continue to forbid placeables in them. This is consistent with how explicit whitespace can be currently spelled out with{" "}
.\\
would become unnecessary.