projectfluent / fluent

Fluent — planning, spec and documentation
https://projectfluent.org
Apache License 2.0
1.4k stars 45 forks source link

Remove backslash escapes from TextElement #123

Closed stasm closed 5 years ago

stasm commented 6 years ago

We currently allow a few backspace-prefixed escape sequences inside of TextElements. I think there's an opportunity to simplify the grammar and make the syntax for special characters more consistent and user-friendly by leveraging StringExpressions.

stasm commented 6 years ago

Note that StringExpressions must still recognize at least \" and \\, and I wouldn't mind keeping \uXXXX there either.

stasm commented 6 years ago

Regardless of the outcome of the discussion in #115, I'd like to go ahead with this issue and remove all escape sequences from TextElement and rely on StringLiterals for edge-cases.

This proposal leverages StringLiterals to emphasize the use of any special characters, which is consistent with how Fluent encourages explicit whitespace with {" "}.

If we introduce a new Unicode literal in #115, then it can be used inside of Placeable just like StringLiterals can right now.

Pike commented 6 years ago

I'm not fond of a proposal that keeps \ for something (quotedstring), but then doesn't allow it in places where you want to escape, too.

No HTML entities, please. {"\uXXXX"} is totally weird to me. Probably because my brain doesn't come primed with "everything in a placeable is fine".

My expectation is to make writing text easy, not just possible.

stasm commented 6 years ago

Spelling out a Unicode escape is already not-easy and in fact, it's also harder to spot while reading. Fluent should encourage most Unicode characters to be written as-is, which also means that Unicode escapes are edge-cases.

stasm commented 6 years ago

I think we're in the make complex things possible territory.

Pike commented 6 years ago

A couple of revised thoughts:

To me it's straight-forward that there are things working in text that don't work in quoted-text. Simply by the virtue that quoted text is quoted means that quotes need special treatment.

The opposite direction, things that work in quoted text but not in normal text, is rubbing me backwards.

An additional aspect is that we're forgiving in unreadable content, and more so now in the end-game of Fluent 1.0. Invalidating a message just because the unicode escape isn't in a placeable is weird.

Also, in normal text, a unicode literal would be in a string literal, but in a string literal, it wouldn't be.

I do like the idea to encourage people to put unicode escapes into string literals if possible, and I think it'd be a good use of a compare-locales warning to encourage to use {"\u00a0"} instead of just \u00a0.

stasm commented 5 years ago

To me it's straight-forward that there are things working in text that don't work in quoted-text. Simply by the virtue that quoted text is quoted means that quotes need special treatment.

I agree with the second sentence. I feel differently about the first one. It's because string literals already require special handling of escaped quotes that I see an opportunity to streamline text elements.

I do like the idea to encourage people to put unicode escapes into string literals if possible, and I think it'd be a good use of a compare-locales warning to encourage to use {"\u00a0"} instead of just \u00a0.

I think it's generally agreed that string literals are easier to scan for and notice. Compare:

toc1 = Terms\u00a0and\u00a0Conditions
toc2 = Terms{"\u00a0"}and{"\u00a0"}Conditions

Furthermore, forbidding toc1 makes implementing tools for Fluent easier, because they don't have to run extra logic to check for escapes in TextElements. In case of Pontoon, for instance, the nbsps in toc2 would get automatically highlighted by the virtue of being placeables.

Also, in normal text, a unicode literal would be in a string literal, but in a string literal, it wouldn't be.

Which is OK in my book, because string literals are not meant to hold too much translation content anyways. It should go into text elements. Localizers should prefer Terms{"\u00a0"}and{"\u00a0"}Conditions to {"Terms\u00a0and\u00a0Conditions"}. String literals are meant to solve the edge cases (like leading whitespace), and are also suitable everywhere where text is not part of the translation content (call expression arguments, maybe variant keys, etc.). The nice thing about them is that they also visually emphasize the fact that they're special.

I'd like to make a decision about this soon, ideally early next week. Right now I'm 90% in support of removing all escapes sequences from TextElements.

Pike commented 5 years ago

Just dawned on me that if we were to make the grammar simpler,

toc1 = Terms\u00a0and\u00a0Conditions

would actually need to render as Terms\u00a0and\u00a0Conditions, literally.

Otherwise we need to explicitly forbid the escape sequences in text that are allowed in string literals.

I'd be willing to look at a patch if this would actually improve the readability of the spec.

We also hear about string literals as one tool to extend Variant Keys, at which point there might not be a way to denote a readable nbsp as an individual string literal.

Also, I still think that syntax isn't the right hammer for this nail. With 0.7 we made our syntax accepting ugly Fluent, and I think unicode escapes etc are along those lines.

stasm commented 5 years ago

Yes, that was my intent from the beginning, sorry for not making this clear! :) I'll prepare a patch on Monday.

stasm commented 5 years ago

I opened #181 with the proposed change.

stasm commented 5 years ago

@Pike I'd like to move ahead with this proposal. Did you have time to look at the proposed PR?

Pike commented 5 years ago

I looked at the PR, and I learned stuff. That's what lead me to say "We don't have a spec for the resolver." yesterday.

I take it that the intent here is that the following code parses successfully:

msg = This is a <strong>Mozilla</strong>\u00A0<em>Monitor</em> product{"\u00A0"}statement.

and that it'd render as (the ' ' aside, can't get my markdown down):

This is a Mozilla\u00A0 Monitor product statement.

I'd see this behavior as a bug.

Big picture architecture view I developed this week:

What I took from the patch is that it doesn't demonstrate the intended change. So I tried to find what we're doing, and what I found doesn't make me happier :-/ . The way it's currently implemented is by unspecified runtime parser "bugs", i.e., escape_sequences.json looks structurally different from syntax parser to runtime parser. That bothers me In a post-RC world, where we'd expect other folks to write good runtimes. That also makes my head turn a bit in the realm of https://github.com/projectfluent/fluent/issues/156. If escapes had dedicated AST nodes (in reference and syntax parsers), they would allow for a documentation of their runtime behavior, and could also allow for easier cross-implementation testing. For the specification, having a dedicated production for escape sequences might actually make the spec nicer to read, too?

/* The opening brace in text starts a placeable, \ an escape sequence. */
let text_char =
    either(
        and(
            not(string("{")),
            not(string("\\")),
            regular_char),
        escape_sequence,
        string("\u0020"));

let escape_sequence =
  either(
     LiteralEscape,
     UnicodeEscape);

let LiteralEscape =
  either(
    "\\\\",
    "\\{",
    "\\\"");

let UnicodeEscape =
  sequence(
    "\\u",
    regex(/[0-9a-fA-F]{4}/));

w/out quoted_text and probably ordering/defer etc. And avoiding to write tests for the POC. And I haven't sprinkled the .abstract in there, which probably need to happen. And yes, I think adding \" in general is OK, I made that intentionally to not stab me in the back about consistent expecations between string literals and text.

stasm commented 5 years ago

This is a Mozilla\u00A0 Monitor product statement. I'd see this behavior as a bug.

Can you explain what's buggy about this? I see this as perfectly valid and consistent product of the translation in question. All text outside of { … } is returned verbatim to the environment, including the HTML markup as well as the \u00A0 after Mozilla.

The way it's currently implemented is by unspecified runtime parser "bugs", i.e., escape_sequences.json looks structurally different from syntax parser to runtime parser.

The runtime version of escape_sequences.json represents an intermediate representation of the AST suitable for resolution: escape sequences have been replaced by the characters they represent. It's not a bug, it's an optimization: we only do it once during parse time.

I agree that it would help to have an official spec for the resolver. resolver.js is as close to it as we can get right now. It's also well tested. But the fact that it consumes an IR rather than the full AST means that it's different from what an actual spec for the resolver would be. I think this is a discussion for a different issue. Can you file it please? Let's keep this one about removing the \ escapes from TextElements. We still have escapes in StringLiterals, and the question of how to define the logic of their resolution remains valid even when this issue is closed.

If escapes had dedicated AST nodes (in reference and syntax parsers), they would allow for a documentation of their runtime behavior, and could also allow for easier cross-implementation testing.

I like your suggestion to add the escape_sequence production to the CST. If we were to have dedicated AST nodes for escape sequences, we'd need to completely re-do StringLiterals which right now don't allow any descendant elements in their content. Is this what you're suggesting?

Pike commented 5 years ago

Yes, I don't think that StringLiterals are atomic. I look at them as text fragments with some additional constraints to TextElement content. That's my canonical view. I also look at their use as call arguments, and their proposed use in #90, for example.

To use a name that shows up in the grammar, I feel quoted_text to describe my expectations on them best.

Practically, I'd expect to be able to copy and paste content between quoted_text and text.

stasm commented 5 years ago

Yes, I don't think that StringLiterals are atomic.

Thanks for filing #195. I agree that Fluent would benefit from some way of defining the expected behavior of known escape sequences. I don't think it should be done with AST nodes, however. Let's continue this conversation in the issue you filed.

Practically, I'd expect to be able to copy and paste content between quoted_text and text.

Would you require escaping " in text and escaping { in string literals? (Given the current definition of special characters in both.)

stasm commented 5 years ago

I made a few changes to the PR implementing this proposal (#181) in https://github.com/projectfluent/fluent/pull/181/commits/5ea35db4a9444dcc1eb773417e7ac28571e1c9fc which are inspired by @Pike's suggestions to lexically define the escape sequences in the grammar from https://github.com/projectfluent/fluent/issues/123#issuecomment-431303286.

regular_char        ::= [\\u{9}\\u{20}-\\u{D7FF}\\u{E000}-\\u{FFFD}]
                      | [\\u{10000}-\\u{10FFFF}]
/* The opening brace in text starts a placeable. */
special_text_char   ::= "{"
/* Double quote and backslash need to be escaped in string literals. */
special_quoted_char ::= "\""
                      | "\\"
text_char           ::= regular_char - special_text_char
/* Indented text may not start with characters which mark its end. */
indented_char       ::= text_char - "}" - "[" - "*" - "."
literal_escape      ::= "\\" special_quoted_char
unicode_escape      ::= "\\u" /[0-9a-fA-F]{4}/
/* The literal opening brace { is allowed in string literals because they may
 * not have placeables. */
quoted_char         ::= (text_char - special_quoted_char)
                      | special_text_char
                      | literal_escape
                      | unicode_escape
stasm commented 5 years ago

@Pike asked me to clarify the impact of this proposal, with examples. I'll start with the role of TextElement and StringLiterals.

TextElements are meant as the good default storage type for translations. Most translations should be possible to author using just TextElements. Because the Fluent Syntax is a transport format for text content, early on we decided to give text the best syntax there is: no syntax at all. That's why TextElements don't require any delimiters.

This proposal together with #186 are about making the {} pair the only special characters inside of TextElements. Seeing a { in text should instantly make users think Aha! This is special!.

StringLiterals are intended for short strings where TextElements are not suitable for the job, mostly because of the lack of delimiters (e.g. as arguments to functions, maybe as variant keys). They can also be used to emphasize something about the translation which would otherwise be lost to the reader.

It's the Simplicity principle applied.

Simple things should be simple. Complex things can be possible. The syntax used for describing translations should be easy to read and understand. At the same time it should allow, when necessary, to represent complex concepts from natural languages like gender, plurals, conjugations, and others.

I'd like StringLiterals to serve two goals. When reading Fluent files, I want localizer to instantly think Aha, there's something special going on here! when they see them. When writing Fluent, I want localizers to think This doesn't work in TextElements; I need to use a StringLiteral.

Recognizing backslash escapes in TextElements goes counter to both of these goals because it creates another way of achieving the intended result in some cases, but not all.

OK, time for some examples. The points to the result of FluentBundle.format().


# █ denotes a non-breaking space.
terms = Terms{"\u00A0"}and{"\u00A0"}Privacy
↓
Terms█and█Privacy

terms = Terms\u00A0and\u00A0Privacy
↓
Terms\u00A0and\u00A0Privacy

open-placeable = Use '{"{"}' to open a placeable.
↓
Use '{' to open a placeable.

# · denotes a leading space.
leading-whitespace = {"     "}There are 5 spaces in front of this.
↓
·····There are 5 spaces in front of this.

# █ denotes a non-breaking space.
markup = <em>Terms</em>{"\u00A0}"and{"\u00A0"}<em>Privacy</em>
↓
<em>Terms</em>█and█<em>Privacy</em>

In the browser, the above would render as:

Terms and Privacy

markup = <em>Terms</em>\u00A0and\u00A0<em>Privacy</em>
↓
<em>Terms</em>\u00A0and\u00A0<em>Privacy</em>

In the browser, the above would render as:

Terms\u00A0and\u00A0Privacy

markup = <em>Terms</em>&nbsp;and&nbsp;<em>Privacy</em>
↓
<em>Terms</em>&nbsp;and&nbsp;<em>Privacy</em>

In the browser, the above would render as:

Terms and Privacy

As you can see from these example, only {…} has any impact on the result of format(). All other text is treated literally and returned as-is. It's then up the bindings or the environment to perform more logic on this literal value.

Pike commented 5 years ago

Thanks for that post. For a hypothetical variant name, we'd have?

city-name = { $city ->
   [ "New\u00a0York" ] Nueva{"\u00a0"}York
  *[unknown] { $city }
}
↓
Nueva█York

Similar for arguments, though I couldn't make up a use-case there :-/

stasm commented 5 years ago

Yes, that's right.

For the arguments, I could imagine a far far future with JOIN($list, separator: "\u00A0").