projectfluent / fluent

Fluent — planning, spec and documentation
https://projectfluent.org
Apache License 2.0

Rules for normalizing multiline text #122

Closed stasm closed 6 years ago

stasm commented 6 years ago

Localizers should be in control of the contents and the formatting of their translations. I'd like to define rules for normalizing multiline text in the runtime to eliminate the surprises caused by how the bindings might normalize text inherently.

Multiline text is currently parsed and stored in the AST with all its whitespace, with the exception of leading and trailing whitespace. This is helpful for serialization and I'd like to keep it that way.

At the same time, the Fluent spec doesn't mandate any normalization of the multiline text. A multiline translation showing up in an HTML UI will have all its newlines normalized to a single space. On the other hand, a multiline translation which is displayed with an alert() will preserve all its newlines verbatim.
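The difference between the two environments can be sketched in a few lines of Python. This is only an illustration: `collapse_whitespace` is a hypothetical helper that roughly approximates default HTML whitespace collapsing, and it is not part of any Fluent implementation.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Approximate default HTML rendering: every run of spaces, tabs and
    newlines is displayed as a single space."""
    return re.sub(r"[ \t\r\n\f]+", " ", text)


# The value of a two-line message, as a binding might hand it to the UI.
value = "A\ndog"

html_like = collapse_whitespace(value)  # the newline is shown as a space
verbatim = value                        # alert()-style output keeps the newline
```

The same stored value therefore reads as one line in one environment and as two lines in the other, which is the surprise described above.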

Pike commented 6 years ago

This is both backwards and forwards incompatible, right?

That aside, I'm not in favor of this proposal.

We don't have a strong list of examples by projects that drive this right now, and I'm not sure where this will pan out to be a problem. alert isn't really good UX in any way ;-)

I'm mostly concerned that at the point where a fluent author needs to control white-space, they need to understand the semantics of how to break out of normalization, and I find that hard to convey. I say that while writing some sort of markdown right now, which is, well, only some sort of markdown. Also, in my own experience with md I don't really control white-space; I play, preview and edit until it's somewhat bearable.

stasm commented 6 years ago

> This is both backwards and forwards incompatible, right?

Or neither? If we only change the behavior of the runtime, the syntax would remain the same.

Or, we might want to introduce a new syntax for opting out of the default normalization.

> We don't have a strong list of examples by projects that drive this right now, and I'm not sure where this will pan out to be a problem. alert isn't really good UX in any way ;-)

That's because we're mostly an HTML-first organization. As soon as we step into other territories (Rust, C# and game engines, chat interfaces), the problem of translations being multiline where localizers only intended to format them nicely to fit on the screen becomes real and pressing.

Pike commented 6 years ago

As far as compat goes, we should read this as "can I use old files with new fluent and vice versa". Parsing is a step there, but restricting it to parsing doesn't represent the practical concerns which we need to address.

Pike commented 6 years ago

Different train of thought:

Should we expect multi-line text in scenarios where we don't expect multi-line rendering? Sure, one can do

a_dog = 
  A
  dog

but I don't think we need to address that as a problem. OTOH,

wall_of_text = 
  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

is unlikely to render correctly in the target platform, right?

stasm commented 6 years ago

Sorry, I didn't get your last comment. Would you mind trying to rephrase?

Pike commented 6 years ago

After talking with flod a bit, let's put this in a different way.

In APIs where newlines are significant we might still want to create newlines. I think the current proposal is to use blank lines to indicate that. That would make a three-line string look like this:

wall-of-text = 
  Lorem ipsum dolor sit amet, consectetur adipiscing elit,

  sed do eiusmod tempor incididunt ut labore et dolore

  magna aliqua.

The main drawback here is that in an HTML scenario, this still renders as a single line.
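One way to read the blank-line rule is sketched below in Python. It rests on assumptions: the function name `blank_lines_to_newlines` is hypothetical, and the exact rule (single line breaks become spaces, runs of blank lines become one newline) is an interpretation of this comment rather than anything the Fluent spec defines.

```python
import re


def blank_lines_to_newlines(value: str) -> str:
    """Join wrapped lines with spaces; turn each run of blank lines into
    exactly one newline in the output."""
    # A paragraph break is a newline followed by one or more blank lines
    # (lines that are empty or contain only spaces and tabs).
    paragraphs = re.split(r"\n(?:[ \t]*\n)+", value)
    # Inside a paragraph, any run of whitespace collapses to one space.
    return "\n".join(" ".join(p.split()) for p in paragraphs)


wall_of_text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit,\n"
    "\n"
    "sed do eiusmod tempor incididunt ut labore et dolore\n"
    "\n"
    "magna aliqua."
)
normalized = blank_lines_to_newlines(wall_of_text)
```

Under this reading, `normalized` contains exactly three lines, while a value that is merely wrapped, such as `"A\ndog"`, comes out as the single line `"A dog"`.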

I don't think that the mismatching expectations on how a string is rendered between fluent source and target application can be solved the way that it's proposed here.

I also don't like the visuals of an actual multi-line string.

I think that we should use semantic comments instead, and signal the white-space significance from the developer to the localization tool, with project-wide defaults at some point.

The proposal here would let the individual localization of a particular string drive the white-space layout, and I don't think that's good for developer experience either.

stasm commented 6 years ago

For some comic relief, check out the SO answer on how to write multiline strings in YAML.

It's off-topic but it should serve as a warning that even with best intentions it's easy to create systems which users find very complex.

stasm commented 6 years ago

> The main drawback here is that in an HTML scenario, this still renders as a single line.

I think I'm starting to understand where our thinking differs: you seem to think about the end of the formatting pipeline, which is rendering, whereas my intent was to define the behavior of the bindings. Think Localization.formatValue in fluent-dom. The environment still controls the rendering which is totally fine in my book.

Then, there's parsing—do I understand correctly that we both agree to preserve as much of the message layout in the AST as possible?

> I think that we should use semantic comments instead, and signal the white-space significance from the developer to the localization tool, with project-wide defaults at some point.

I agree that semantic comments will be very helpful here. We'll need them anyway to make tools, and ultimately localizers, aware that a given translation may or may not use custom HTML. It's not only about whitespace normalization: depending on whether the message is used in HTML or through raw JS, a custom <em/> might be OK or not.

I think I've had enough time to digest the feedback. My goal is to design something robust, modular and extensible. Thus, I propose a new plan:

For instance, a hypothetical fluent-cli bindings module for translating CLI apps could use the fluent-normalize-markdown module to normalize all translations. Files used by fluent-cli should have a @format markdown semantic comment (resource-wide) to let tools know what kind of formatting is going to be performed on the translations.

In another example (I'm not saying we should do this; it's just an example), fluent-dom could run formatValue through a fluent-normalize-singleline module. Messages formatted imperatively from the JS code would then have their contents normalized to a single line, regardless of how many line breaks they had in the Fluent file. These messages would need to have a @format singleline semantic comment attached to them, or to their group or resource.

This approach leaves the normalization up to the consumers by sticking to the principle of least power. By storing the newlines in the AST, we allow the consumers to decide if they want to handle them in any particular manner. At the same time, this proposal emphasizes the importance of semantic comments (#139). Developers and tool authors will need to rely on them anyway to convey information about the target environment in which translations will be displayed. The environment may define not only its own whitespace normalization, but also which features are even allowed in translations.
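The layering described in this comment could look roughly like the Python sketch below. Everything in it is hypothetical: the `Bindings` class, the normalizer functions and the format names only mirror the `fluent-normalize-singleline` and `@format singleline` examples above, and a plain dictionary lookup stands in for the low-level formatting step.

```python
from typing import Callable, Dict

# A normalizer is any function from a formatted string to a string.
Normalizer = Callable[[str], str]


def normalize_singleline(value: str) -> str:
    """Hypothetical stand-in for a fluent-normalize-singleline module."""
    return " ".join(value.split())


def normalize_verbatim(value: str) -> str:
    """Leave the formatted value untouched."""
    return value


# Assumed mapping from a resource's @format annotation to a normalizer.
NORMALIZERS: Dict[str, Normalizer] = {
    "singleline": normalize_singleline,
    "verbatim": normalize_verbatim,
}


class Bindings:
    """Hypothetical bindings layer: normalization happens here, not in
    the low-level formatter."""

    def __init__(self, messages: Dict[str, str], format_name: str) -> None:
        self._messages = messages
        self._normalize = NORMALIZERS[format_name]

    def format_value(self, message_id: str) -> str:
        # Stand-in for low-level formatting, which keeps all newlines.
        raw = self._messages[message_id]
        return self._normalize(raw)


messages = {"a-dog": "A\ndog"}
single = Bindings(messages, "singleline").format_value("a-dog")
kept = Bindings(messages, "verbatim").format_value("a-dog")
```

Here `single` is `"A dog"` and `kept` still contains the newline. The stored message is identical in both cases; only the consumer's choice of normalizer differs.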

stasm commented 6 years ago

> I think I'm starting to understand where our thinking differs: you seem to think about the end of the formatting pipeline, which is rendering, whereas my intent was to define the behavior of the bindings.

A note to future self: this was an important realization, as it made me understand that in neither case does the multiline normalization belong to the low-level formatting, i.e. MessageContext.format.

stasm commented 6 years ago

Closing this as WONTFIX as per my above comments.