projectfluent / fluent

Fluent — planning, spec and documentation
https://projectfluent.org
Apache License 2.0

Integrate small HTML subset into fluent syntax #237

Open Swatinem opened 5 years ago

Swatinem commented 5 years ago

First off, I really love the fluent syntax so far (even though I have not used it in production yet) compared to MessageFormat.

I also really like the idea of DOM Overlays, but I would like to integrate this more deeply into the syntax itself.

But IMO the way that React Overlays are currently implemented is a bit strange, since it parses the final translated message on each usage of <Localized/>.

I think having a small subset of HTML integrated into the syntax can also make this a lot easier to integrate into component systems that are not browser based, or other use cases such as pre-compilation (I am the maintainer of intl-codegen, for which I would love to integrate a DOM Overlay-like feature).

Maybe related to #96 or #175?

Pike commented 5 years ago

Interesting idea, though I guess the small subset is the interesting part to nail down. The way that would make sense to me is a small subset of SGML or XML parsing, so something that's completely opaque to tag names.

I could see a benefit from doing MarkupStartTag and MarkupEndTag nodes, as a post-processing to TextElement. In the ebnf, they'd be additional elements in PatternElement.
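A minimal sketch of that post-processing step, assuming a hypothetical tag grammar of bare `<name>` and `</name>` with no attributes. The node classes below borrow the names from the comment above but are illustrative only; they are not part of any Fluent parser or spec.

```python
import re
from dataclasses import dataclass


@dataclass
class Text:
    value: str


@dataclass
class MarkupStartTag:
    name: str


@dataclass
class MarkupEndTag:
    name: str


# Assumed grammar: <name> or </name>, opaque to the tag name, no attributes.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)>")


def split_text_element(value):
    """Split the value of one TextElement into text and markup nodes."""
    nodes = []
    pos = 0
    for match in TAG.finditer(value):
        if match.start() > pos:
            nodes.append(Text(value[pos:match.start()]))
        node_type = MarkupEndTag if match.group(1) else MarkupStartTag
        nodes.append(node_type(match.group(2)))
        pos = match.end()
    if pos < len(value):
        nodes.append(Text(value[pos:]))
    return nodes
```

Because the split only looks at a single TextElement, it never needs to know about placeables or variants around it.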

I'd go as far as to say that we should only allow start and end tags in the same pattern. Like, currently, one can do

omg = This is <b>{ $num ->
 *[other] bad</b>, <em>I
  }{ $num ->
  [one] guess this would be bad?
 *[other]</em> think
  }. Don't you?

The exact semantics depend on a few questions, I think:

If all of these questions had "No" as an answer, this could quite nicely be implemented on the TextElement level. Which might just fix the escaping problems, too.

Making sure @zbraniecki and @spookylukey are on this.

Swatinem commented 5 years ago

What I had in mind by small subset was:

So essentially some xml/jsx like syntax, completely opaque to tag names so it plays nicely with the React Overlays feature.

The parser should make sure that the nesting is well-formed. I am not sure whether attributes should support nesting of other elements. It might be a good idea if the target is a non-browser component system such as React, but a bad idea when the target is the web.
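As a rough illustration of the well-formedness check, here is a sketch that validates one piece of pattern text with a stack. It assumes attribute-free tags and a hypothetical helper name, `is_well_formed`; it is not taken from any existing Fluent implementation.

```python
import re

# Assumed grammar: <name> or </name>, opaque to the tag name, no attributes.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)>")


def is_well_formed(text):
    """Return True if every start tag in text is closed in the right order."""
    open_tags = []
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2)
        if not closing:
            open_tags.append(name)
        elif not open_tags or open_tags.pop() != name:
            # An end tag with no matching start tag, or crossed nesting.
            return False
    # Any tag still open at the end of the text is an error too.
    return not open_tags
```

Run against each variant on its own, a check like this would reject the `bad</b>, <em>I` variant from the earlier example, since its end tag has no start tag in the same pattern.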

spookylukey commented 5 years ago

Thanks @Pike for letting me know about this.

I'm using Fluent in two main contexts - django-ftl and elm-fluent (both of which I wrote/am writing). For my own purposes, I don't think this proposal would help. On the one hand, there are many cases where I want to generate plain text. In this case, you should be able to have a message like:

message = This is an article about the <blink> tag

and nothing funny should happen to it - it is just normal text. That text should be escaped if we happen to be using it in HTML context, but that is not the business of Fluent.
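To make the division of labour concrete, here is a small sketch in which the formatted message stays plain text and the caller escapes it only when embedding it in HTML. The function name `embed_in_html` is hypothetical; `html.escape` is from the Python standard library.

```python
import html


def embed_in_html(formatted_message):
    """Escape an already formatted plain-text message for use in HTML."""
    return html.escape(formatted_message)


# The message itself is untouched plain text.
message = "This is an article about the <blink> tag"
escaped = embed_in_html(message)
```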

On the other hand, I sometimes want to output HTML, and in those cases need access to the full range of possible HTML constructs, not a limited subset. For me, I imagine having a limited amount of builtin HTML support would most likely just make these two things more complicated.

For django-ftl, to support these two types of output, I'm using my own branch of python-fluent, using a more 'escaping' mechanism (see django-ftl docs, outdated PR, discussion).

For elm-fluent, I'm using a very different strategy. This package compiles FTL to Elm files, so it may have some relevance for your intl-codegen work @Swatinem. As in django-ftl, messages are assumed to be plain text by default, and need -html appended to the message name to mark that they are HTML. In elm-fluent, however, for HTML messages the compiler outputs functions that return Html msg values, which are tree-like structures. Arbitrary HTML can be embedded into the messages themselves, though authors are encouraged to use the bare minimum, and add additional needed attributes using another mechanism (see docs).

Doing this involves elm-fluent being able to parse the messages as HTML after they have been parsed as FTL, but before rendering out. This is implemented here for elm-fluent, and it is a bit tricky/hacky, but it does work.

Overview for this method:

For messages that are marked as HTML, we take the FTL Pattern element, copy it, and then replace all placeables (or anything that isn't a TextElement) with a bit of text that contains a marker string. The replaced node is put into a dictionary where we can look it up later. Now, the whole pattern is a series of text elements, and we can concatenate it to a single string and parse it as HTML into an HTML tree structure. Then we go through the tree structure and find the bits of marker text (which might appear in attribute values or in HTML text nodes). Every time we find a marker, we now substitute back in the correct non-text elements we removed earlier, after recursively applying the same compilation strategy to render those elements.
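The overview above can be sketched in Python roughly as follows. This is not the elm-fluent code: the `Placeable` class, the `__FTL_n__` marker format and the tuple-based tree are all assumptions made for illustration, and the recursive compilation of the restored nodes is left out. Void elements such as a bare `<br>` are not handled.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Placeable:
    expr: str  # stand-in for any non-text FTL node


MARKER = re.compile(r"__FTL_(\d+)__")  # assumed marker format


class TreeBuilder(HTMLParser):
    """Builds (tag, attrs, children) tuples and swaps markers back in."""

    def __init__(self, removed):
        super().__init__(convert_charrefs=True)
        self.removed = removed
        self.root = ("root", {}, [])
        self.stack = [self.root]

    def restore(self, text):
        # Split text on markers and substitute the removed nodes back in.
        out, pos = [], 0
        for match in MARKER.finditer(text):
            if match.start() > pos:
                out.append(text[pos:match.start()])
            out.append(self.removed[int(match.group(1))])
            pos = match.end()
        if pos < len(text):
            out.append(text[pos:])
        return out

    def handle_starttag(self, tag, attrs):
        node = (tag, {k: self.restore(v or "") for k, v in attrs}, [])
        self.stack[-1][2].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) < 2 or self.stack[-1][0] != tag:
            raise ValueError(f"unexpected end tag: {tag}")
        self.stack.pop()

    def handle_data(self, data):
        self.stack[-1][2].extend(self.restore(data))


def compile_html_pattern(pattern):
    """pattern is a list of str (text) and Placeable (anything else)."""
    removed, parts = {}, []
    for element in pattern:
        if isinstance(element, str):
            parts.append(element)
        else:
            index = len(removed)
            removed[index] = element  # look it up again after parsing
            parts.append(f"__FTL_{index}__")
    builder = TreeBuilder(removed)
    builder.feed("".join(parts))
    builder.close()
    if len(builder.stack) != 1:
        raise ValueError("unclosed tag")
    return builder.root[2]
```

For example, the pattern `['Hello <a href="', Placeable("$url"), '">', Placeable("$name"), '</a>!']` yields a text node, an `a` element whose `href` attribute and child each hold the original placeable, and a trailing text node.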

This relies on some things e.g. well-formed-ness, and matching opening and closing tags within placeables etc. Since I'm doing this at compile time in elm-fluent, there isn't a runtime overhead associated with HTML parsing etc. - the output of the code I linked is not the final rendered message, but Elm code that will generate the final rendered message.

There is obviously a bit more to it than that, don't have time to write more now but I'll happily answer any questions about it.

Swatinem commented 5 years ago

Well, I had something similar in mind: parse the string with both fluent / messageformat (as it currently stands, I only support messageformat in intl-codegen, but I would love to support both syntaxes) and an HTML parser, and then somehow combine both ASTs into one… I think that is certainly possible, just a lot more work than if you only had one AST to begin with :-)

And since I want to do all this at compile time anyway, the overhead would be minimal.

Apart from the implementation details themselves, I think another big benefit would be that you have one official documentation on how to write your fluent files, instead of diverging and incompatible implementations in fluent dom, fluent react, elm-fluent, intl-codegen, and others (I could imagine fluent also being used together with something like azul).

I already see this problem with messageformat, where each library (such as my ideas for intl-codegen 2) bolts on its own extensions to the syntax, so the syntax the translators can actually use depends on the library that you use in your code, which gets even worse when you use different implementations in different parts of your code (server, web, native mobile).

spookylukey commented 5 years ago

To add briefly to my last comment - an FTL parser that was able to ensure the HTML well-formedness rules that Pike mentioned would be useful for all my use cases too, but only optionally, because of the need for plain text cases, and so it feels like doing this at the Fluent grammar level would be very complicated.

Swatinem commented 5 years ago

"but only optionally, because of the need for plain text cases"

Fluent has escapes/string literals for that, like {"a literal string <where> <tags> <are> <not> parsed"}