projectfluent / fluent.js

JavaScript implementation of Project Fluent
https://projectfluent.org/
Apache License 2.0

HTML parser used for markup in translations breaks self-closing elements #188

Open stasm opened 6 years ago

stasm commented 6 years ago

Both fluent-dom and fluent-react currently use the <template> element to safely parse markup in translations. The <template> element internally uses the HTML parser; this is similar in effect to calling DOMParser.parseFromString with the text/html type.

The HTML parser has rather strict rules for void elements, unlike the XML parser. Only elements known to be void (e.g. input) may appear in the self-closing form (<element/>); on non-void elements the trailing slash is ignored, so they're parsed as if they were never closed.
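For reference, the parser's behavior hinges on HTML's fixed list of void elements. A small sketch (the element list comes from the HTML spec; the `isVoid` helper is illustrative, not part of fluent-dom):

```javascript
// The elements the HTML parser treats as void, per the HTML spec.
// Only these may appear in self-closing form; on any other element
// the parser silently drops the trailing slash.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "param", "source", "track", "wbr",
]);

// Hypothetical helper, not part of fluent-dom.
function isVoid(localName) {
  return VOID_ELEMENTS.has(localName.toLowerCase());
}
```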

This means that the following markup:

A <span/> B

…is parsed as:

A <span> B</span>

The consequence for overlays is that localizers can't use the self-closing form with non-void elements. Admittedly, that form isn't valid HTML, but the difference is rather subtle:

# This inserts "!" into the span.
hello = Hello, <span data-l10n-name="user" />!

# This works as expected.
hello = Hello, <span data-l10n-name="user"></span>!
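One could imagine normalizing translations before parsing, by expanding the self-closing form of non-void elements into explicit open/close pairs. This is purely a hypothetical sketch, not something fluent-dom or fluent-react actually does:

```javascript
// Hypothetical pre-processing pass (not part of fluent-dom):
// rewrite <span/> to <span></span> so the HTML parser doesn't
// swallow the text that follows. Void elements are left alone.
const VOID = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "param", "source", "track", "wbr",
]);

function expandSelfClosing(markup) {
  return markup.replace(
    // tag name, then attributes (quote-aware), then "/>"
    /<([a-zA-Z][a-zA-Z0-9-]*)((?:[^"'>]|"[^"]*"|'[^']*')*?)\/>/g,
    (match, name, attrs) =>
      VOID.has(name.toLowerCase())
        ? match
        : `<${name}${attrs}></${name}>`
  );
}
```

For example, `expandSelfClosing('Signed in as <username/>.')` yields `'Signed in as <username></username>.'`, while `<img id="foo" />` passes through untouched.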
stasm commented 6 years ago

One way to fix this would be to use DOMParser with the application/xml type for parsing markup in translations. XML has stricter parsing rules in general but it also allows any elements to be self-closing and it still produces well-formed documents.

We'd need to add the DTD for at least some HTML entities that localizers might expect to work. The full DTD is huge, clocking in at over 11 KB. Perhaps we could cherry-pick and support only whitespace characters, like &nbsp; and &shy;? In any case, it would be useful to strictly define the list of entities that we support.
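If we went the XML route, one way to support a short allow-list of entities would be to prepend an internal DTD subset declaring just those names. A rough sketch, with an illustrative (not decided) entity list:

```javascript
// Hand-picked entity list (illustrative only, not a decided API).
// Maps entity names to Unicode code points.
const SUPPORTED_ENTITIES = {
  nbsp: 0x00a0,   // no-break space
  shy: 0x00ad,    // soft hyphen
  hellip: 0x2026, // horizontal ellipsis
};

// Build an internal DTD subset that an XML parser (e.g. DOMParser
// with "application/xml") would accept in front of the translation.
function internalSubset(entities) {
  const decls = Object.entries(entities)
    .map(([name, cp]) => `<!ENTITY ${name} "&#${cp};">`)
    .join("");
  return `<!DOCTYPE translation [${decls}]>`;
}
```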

flodolo commented 6 years ago

I think the practical impact of not supporting this is very limited.

From my experience reviewing sign-offs, I don't think I've ever seen localizers change the markup used in the reference string. If it's <span></span>, it remains unchanged.

In some cases I've seen <img id="foo" /> losing the space before the closing slash, but that's as far as changes go.

cvle commented 6 years ago

I think I ran into this. I tried to do the following:

<Localized
  id="comments-userBoxAuthenticated-signedInAs"
  username={<Username />}
>
  <Typography variant="bodyCopy" container="div"> 
    {"Signed in as <username />."}
  </Typography>
</Localized>
comments-userBoxAuthenticated-signedInAs =
  Signed in as <username/>.

But the trailing . is swallowed, while this works:

<Localized
  id="comments-userBoxAuthenticated-signedInAs"
  username={<Username />}
>
  <Typography variant="bodyCopy" container="div"> 
    {"Signed in as <username></username>."}
  </Typography>
</Localized>
comments-userBoxAuthenticated-signedInAs =
  Signed in as <username></username>.
stasm commented 6 years ago

Yes, this looks familiar. I wrote a test to document the current (broken) behavior: https://github.com/projectfluent/fluent.js/blob/6e1ab3a4982ef6d5f385c77d15a16b27abc82a2e/fluent-react/test/localized_overlay_test.js#L618-L643

Do you think that using <username></username> is an acceptable work-around?

cvle commented 6 years ago

Do you think that using <username></username> is an acceptable work-around?

It's not pretty, and it goes against user expectations. The work-around is simple and it works, but I'd like to see it as a temporary solution only.

stasm commented 6 years ago

I'm not fond of it either. Perhaps using DOMParser and treating translations as XML rather than HTML would work better? I'd need to check the performance impact, too.

stasm commented 6 years ago

More importantly, <template> creates an inert DocumentFragment, which is also good for security.

stasm commented 6 years ago

Other solutions include using one of the available HTML-to-React parsers, like html-react-parser or react-html-parser. They add quite a lot to the size of fluent-react, however (20 KB and 70 KB, respectively).

I suspect they ship that much code in order to support HTML's many edge cases. There might be a limited subset of safe markup that we could write our own parser for. I'll be away for a few weeks, but I can take a closer look when I'm back.
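As a rough illustration of the "limited subset" idea (the names and behavior here are made up, not a proposal for fluent-react's actual parser), a tiny tokenizer could split a translation into text runs and allow-listed elements, treating everything else as literal text:

```javascript
// Purely illustrative sketch of a restricted-markup tokenizer:
// split the source into text runs and open/close tokens for
// allow-listed element names; anything else stays literal text.
function tokenize(source, allowed = new Set(["span", "em", "strong"])) {
  const tokens = [];
  const re = /<\/?([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>/g;
  let last = 0;
  let m;
  while ((m = re.exec(source)) !== null) {
    if (m.index > last) {
      tokens.push({ type: "text", value: source.slice(last, m.index) });
    }
    const name = m[1].toLowerCase();
    if (!allowed.has(name)) {
      // Unknown elements are kept as literal text, not parsed.
      tokens.push({ type: "text", value: m[0] });
    } else {
      tokens.push({ type: m[0][1] === "/" ? "close" : "open", name });
    }
    last = re.lastIndex;
  }
  if (last < source.length) {
    tokens.push({ type: "text", value: source.slice(last) });
  }
  return tokens;
}
```

Bounding the grammar like this is what would keep such a parser small compared to a full HTML parser, at the cost of rejecting (or passing through) everything outside the allow-list.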

nchevobbe commented 3 years ago

I'm not fond of it either. Perhaps using DOMParser and treating translations as XML rather than HTML would work better? I'd need to check the performance impact, too.

I think that would be a nice solution.

We're using fluent-react in Firefox DevTools, where the document can be an XHTML one; in that case, custom tags are ignored:

temp = document.createElement("template")
temp.innerHTML = "<a>test</a>|<custom>testCustom</custom>"
Array.from(temp.content.childNodes) // [ a, #text, #text ]

whereas with DOMParser it works as expected:

parser = new DOMParser()
doc = parser.parseFromString("<a>test</a>|<custom>testCustom</custom>", "text/html")
Array.from(doc.body.childNodes) // [ a, #text, custom ]

And sometimes we don't even have a document reference, which makes some panels fail to load when trying to update our version of fluent-react to 0.16.1.

Pike commented 3 years ago

Changing the parser logic can introduce subtle regressions; the HTML fix-up is probably different between a document and an element. And it needs a security review, as Stas mentioned.

I still like the lack of XML complexity in our DOM overlay stories.