locize / xliff

xliff2js and js2xliff converter xliff utils
MIT License
80 stars 37 forks source link

Weird XLIFF parsing issue #18

Closed villelahdenvuo closed 5 years ago

villelahdenvuo commented 5 years ago

I'm trying to parse an XLIFF 1.2 file using the xliff12ToJs function, but for some reason it cuts out the last line of a multiline html translation if the ending target tag is on the same line. Here is a minimal repro: https://runkit.com/embed/6ahusa56f2r5

If I add a linebreak before the </target> it parses correctly.

adrai commented 5 years ago

Me or @probertson will look into it...

villelahdenvuo commented 5 years ago

Great, for now I'm relying on a regex to add linebreaks before parsing it.

adrai commented 5 years ago

If I add a linebreak before the </target> it parses correctly.

also before </source>

villelahdenvuo commented 5 years ago

Yes, both elements fail to parse correctly, so it may be an issue with the parser library instead.

adrai commented 5 years ago

I know, why this happens, but I don't know if it's a feature or a bug... It's this line: https://github.com/locize/xliff/blob/master/xml-js/xmlToObject.js#L16 I hope @probertson can help.

villelahdenvuo commented 5 years ago

If it's a feature it's pretty bad because it's causing me to lose my data when converting formats. :smile: I can look into providing a PR for a fix, but we need to see what problem is that code solving.

probertson commented 5 years ago

Sorry for not joining the conversation sooner -- I was asleep before (time zones 😄) but I'll take a look now.

probertson commented 5 years ago

You're right @adrai, that line is causing the problem:

return valueElement.text.indexOf('\n') === -1 ? valueElement.text : valueElement.text.substr(0, valueElement.text.lastIndexOf('\n'));

I added that line because of a quirk with the tests. It was a weird case where the reader and writer didn't round-trip correctly in some cases.

The short answer is that there are two options:

  1. we can remove the conditional in that line, and it will fix the issue as long as we take out the whitespace in some of the test fixtures (so for those fixtures, the XML will just be one long string instead of formatted on multiple lines).
  2. I can improve the logic of that conditional so it works in the case where the text is multiline and the closing tag is on the same line as the last line of text.

Here's a long-ish explanation, but to be clear I'm not trying to justify this behavior or say it's correct -- just explain this "feature."

That line was to handle an edge case with the The way the parsing library handles whitespace, it either treats all whitespace as significant or no whitespace as significant. That makes sense, but it causes a problem when an element contains multiple children, specifically:

For example:

<source>
  <ph id="name" ctype="x-i18next-param">{{name}}</ph> is 
  <ph id="how" ctype="x-i18next-param">{{how}}</ph>
</source>

There is a space after the word "is" at the end of the second line, and that space is significant whitespace (the string it represents is "{{name}} is {{how}}"). However, all the other space is not significant whitespace.

In the real world I would probably write out unformatted xml (without line breaks and indentation) so it wouldn't be an issue -- all whitespace would be significant. In this case, to make the test fixture easier to read I left it formatted, and tried to handle the case where there is significant whitespace followed by non-significant whitespace.

I mentioned the solution ideas above. I will start working on approach 2 (fixing the bug), but if you have a strong preference for 1 let me know and I'll remove the conditional and update the test fixtures instead.

probertson commented 5 years ago

Another solution would be if the parsing library allowed fine-grained control over indentation and line breaks, but that's probably overkill for most cases since only humans care about readable XML, not machines 😄

villelahdenvuo commented 5 years ago

I do not personally like code that is for tests outside of the test folder, but I understand maintaining the formatting of the fixtures is nice. Perhaps the issue could be handled in the tests, for example minifying the xml (non-sensitive spaces) before the comparison?

However I'm fine with solution 2 as long as it's robust enough. I'm sure you can come up with something better than my current regex hack. 😄

villelahdenvuo commented 5 years ago

The problem in my case is that I use a tool called xliffmerge to copy updated translate units to the locale files and it removes any linebreak at the end of the elements, so even if I had a linebreak after adding new translations it would remove them and then lose the last line of the translation.

adrai commented 5 years ago

fixed in v4.1.3 thanks to the awesome help of @probertson ;-)