perpetualKid / GetText.NET

A .NET Standard cross-platform implementation of GNU Gettext
Other
34 stars 11 forks source link

Escaped strings are not correctly handled #25

Closed andy-noisyduck closed 3 years ago

andy-noisyduck commented 3 years ago

Escaped strings in the translated text are not decoded back to their original un escaped string, and are returned from GetString as is E.g. \' does not become '. It also looks like strings are not escaped when generating POT files from the extractors.

From the spec:

Each of untranslated-string and translated-string respects the C syntax for a character string, including the surrounding quotes and embedded backslashed escape sequences.

I am not using the extractors so have not tested in detail, but it looks like escaping was previously implemented and removed in https://github.com/perpetualKid/GetText.NET/commit/85eb8367d6b0f93eeeff059bd46ed0b80ecff5e5 .

I have not tested the same behaviour in NGettext - we much prefer this fork.

perpetualKid commented 3 years ago

@andy-noisyduck I'm bit confused when you say

"It also looks like strings are not escaped when generating POT files from the extractors"

and

"I am not using the extractors"

i.e. referenced change was only targeting the extractor, not anything in catalog/GetString*.

do you have a sample string which you are trying to extract and/or translate? Thanks!

andy-noisyduck commented 3 years ago

I think there might be 2 closely related issues here. Maybe I should have made 2 tickets to be clearer.

First... translations with escaped strings in .mo files are not being unescaped when parsed. I.e. if you have an escaped single quote \' it will be returned including the backslash, not just the single quote. If you have an escaped newline \n in your translation string it won't be converted to an actual new-line etc.

The GetText spec indicates that any C style escape string is valid in a translation string within a .po file. My understanding was that strings in .mo files follow the same format and should thus be un-escaped either when parsing or when returned through GetString - although I concede there is scope for ambiguity here, as the .mo file format spec just refers to them as "translation strings" and doesn't explicitly document their format.

The existing tooling we use escapes new-line in .mo files, so I'm leaning towards unescaping them being the correct behaviour.

Secondly, a similar issue but on the input rather than output side. It looks like the extractors only escape double quotes now, and don't correctly handle new-lines, tabs etc. I wanted to caveat that we don't really use the extractors, and was something I only noticed when going through the source. It's possible I missed some handling of escape stings elsewhere.

perpetualKid commented 3 years ago

as for the second issue, the extractor escapes non-printable characters here:

https://github.com/perpetualKid/GetText.NET/blob/146090a0f01254a666f3c828e43a50980d8fe86a/src/GetText.Extractor/Engine/ParserBase.cs#L148-L151

however that doesn't work for double quotes, hence the fix in 85eb8367d6b0f93eeeff059bd46ed0b80ecff5e5. The removed code was stale copy&paste leftover from an xgettext implementation.

This should work correctly, but if you have some sample where this fails, please provide!

For the first issues, looking into it. What tooling are you using? The scenario is primarily tested with the extractor and PoEdit, from other issues I've seen that some of the older GNU commandline utilities are more peculiar and not as relaxed as Windows-based tools. As far reasonable I'm open to adapt/update.

andy-noisyduck commented 3 years ago

For the first issues, looking into it. What tooling are you using? The scenario is primarily tested with the extractor and PoEdit, from other issues I've seen that some of the older GNU commandline utilities are more peculiar and not as relaxed as Windows-based tools. As far reasonable I'm open to adapt/update.

We manage translations through Phrase.com, and we use it to generate our .mo files.

It's unclear where the problem is, whether its the generating of the .mo file or the parsing of them. Should the sequence \n in an .mo file be a newline or the literal \n? The somewhat ambiguous GetText spec doesn't really help here.

We can work around it in most cases by not having escape strings in the source content (.pot files), but this doesn't work for newlines or double quotes which must always be escaped. Right now we are unescaping them manually after we call GetString but that feels a bit clunky.

perpetualKid commented 3 years ago

this really gets tricky, and the GetText spec isn't of particular help either, but it seems that Phrase is handling escaped strings differently at least.

assuming a string like this

string test = catalog.GetString("This is a long string with \nline break");

extracting this, ie using xgettext, PoEdit, or GetText.Extractor, will become like this in the pot template file

#: Form1.cs:34
#, csharp-format
msgid ""
"This is a long string with \n"
"line break"
msgstr ""

it's mentioned as convention but not required in the GetText documentation that line breaks in the strings, will also induce a line break in the msgid format.

#: Form1.cs:34
#, csharp-format
msgid ""
"This is a long string with \nline break"
msgstr ""

in PoEdit, looks like this

image

So I've created an test account at Phrase.com, and uploaded the generated pot file there - and oops - the line break is gone?!

image

Now the even stranger part. While a generated po-translation file from phrase looks ok (except it does not break the strings itself, but as mentioned above that's optional in the documentation, it does escape the strings when compiling to mo:

image image

vs. (using PoEdit)

image image

with that escaping, the translation will not even be found, because the MessageId (original string to look for) was "This is a long string with \nline break" but in the compiled catalog only find "This is a long string with \nline break".

While I originally would have no opinion which behavior is correct, I think the phrase.com output is just wrong by escaping the strings both for the message id (as it breaks the whole lookup) as well the translation as it breaks formatting behavior.

If you have a business relationship with phrase, I'd ask you to open an support issue with them, happy to join the conversation if needed. Otherwise I see limited option here to update the code, if the same input produces different output from different translation engines, how should GetText.Net else handle this?

andy-noisyduck commented 3 years ago

Thanks for investigating this.

The missing linebreak in Phrase is actually correct. The list on the left is a preview list of all your translated strings, similar to how Poedit displays the text without breaks in the list at the top. It's probably just not obvious with only one string. Phrase then shows the full string (including linebreaks) in the middle section when translating. I'm not sure why your example doesn't have the text here - it might be that Phrase's default mode of operation is to use keys for strings and not the source text (as Phrase supports a bunch of other non-GetText file formats where this makes sense), and the option for this wasn't set when parsing the po/pot file.

That said, Phrase is clearly mishandling .mo file output.

I probably should have done what you did, and compared the output from Phrase with the GetText command line tools. Whilst the GetText spec is a bit vague on .mo file format, using the msgfmt tool seems a reasonable choice here as an authoritative answer. When writing the binary .mo file format msgfmt takes any escaped strings in the .po source and converts them to their binary equivalent (i..e the 2 character literal backslash + n \n becomes just the newline character). I'm going to assume that's the correct behaviour, and raise this as a bug with Phrase.

That said, what is also strange is that msgfmt doesn't correctly handle the escape sequence for a single quote. A single quote ' is a valid escape sequence when used as \', and although the GetText docs explicitly mentioning this as valid, xgettext will unescape them when parsing. If the 2 char sequence \' is in the text, msgfmt will actually return invalid control sequence when reading the po file. When using the 4-char hex escape sequence for the the quote \x27 it parses it correctly. That seems like a bug in the GetText tooling to me.

This issue can probably be closed?

perpetualKid commented 3 years ago

thanks for confirming my research! I'll do some more testing around the \' sequence, but if this is purely affecting msgfmt I assume (or hope :) ) GetText.NET is handling it correctly.

In the meanwhile closing this issue.