smhg / gettext-parser

Parse and compile gettext po and mo files, nothing more, nothing less
MIT License
158 stars 43 forks source link

Tolerate additional entries in obsolete messages #79

Open mathieucaroff opened 1 year ago

mathieucaroff commented 1 year ago

I have a .po file whose translations are handled with Poedit and I get obsolete entries in my .po files. The parser fails when it encounters them.

Reproduction

en.po

#, fuzzy
#~| msgid "Latest version"
#~ msgid "Latest Version"
#~ msgstr "Latest version"

Parsing the content of en.po with gettext-parser (running gettextParser.po.parse(content)) produces the following error:

node_modules\gettext-parser\lib\poparser.js:224
            throw err;
            ^

SyntaxError: Error parsing PO data: Invalid key name "|" at line 2. This can be caused by an unescaped quote character in a msgid or msgstr value.

Expected behavior

The parser should not crash. It should either ignore the entry, or properly parse it to data.

Misc

I believe this issue was introduced by the following PR: https://github.com/smhg/gettext-parser/pull/64 -- released in version 5.0.0. 4.2.0 is the last version before 5.0.0.

Have a good day !

smhg commented 1 year ago

Thank you for reporting this! After some initial investigation any additional character after an obsolete marker (#~) seems to lead to this error. This is unrelated to the fuzzy flag. So currently only basic obsolete messages are supported. Obsolete messages currently cannot contain previous messages (|), references (:),... It would definitely be good to support this. A PR would be very much appreciated.

smhg commented 1 year ago

@KapitanOczywisty would you be able to have a look at this?

KapitanOczywisty commented 1 year ago

So currently only basic obsolete messages are supported. Obsolete messages currently cannot contain previous messages (|), references (:),...

@smhg Yes, they can, by using #| and #:. I don't know where #~| came from, but It's not included anywhere in gettext documentation. All comments are connected to their msgid, so if that msgid is obsolete there is no need to mark comments as obsolete. Reversing that: why not #~, fuzzy? You never see #~,, likely because It's popular enough that people seen it correctly used and implemented.

smhg commented 1 year ago

@KapitanOczywisty If I understand correctly #| means a previous translation and #~| is an obsolete previous translation? I can't tell whether that is actually usefull to programs, but moving from non-obsolete to obsolete just adds the ~ so it makes sense that's the result? And obsolete message parsing currently doesn't look at the 3rd character correctly I think?

KapitanOczywisty commented 1 year ago

If I understand correctly #| means a previous translation and #~| is an obsolete previous translation?

I only found that: https://github.com/winlibs/gettext/blob/24b0fe071cff79acf4a3904591b481328732ddc1/source/gettext-tools/src/po-lex.c#L903-L906

And obsolete message parsing currently doesn't look at the 3rd character correctly I think?

I thinks parsing ignores whitespace and searches for msgid keyword. I can make this work wit #~| though this creates some questions:

mathieucaroff commented 1 year ago

I believe the simplest, non-surprising behavior would be to treat as a regular comment, if that makes sense with regards to the way gettext-parser currently works.

smhg commented 1 year ago

Basically #~| should be treated as if it were #~#|, right? It should be listed like this one I think. These are already parsed out.

But looking a the code, this does not look straighforward. First this @KapitanOczywisty: do you remember why you grouped the obsolete state with the none state? Wouldn't it be better to separate those? Because different parts of the checks under those are applicable to different states, no?

KapitanOczywisty commented 1 year ago

do you remember why you grouped the obsolete state with the none state?

At the time I was not aware of #~| so only correct use would be with keywords e.g. #~ msgid. I can fix that, just we need to decide how #~| should be handled. I'd prefer to treat it as #|, which means parse and compilation would yield different results than input, but all other solutions have major problems with compatibility anyway.

smhg commented 1 year ago

As an example: if you prepend it to this line, it would appear on this line. Right? If not, can you give an example when it gets ambiguous?

KapitanOczywisty commented 1 year ago

In my suggestion, this file:

#, fuzzy
#~| msgid "Latest version"
#~ msgid "Latest Version"
#~ msgstr "Latest version"

after parse would become:

{
    // ...
    "obsolete": {
        "": {
            "Latest Version": {
                "msgid": "Latest Version",
                "comments": {
                    "flag": "fuzzy",
                    "previous": "msgid \"Latest version\""
                },
                "msgstr": [
                    "Latest version"
                ]
            }
        }
    }
}

and then after compilation:

#, fuzzy
#| msgid "Latest version"
#~ msgid "Latest Version"
#~ msgstr "Latest version"