nexB / source-inspector

Tools to inspect source code and code symbols

xgettext: rerun with UTF-8 encoding and/or properly process failures #14

Open armijnhemel opened 3 months ago

armijnhemel commented 3 months ago

Something to consider is to rerun xgettext with different parameters in case it fails. The xgettext manual says:

By default the input files are assumed to be in ASCII.

Sometimes this leads to incorrect results (or no results at all), and xgettext may need to be rerun with a different option. One example where xgettext fails is util-linux/fdisk.c from a recent BusyBox:

$ xgettext --omit-header --extract-all --no-wrap fdisk.c
xgettext: Non-ASCII string at fdisk.c:333.
          Please specify the source encoding through --from-code.

The culprit here is actually this sequence:

    "\x80" "Old Minix",        /* Minix 1.4a and earlier */

where xgettext thinks this might be some UTF-8 character (but, of course, it is not a valid sequence). No output file is generated in this case.

https://git.busybox.net/busybox/tree/util-linux/fdisk.c?h=1_35_stable
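
To illustrate why that string trips the check, here is a minimal sketch. It assumes the C escape `\x80` expands to the single byte 0x80 in the extracted string; the helper name `is_valid_utf8` is made up for this example and is not part of xgettext or source-inspector.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumption: the escape "\x80" becomes the raw byte 0x80, followed by "Old Minix".
literal = b"\x80" + b"Old Minix"

print(literal.isascii())            # False: 0x80 is outside the ASCII range
print(is_valid_utf8(literal))       # False: 0x80 is a continuation byte with no lead byte
print(is_valid_utf8(b"Old Minix"))  # True
```

So the string is neither ASCII nor valid UTF-8, which matches both the error without --from-code and the "invalid multibyte sequence" messages with it.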

Another example is the attached file (lineedit.c from BusyBox, zipped) where I have replaced a string on line 893.

$ xgettext --omit-header --extract-all --no-wrap lineedit.c
xgettext: Non-ASCII string at lineedit.c:893.
          Please specify the source encoding through --from-code.

and no output file will be created.

When using the --from-code parameter the string will not be correctly extracted, but an output file will be created:

$ xgettext --omit-header --extract-all --no-wrap --from-code=UTF-8 lineedit.c
lineedit.c:442: warning: internationalized messages should not contain the '\r' escape sequence
lineedit.c:893: warning: The following msgid contains non-ASCII characters.
                         This will cause problems to translators who use a character encoding
                         different from yours. Consider using a pure ASCII msgid instead.
                         ë
lineedit.c:893: invalid multibyte sequence
lineedit.c:893: invalid multibyte sequence
lineedit.c:893: invalid multibyte sequence
lineedit.c:893: invalid multibyte sequence

It is not ideal, but better than getting no data at all. This could use some refinement.
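
A rough sketch of the rerun idea, in Python. The function names (`run_xgettext`, `extract_with_retry`) and the injectable `runner` parameter are hypothetical; the flags are the ones from the commands above. It assumes xgettext exits with a non-zero status when it refuses to produce output.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Flags taken from the commands shown in this thread.
BASE_ARGS = ["xgettext", "--omit-header", "--extract-all", "--no-wrap"]


def run_xgettext(args: Sequence[str]) -> int:
    """Run the command and return its exit status (assumes xgettext is on PATH)."""
    return subprocess.run(list(args), capture_output=True).returncode


def extract_with_retry(
    source: str,
    output: str,
    runner: Callable[[Sequence[str]], int] = run_xgettext,
) -> Optional[List[str]]:
    """Try the default (ASCII) run first, then rerun with --from-code=UTF-8.

    Returns the argument list that succeeded, or None if every attempt failed.
    """
    attempts = [
        BASE_ARGS + [f"--output={output}", source],
        BASE_ARGS + ["--from-code=UTF-8", f"--output={output}", source],
    ]
    for args in attempts:
        if runner(args) == 0:
            return args
    return None
```

Calling `extract_with_retry("lineedit.c", "lineedit.po")` would run xgettext at most twice. The `runner` argument is only there so the retry logic can be exercised without xgettext installed.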

Please note that this isn't true for all languages according to the xgettext manual:

       --from-code=NAME
              encoding of input files (except for Python, Tcl, Glade)

pombredanne commented 3 months ago

Thanks! can you attach the problematic files? (with a URL so we can track where they are from)

pombredanne commented 3 months ago

Please note that this isn't true for all languages according to the xgettext manual:

What a mess :|

armijnhemel commented 3 months ago

Thanks! can you attach the problematic files? (with a URL so we can track where they are from)

lineedit.c.zip

I thought I had added it. The attached file is not the original, but slightly modified.

armijnhemel commented 3 months ago

Please note that this isn't true for all languages according to the xgettext manual:

What a mess :|

Not really, as these are (I guess) treated as UTF-8 by default?

armijnhemel commented 3 months ago

So, the solution for this seems to be to actually not pass --omit-header to xgettext.

armijnhemel commented 3 months ago

Of course, not every file that is non-ASCII will be in UTF-8, so it might be that some additional metrics are needed to find the right encoding to translate from. Alternatively, first try UTF-8, then Latin-1, then others. That will catch many instances.

If the file is not UTF-8, it might be that you would need to translate (encode/decode, etc.) the extracted string before further processing.
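
A minimal sketch of the "first UTF-8, then Latin-1" idea for decoding extracted bytes. The function name `decode_with_fallback` is made up for illustration, and a plain decode test like this is much cruder than real encoding detection.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes, encodings: Sequence[str] = ("utf-8", "latin-1")
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate encoding that decodes data."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# "ë" stored as Latin-1 is the single byte 0xEB, which is not valid UTF-8 on its own.
text, used = decode_with_fallback("ë".encode("latin-1"))
print(used, text)  # latin-1 ë
```

One caveat: Latin-1 maps every byte to a character, so it never fails to decode. Any encoding listed after it would never be tried with this approach, which is why additional metrics would be needed to tell Latin-1 apart from, say, ISO-8859-2.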

pombredanne commented 3 months ago

I have proper encoding detection here: https://github.com/nexB/typecode/blob/92feb7be3a87c1b541e7034c3f9797c96bc52305/src/typecode/magic2.py#L294, or something else here: https://github.com/nexB/scancode-toolkit/blob/c80e502c06639c18e2ea606d63f2ac09f89230c1/src/textcode/analysis.py#L251, that we could use at some point.

pombredanne commented 3 months ago

Actually, to get proper UTF-8 output, you need to fool xgettext with a Unicode character, as explained in https://bloomfield.online/posts/generate-utf-8-dictionaries-using-gettext/ by @peter-bloomfield ... but rather than adding an extra file, adding a dummy copyright header or similar with Unicode content seems to be enough. 🤪

xgettext --copyright-holder="ø" --extract-all --no-wrap --output=foo.po --from-code=UTF-8 ...

It will complain about non-UTF-8 content, but will keep on trucking.

armijnhemel commented 3 months ago

Actually to get a proper UTF-8 output, you need to fool xgettext with a unicode character as explained in https://bloomfield.online/posts/generate-utf-8-dictionaries-using-gettext/ by @peter-bloomfield ... but rather than to add an extra file, adding a dummy header copyright or similar with unicode content seems to be enough. 🤪

xgettext --copyright-holder="ø" --extract-all --no-wrap --output=foo.po --from-code=UTF-8 ...

it will complain from non-UTF-8 content, but will keep on trucking

Just dropping --omit-header would have been enough.

I ran the following two commands on the example I provided:

$ xgettext --copyright-holder="ø"  --extract-all --no-wrap --output=bar.po  --from-code=UTF-8 lineedit.c
$ xgettext  --extract-all --no-wrap --output=foo.po  --from-code=UTF-8 lineedit.c

and then ran diff on the outputs:

$ diff -u foo.po bar.po 
--- foo.po  2024-03-16 13:27:54.506623896 +0100
+++ bar.po  2024-03-16 13:27:31.994449031 +0100
@@ -1,5 +1,5 @@
 # SOME DESCRIPTIVE TITLE.
-# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
+# Copyright (C) YEAR ø
 # This file is distributed under the same license as the PACKAGE package.
 # FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
 #

armijnhemel commented 3 months ago

I am using this version of xgettext btw.

$ xgettext --version
xgettext (GNU gettext-tools) 0.22
Copyright (C) 1995-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Ulrich Drepper.

pombredanne commented 2 months ago

re:

Just dropping --omit-header would have been enough.

But then it does not work if I do not know the encoding, or it will create a plain ASCII and NOT a UTF-8 encoded .po file.

Using this file foo.c.zip and xgettext 0.21:

$ xgettext  --extract-all --no-wrap --output=no-copyright-with-utf.po  --from-code=UTF-8 foo.c 
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
$ xgettext  --extract-all --no-wrap --output=no-copyright-no-utf.po   foo.c 
xgettext: Non-ASCII string at foo.c:3.
          Please specify the source encoding through --from-code.
pombreda@computer4:~/tmp/xg/chardet-main/tests/iso-8859-2-slovene$ xgettext --copyright-holder="ø"  --extract-all --no-wrap --output=copyright-no-utf.po  foo.c 
xgettext: Non-ASCII string at foo.c:3.
          Please specify the source encoding through --from-code.
$ xgettext --copyright-holder="ø"  --extract-all --no-wrap --output=copyright-with-utf.po  --from-code=UTF-8 foo.c 
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
$ file *
copyright-with-utf.po:    GNU gettext message catalogue, Unicode text, UTF-8 text, with very long lines (399)
foo.c:                    ISO-8859 text, with very long lines (799)
foo.c.zip:                Zip archive data, at least v2.0 to extract, compression method=deflate
no-copyright-with-utf.po: GNU gettext message catalogue, ASCII text, with very long lines (399)

$ xgettext --omit-header  --extract-all --no-wrap --output=omit-with-utf.po  --from-code=UTF-8 foo.c 
foo.c:3: warning: The following msgid contains non-ASCII characters.
                  This will cause problems to translators who use a character encoding
                  different from yours. Consider using a pure ASCII msgid instead.
                  LJUBLJANA ? Zavod RS za zaposlovanje je na svojih spletnih straneh vzpostavil novo rubriko Skupaj do zaposlitve, v okviru katere bodo uporabnikom na voljo primeri dobrih praks oz. uspe�nih zgodb brezposelnih oseb, iskalcev zaposlitve in delodajalcev iz vse Slovenije. Trenutno so v rubriki objavljene tri zgodbe. Svoje izku�nje so javnosti zaupali Branko Ileni�, Darja Avgu�tin�i� in Mojca Rupert.
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
$ file *
copyright-with-utf.po:    GNU gettext message catalogue, Unicode text, UTF-8 text, with very long lines (399)
foo.c:                    ISO-8859 text, with very long lines (799)
foo.c.zip:                Zip archive data, at least v2.0 to extract, compression method=deflate
no-copyright-with-utf.po: GNU gettext message catalogue, ASCII text, with very long lines (399)
omit-with-utf.po:         GNU gettext message catalogue, ASCII text, with very long lines (399)

This is why using a fake copyright holder works to get UTF-8 output. All other modes can parse non-UTF-8 input BUT will return some random ASCII-like encoding.
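
As a rough sketch, the fake-copyright command and a check similar to what `file` reports could look like this in Python. The helper names `build_utf8_command` and `classify_po` are hypothetical, and `classify_po` is only a crude stand-in for `file`, not a reimplementation of it.

```python
from typing import List


def build_utf8_command(source: str, output: str) -> List[str]:
    """Build the xgettext call from this thread, with a non-ASCII copyright holder."""
    return [
        "xgettext",
        "--copyright-holder=ø",  # the dummy holder puts a non-ASCII character in the header
        "--extract-all",
        "--no-wrap",
        f"--output={output}",
        "--from-code=UTF-8",
        source,
    ]


def classify_po(data: bytes) -> str:
    """Roughly mimic the `file` check: 'ASCII', 'UTF-8', or 'other'."""
    if data.isascii():
        return "ASCII"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "UTF-8"


# Header lines as they appear in the diff earlier in this thread.
print(classify_po("# Copyright (C) YEAR ø\n".encode("utf-8")))                  # UTF-8
print(classify_po(b"# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER\n"))   # ASCII
```

The command list can be passed to `subprocess.run`, and `classify_po` can be applied to the bytes of the resulting .po file to confirm which case you ended up in.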