Open armijnhemel opened 3 months ago
Thanks! can you attach the problematic files? (with a URL so we can track where they are from)
Please note that this isn't true for all languages according to the xgettext manual:
What a mess :|
Thanks! can you attach the problematic files? (with a URL so we can track where they are from)
I thought I had added it. The attached file is not the original, but slightly modified.
Please note that this isn't true for all languages according to the xgettext manual:
What a mess :|
Not really as these are (I guess) treated as utf-8 by default?
So, the solution for this seems to be to actually not pass --omit-header to xgettext.
Of course, not every file that is non-ASCII will be in UTF-8, so it might be that some additional metrics are needed to find the right encoding to translate from. Alternatively, first try UTF-8, then Latin-1, then others. That will catch many instances.
If the file is not UTF-8, it might be that you would need to translate (encode/decode, etc.) the extracted string before further processing.
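A minimal sketch of the "first try UTF-8, then Latin-1" idea, in case it helps the discussion. The function name decode_with_fallback and the default candidate list are made up for illustration and are not existing code in this project. Note that Latin-1 maps every byte to a character, so it never fails and anything listed after it would never be tried:

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes cleanly.

    Hypothetical helper: the candidate order is an assumption, not a
    real detection heuristic.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


def to_utf8(data: bytes) -> bytes:
    """Re-encode the bytes as UTF-8 before further processing."""
    text, _ = decode_with_fallback(data)
    return text.encode("utf-8")
```

A successful Latin-1 decode only proves that the bytes are decodable, not that the guess is right (the foo.c example later in this thread is really ISO-8859-2), so a proper detector would still be the better option.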
I have proper encoding detection here https://github.com/nexB/typecode/blob/92feb7be3a87c1b541e7034c3f9797c96bc52305/src/typecode/magic2.py#L294 or something else here https://github.com/nexB/scancode-toolkit/blob/c80e502c06639c18e2ea606d63f2ac09f89230c1/src/textcode/analysis.py#L251 that we could use at some point.
Actually, to get proper UTF-8 output, you need to fool xgettext with a Unicode character as explained in https://bloomfield.online/posts/generate-utf-8-dictionaries-using-gettext/ by @peter-bloomfield ... but rather than adding an extra file, adding a dummy header copyright or similar with Unicode content seems to be enough. 🤪
xgettext --copyright-holder="ø" --extract-all --no-wrap --output=foo.po --from-code=UTF-8 ...
It will complain about non-UTF-8 content, but will keep on trucking.
Just dropping --omit-header would have been enough.
I ran the following two commands on the example I provided:
$ xgettext --copyright-holder="ø" --extract-all --no-wrap --output=bar.po --from-code=UTF-8 lineedit.c
$ xgettext --extract-all --no-wrap --output=foo.po --from-code=UTF-8 lineedit.c
and then ran diff on the outputs:
$ diff -u foo.po bar.po
--- foo.po 2024-03-16 13:27:54.506623896 +0100
+++ bar.po 2024-03-16 13:27:31.994449031 +0100
@@ -1,5 +1,5 @@
# SOME DESCRIPTIVE TITLE.
-# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
+# Copyright (C) YEAR ø
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
I am using this version of xgettext, btw.
$ xgettext --version
xgettext (GNU gettext-tools) 0.22
Copyright (C) 1995-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Ulrich Drepper.
re:
Just dropping --omit-header would have been enough.
But then it does not work if I do not know the encoding, or it will create a plain ASCII and NOT a UTF-8 encoded .po file.
Using this file foo.c.zip and xgettext 0.21:
$ xgettext --extract-all --no-wrap --output=no-copyright-with-utf.po --from-code=UTF-8 foo.c
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
$ xgettext --extract-all --no-wrap --output=no-copyright-no-utf.po foo.c
xgettext: Non-ASCII string at foo.c:3.
Please specify the source encoding through --from-code.
pombreda@computer4:~/tmp/xg/chardet-main/tests/iso-8859-2-slovene$ xgettext --copyright-holder="ø" --extract-all --no-wrap --output=copyright-no-utf.po foo.c
xgettext: Non-ASCII string at foo.c:3.
Please specify the source encoding through --from-code.
$ xgettext --copyright-holder="ø" --extract-all --no-wrap --output=copyright-with-utf.po --from-code=UTF-8 foo.c
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
$ file *
copyright-with-utf.po: GNU gettext message catalogue, Unicode text, UTF-8 text, with very long lines (399)
foo.c: ISO-8859 text, with very long lines (799)
foo.c.zip: Zip archive data, at least v2.0 to extract, compression method=deflate
no-copyright-with-utf.po: GNU gettext message catalogue, ASCII text, with very long lines (399)
$ xgettext --omit-header --extract-all --no-wrap --output=omit-with-utf.po --from-code=UTF-8 foo.c
foo.c:3: warning: The following msgid contains non-ASCII characters.
This will cause problems to translators who use a character encoding
different from yours. Consider using a pure ASCII msgid instead.
LJUBLJANA ? Zavod RS za zaposlovanje je na svojih spletnih straneh vzpostavil novo rubriko Skupaj do zaposlitve, v okviru katere bodo uporabnikom na voljo primeri dobrih praks oz. uspe�nih zgodb brezposelnih oseb, iskalcev zaposlitve in delodajalcev iz vse Slovenije. Trenutno so v rubriki objavljene tri zgodbe. Svoje izku�nje so javnosti zaupali Branko Ileni�, Darja Avgu�tin�i� in Mojca Rupert.
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
foo.c:3: invalid multibyte sequence
$ file *
copyright-with-utf.po: GNU gettext message catalogue, Unicode text, UTF-8 text, with very long lines (399)
foo.c: ISO-8859 text, with very long lines (799)
foo.c.zip: Zip archive data, at least v2.0 to extract, compression method=deflate
no-copyright-with-utf.po: GNU gettext message catalogue, ASCII text, with very long lines (399)
omit-with-utf.po: GNU gettext message catalogue, ASCII text, with very long lines (399)
This is why using a fake copyright holder works to get UTF-8 output. All other modes can parse non-UTF-8 input BUT will return some random ASCII-like encoding.
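For reference, a rough sketch of how that invocation could be wrapped. The helper names build_xgettext_command and run_xgettext are hypothetical; only the xgettext flags come from the commands shown in this thread. The stderr output is kept as bytes because xgettext echoes the raw, possibly non-UTF-8 msgid in its warnings:

```python
import subprocess


def build_xgettext_command(source, output, from_code="UTF-8", utf8_header=True):
    """Build the xgettext argument list used in the experiments above.

    Hypothetical helper; only flags already shown in this thread are used.
    """
    cmd = ["xgettext", "--extract-all", "--no-wrap", f"--output={output}"]
    if from_code:
        cmd.append(f"--from-code={from_code}")
    if utf8_header:
        # A non-ASCII copyright holder in the header is what made the
        # resulting .po file come out as UTF-8 in the tests above.
        cmd.append("--copyright-holder=ø")
    cmd.append(str(source))
    return cmd


def run_xgettext(source, output, **options):
    """Run xgettext and return the CompletedProcess (stdout/stderr as bytes)."""
    cmd = build_xgettext_command(source, output, **options)
    # No text=True here: the warnings may contain bytes that are not valid UTF-8.
    return subprocess.run(cmd, capture_output=True, check=False)
```

Calling run_xgettext("foo.c", "copyright-with-utf.po") would correspond to the last of the four invocations above, assuming xgettext is on the PATH.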
Something to consider is to rerun xgettext with different parameters in case it fails. The xgettext manual says:
Sometimes this will lead to incorrect results (or no results at all) and xgettext might need to be rerun with a different option. One example where it fails is util-linux/fdisk.c from a recent BusyBox:
The culprit here is actually this sequence:
where xgettext thinks this might be some UTF-8 character (but, of course, it is not a valid sequence). No output file is generated in this case.
https://git.busybox.net/busybox/tree/util-linux/fdisk.c?h=1_35_stable
Another example is the attached file (lineedit.c from BusyBox, zipped), where I have replaced a string on line 893. Here too xgettext fails and no output file will be created.
When using the --from-code parameter the string will not be correctly extracted, but an output file will be created:
It is not ideal, but better than getting no data at all. This could use some refinement.
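The rerun idea could look roughly like the sketch below. The function name extract_with_fallback, the list of attempts and the run parameter (there only so the logic can be exercised without xgettext installed) are assumptions for illustration. It treats "an output file exists" as success, which matches the behaviour described above but would also report a failure for a source file that simply contains no strings:

```python
import subprocess
from pathlib import Path


def extract_with_fallback(source, output, attempts=None, run=subprocess.run):
    """Try xgettext with several option sets until an output file appears.

    Returns the extra options that worked, or None if every attempt failed.
    Hypothetical helper; the default attempts are an assumption.
    """
    if attempts is None:
        attempts = [
            [],                     # default: no source encoding given
            ["--from-code=UTF-8"],  # lossy for non-UTF-8 input, but writes a file
        ]
    out = Path(output)
    for extra in attempts:
        out.unlink(missing_ok=True)  # do not mistake a stale file for success
        cmd = ["xgettext", "--extract-all", "--no-wrap",
               f"--output={out}", *extra, str(source)]
        run(cmd, capture_output=True, check=False)
        if out.exists():
            return extra
    return None
```

Returning the options that worked makes it possible to record which strings came from a lossy extraction, so they can be treated with more suspicion later.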
Please note that this isn't true for all languages according to the xgettext manual: