Closed by stevecotton 9 months ago
Hi Steve, why not simply update the po4a documentation to recommend adding those two options, as you did? --master-charset UTF-8 --localized-charset UTF-8
I was going to refer to msgmerge's docs about how they want to preserve whichever charset the translator chooses, and that your suggestion would force UTF-8 instead. However, I've just found an interesting behavior of msgmerge 0.21 (as shipped in Debian stable), which looks like it'll need some additional bugs filed; I'll get to that, but not tonight.
When running msgmerge -U temp.po somefile.pot:
I have yet another situation with msgmerge. I'm trying to update a UTF-8 PO file against an ISO-8859 POT file, and it mangles the non-ASCII chars:
$ ls
iso8859.pot iso8859.up.po
$ file *
iso8859.pot: GNU gettext message catalogue, ISO-8859 text
iso8859.up.po: GNU gettext message catalogue, Unicode text, UTF-8 text
$ grep charset= *
iso8859.pot:"Content-Type: text/plain; charset=ISO-8859-1\n"
iso8859.up.po:"Content-Type: text/plain; charset=UTF-8\n"
$ iconv -f UTF-8 -t Latin1 iso8859.up.po -o /dev/null
$ iconv -f iso-8859-1 -t UTF-8 iso8859.pot -o /dev/null
The iconv commands report no errors, which indicates that each file's encoding matches the charset declared in its header (and the guess made by file). Now let's try to msgmerge the PO file.
$ msgmerge iso8859.up.po iso8859.pot
# Language up translations for po package
# Copyright (C) 2020 Free Software Foundation, Inc.
# This file is distributed under the same license as the po package.
# Automatically generated, 2020.
#
msgid ""
msgstr ""
"Project-Id-Version: po 4a\n"
"POT-Creation-Date: 2024-01-02 02:22+0100\n"
"PO-Revision-Date: 2020-04-09 17:33+0200\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: up\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
#. type: =head1
#: iso8859.pod:1
#, fuzzy
iso8859.up.po:21: invalid multibyte sequence
iso8859.up.po:21: invalid multibyte sequence
msgid "Ttulo de prueba"
msgstr "TÍTULO DE PRUEBA"
#. type: textblock
#: iso8859.pod:3
#, fuzzy
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
msgid "blbleble llalala"
msgstr "BLÈBLEBLE LÁLALALA"
All non-ASCII chars of the msgids get mangled for some reason (the 'invalid multibyte sequence' lines are part of the msgmerge stderr, not of the actual file content). I'm puzzled. I'm using msgmerge 0.21 from Debian testing.
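To rule out a mismatch between the declared charset and the actual bytes before blaming msgmerge, the grep and iconv checks above could be wrapped in a small helper. This is only a sketch: the function names po_declared_charset and po_charset_ok are made up for illustration, and it relies on nothing beyond sed, head and iconv. Note that the check is only meaningful for multi-byte charsets such as UTF-8, since any byte sequence decodes without error as ISO-8859-1.

```shell
#!/bin/sh
# Hypothetical helper: print the charset declared in a PO/POT header.
po_declared_charset() {
    # The header line looks like: "Content-Type: text/plain; charset=UTF-8\n"
    sed -n 's/^"Content-Type:.*charset=\([^\\"]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed only if the file decodes cleanly
# in the charset that its own header declares.
po_charset_ok() {
    charset=$(po_declared_charset "$1")
    # Exit status 2 means no charset declaration was found.
    [ -n "$charset" ] || return 2
    iconv -f "$charset" -t UTF-8 "$1" >/dev/null 2>&1
}
```

For example, po_charset_ok iso8859.up.po && echo consistent would print "consistent" for the files in this report, given that the manual iconv runs succeeded.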
Any help would be really welcome here.
The files I used in this test: iso8859.up.po.txt iso8859.pot.txt
Submitted as https://savannah.gnu.org/bugs/index.php?65104
I guess that we should force UTF-8 on PO and POT files to stay safe. Do you have a better idea?
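One way to try the "force UTF-8" idea by hand is to re-encode the POT file and rewrite its header before merging. The sketch below is an assumption about how that could be done with iconv and sed; po_to_utf8 is a hypothetical name, and gettext's own msgconv -t UTF-8 may well be the better tool for the job. Whether merging against the converted file actually avoids the mangling has not been verified here.

```shell
#!/bin/sh
# Hypothetical helper: write a UTF-8 copy of a PO/POT file and
# update the charset named in its Content-Type header to match.
po_to_utf8() {
    src=$1
    dst=$2
    # Read the charset the file currently declares.
    charset=$(sed -n 's/^"Content-Type:.*charset=\([^\\"]*\).*/\1/p' "$src" | head -n 1)
    [ -n "$charset" ] || { echo "no charset declared in $src" >&2; return 1; }
    # Re-encode the bytes, then rewrite the declaration.
    iconv -f "$charset" -t UTF-8 "$src" |
        sed 's/^\("Content-Type:.*charset=\)[^\\"]*/\1UTF-8/' > "$dst"
}
```

With the files from this report, that would be po_to_utf8 iso8859.pot iso8859.utf8.pot followed by msgmerge iso8859.up.po iso8859.utf8.pot.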
The quickstart guide in the po4a(1) manpage says "Simply create an empty file with the .pot extension in the specified po_directory (e.g. man/po4a/foo.pot), and po4a will fill it with the expected content."
I assumed that .po files could be created in the same way, by creating an empty one and running po4a po4a.cfg. Doing that fills the .po file, but silently strips non-ASCII characters out of the msgids as it does so. This seems to be a deliberate feature of gettext's msgmerge: if it's given an empty .po file and a UTF-8 .pot file, it assumes that the .po file should be ASCII, and strips letters with umlauts out of the msgids. Running it directly gives warnings about that, but they aren't shown when running it via po4a.
I have a German source file, and enabled UTF-8 in the .cfg file:
I'm not submitting a patch, as I'm not sure which way you'd prefer to handle it, but suggest either checking for empty files or adding "Don't create empty .po files, as these may cause the wrong charset to be used. Instead use the translators' tools to create a .po from the .pot." to the quickstart.
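The "checking for empty files" option could look roughly like the pre-flight check below, run before invoking po4a. It is only an illustration of the idea, not a proposed patch: find_empty_po and check_no_empty_po are hypothetical names, and the directory to scan is whatever po_directory the project uses.

```shell
#!/bin/sh
# Hypothetical helper: list zero-length .po files under a directory.
find_empty_po() {
    find "$1" -type f -name '*.po' -size 0
}

# Hypothetical pre-flight check: fail if any empty .po file exists,
# since an empty file carries no charset declaration.
check_no_empty_po() {
    empty=$(find_empty_po "$1")
    if [ -n "$empty" ]; then
        echo "Refusing to continue; empty .po files found:" >&2
        echo "$empty" >&2
        return 1
    fi
}
```

A wrapper script might then run check_no_empty_po man/po4a && po4a po4a.cfg. For creating the .po in the first place, gettext's msginit (for example msginit -i foo.pot -l de -o de.po --no-translator) is one candidate for "the translators' tools", though that is an assumption rather than something tested in this report.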
Debian bug #1022216 seems related, but is using po4a-updatepo.
I'm using Debian Bookworm with gettext version 0.21-12, and have checked that the bug is still reproducible with po4a c9f5cf97ebc6915d7d9c4a90707a98c8cb6ad7a2.