Creating an empty .po file causes non-ASCII characters to be silently discarded from msgids

stevecotton commented 1 year ago

The quickstart guide in the po4a(1) manpage says "Simply create an empty file with the .pot extension in the specified po_directory (e.g. man/po4a/foo.pot), and po4a will fill it with the expected content."

I assumed that .po files could be created in the same way, by creating an empty one and running po4a po4a.cfg. Doing that fills the .po file, but silently strips non-ASCII characters out of the msgids as it does so. This seems to be a deliberate feature of gettext's msgmerge - if it's given an empty .po file and a UTF-8 .pot file, it assumes that the .po file should be ASCII, and strips letters with umlauts out of the msgids. Running it directly gives warnings about that, but they aren't shown when running it via po4a.

I have a German source file, and enabled UTF-8 in the .cfg file:

[po_directory] po
[type: text] 02_Beispiel.md en:en/02_Example.md
[options] --master-charset UTF-8 --localized-charset UTF-8

I'm not submitting a patch, as I'm not sure which way you'd prefer to handle it, but suggest either checking for empty files or adding "Don't create empty .po files, as these may cause the wrong charset to be used. Instead use the translators' tools to create a .po from the .pot." to the quickstart.
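
The "checking for empty files" option could look roughly like the sketch below. It is only an illustration, not po4a code: the function names po_declares_charset and find_headerless_po are made up here, and the check assumes the charset appears on a single "Content-Type:" header line, as gettext tools normally write it.

```shell
#!/bin/sh
# Hypothetical helper (not part of po4a): succeed only if the file is
# non-empty and declares a charset on a PO header line.
po_declares_charset() {
    [ -s "$1" ] && grep -q '^"Content-Type:.*charset=' "$1"
}

# Hypothetical wrapper: print every .po file in directory $1 that is empty or
# lacks a charset declaration; return non-zero if at least one was found.
find_headerless_po() {
    rc=0
    for po in "$1"/*.po; do
        [ -e "$po" ] || continue        # the glob matched nothing
        if ! po_declares_charset "$po"; then
            echo "$po"
            rc=1
        fi
    done
    return "$rc"
}
```

With the configuration above it could be called as find_headerless_po po before running po4a po4a.cfg, so that a problematic file is reported instead of being filled with mangled msgids.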

Debian bug #1022216 seems related, but is using po4a-updatepo.

I'm using Debian Bookworm with gettext version 0.21-12, and have checked that the bug is still reproducible with po4a c9f5cf97ebc6915d7d9c4a90707a98c8cb6ad7a2.

ciampix commented 9 months ago

Hi Steve, why not simply update the po4a documentation to recommend adding those two options, as you did? --master-charset UTF-8 --localized-charset UTF-8

stevecotton commented 9 months ago

I was going to refer to msgmerge's docs about how they want to preserve whichever charset the translator chooses, and that your suggestion would force UTF-8 instead. However, I've just found an interesting behavior of msgmerge 0.21 (as shipped in Debian stable), which looks like it'll need some additional bugs filed; I'll get to that, but not tonight.

When running msgmerge -U temp.po somefile.pot:

non-existent .po file: msgmerge refuses to create the file
empty .po file, or no header: msgmerge does not add a header, assumes ASCII, and mangles UTF-8
.po file header says ASCII: msgmerge changes the header to say UTF-8, and writes UTF-8
.po file header says GB2312: msgmerge changes the header to say UTF-8, and writes UTF-8
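
Given those observations, one possible workaround is to never hand msgmerge a completely empty file, and to seed it with a minimal header first. The sketch below assumes that a header declaring UTF-8 is enough for msgmerge to keep the msgids intact; that follows from the cases listed above but has not been verified separately, and seed_po_header is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal PO header declaring UTF-8 into an
# empty .po file. Non-empty files are left untouched.
seed_po_header() {
    po=$1
    if [ -s "$po" ]; then
        echo "$po: not empty, left untouched" >&2
        return 1
    fi
    # Quoted delimiter: the \n sequences are written literally, as PO expects.
    cat > "$po" <<'EOF'
msgid ""
msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
EOF
}
```

After seed_po_header temp.po, the msgmerge -U temp.po somefile.pot run would fall under the "header present" cases rather than the "empty file" case. Real headers carry more fields (Project-Id-Version, Language, and so on), which are omitted here for brevity.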

mquinson commented 9 months ago

I have yet another situation with msgmerge. I'm trying to update a UTF-8 PO file against an ISO-8859 POT file, and it mangles the non-ASCII chars:

$ ls
iso8859.pot  iso8859.up.po
$ file *
iso8859.pot:   GNU gettext message catalogue, ISO-8859 text
iso8859.up.po: GNU gettext message catalogue, Unicode text, UTF-8 text
$ grep charset= *
iso8859.pot:"Content-Type: text/plain; charset=ISO-8859-1\n"
iso8859.up.po:"Content-Type: text/plain; charset=UTF-8\n"
$ iconv -f UTF-8 -t Latin1 iso8859.up.po -o /dev/null
$ iconv -f iso-8859-1 -t UTF-8 iso8859.pot -o /dev/null

The iconv commands do not output any error, which suggests that each file's encoding matches the charset declared in its header (and the guess made by file). Let's now try to msgmerge the PO file.

$ msgmerge iso8859.up.po iso8859.pot
# Language up translations for po package
# Copyright (C) 2020 Free Software Foundation, Inc.
# This file is distributed under the same license as the po package.
# Automatically generated, 2020.
#
msgid ""
msgstr ""
"Project-Id-Version: po 4a\n"
"POT-Creation-Date: 2024-01-02 02:22+0100\n"
"PO-Revision-Date: 2020-04-09 17:33+0200\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: up\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#. type: =head1
#: iso8859.pod:1
#, fuzzy
iso8859.up.po:21: invalid multibyte sequence
iso8859.up.po:21: invalid multibyte sequence
msgid "Ttulo de prueba"
msgstr "TÍTULO DE PRUEBA"

#. type: textblock
#: iso8859.pod:3
#, fuzzy
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
iso8859.up.po:26: invalid multibyte sequence
msgid "blbleble llalala"
msgstr "BLÈBLEBLE LÁLALALA"

All non-ASCII chars of the msgids get mangled for some reason (the 'invalid multibyte sequence' lines come from msgmerge's stderr and are not part of the actual file content). I'm puzzled. I'm using msgmerge 0.21 from Debian testing.

Any help would be really welcome here.
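
For anyone reproducing this, the grep and iconv checks from the transcript above can be combined into a small helper. This is only a sketch: the function names are made up, and it assumes the charset sits on a single "Content-Type:" header line.

```shell
#!/bin/sh
# Hypothetical helper: print the charset declared in a PO/POT header.
po_declared_charset() {
    sed -n 's/^"Content-Type:.*charset=\([^\\"]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed if the file decodes cleanly from the charset
# it declares. Returns 2 if no charset is declared at all.
po_charset_matches() {
    cs=$(po_declared_charset "$1")
    [ -n "$cs" ] || return 2
    iconv -f "$cs" -t UTF-8 "$1" > /dev/null 2>&1
}
```

Note that this is a necessary check rather than a sufficient one: every byte sequence is valid ISO-8859-1, so a file declared as ISO-8859-1 always passes, whereas a file declared as UTF-8 fails as soon as it contains a stray Latin-1 byte.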

mquinson commented 9 months ago

The files I used in this test: iso8859.up.po.txt iso8859.pot.txt

mquinson commented 9 months ago

Submitted as https://savannah.gnu.org/bugs/index.php?65104

mquinson commented 9 months ago

I guess that we should force UTF-8 on PO and POT files to stay safe. Do you have a better idea?

mquinson / po4a

Creating an empty .po file causes non-ASCII characters to be silently discarded from msgids #442