Confused po4a-gettextize by byte order markers and asciidoc

petterreinholdtsen commented 2 years ago

The parsing of asciidoc files seems to get very confused when it finds a byte order mark in the text file. The following demonstrates the problem. The two text files contain one text block each, but po4a-gettextize claims there is no text block in one of them. The two files are attached as a tarball, asciidoc-with-bom.tar.gz.

% file a_e*
a_en.adoc: ASCII text
a_es.adoc: UTF-8 Unicode (with BOM) text
% cat a_e*
:lang: en
:lang: es
% LANG=C po4a-gettextize -f AsciiDoc -M UTF-8 -m a_en.adoc -l a_es.adoc
Use of uninitialized value $newchar in substitution iterator at /usr/share/perl5/Locale/Po4a/Po.pm line 1619.
po4a gettextize: Original has less strings than the translation (0<1). Please fix it 
               by removing the extra entry from the translated file. You may need an 
               addendum (cf po4a(7)) to reput the chunk in place after 
               gettextization. A possible cause is that a text duplicated in the 
               original is not translated the same way each time. Remove one of the 
               translations, and you're fine.

The gettextization failed (once again). Don't give up, gettextizing is a subtle art, but this is only needed once to convert a project to the gorgeous luxus offered by po4a to translators.
Please refer to the po4a(7) documentation, the section "HOWTO convert a pre-existing translation to po4a?" contains several hints to help you in your task
%

I expected po4a-gettextize to handle byte order marks in text files, as there are several text editors on Windows that insert them when saving files.
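
For anyone without the tarball, the two inputs can probably be recreated by hand. This is only a sketch: the file contents are assumed from the `cat` output above, and the temporary directory (`workdir`) is a name introduced here for illustration. It writes one plain ASCII file and one file starting with the three bytes EF BB BF (the UTF-8 encoding of U+FEFF), then dumps the first three bytes of each so the otherwise invisible difference shows up.

```shell
#!/bin/sh
# Hypothetical scratch directory, so nothing in the current directory is overwritten.
workdir=$(mktemp -d)

# Plain ASCII original, as shown by `cat` in the report.
printf ':lang: en\n' > "$workdir/a_en.adoc"

# Translation with a leading UTF-8 BOM.
# \357\273\277 is the octal spelling of the bytes EF BB BF.
printf '\357\273\277:lang: es\n' > "$workdir/a_es.adoc"

# Show the first three bytes of each file in hex.
od -An -tx1 -N3 "$workdir/a_en.adoc"
od -An -tx1 -N3 "$workdir/a_es.adoc"
```

The first dump should show the bytes 3a 6c 61 (the start of `:lang`), the second ef bb bf. Running the po4a-gettextize command from the report inside `$workdir` should then exercise the same code path, assuming the attached files contain nothing else.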

mquinson commented 2 years ago

Hello, thanks for this report.

What would you advise? To simply ignore these markers, or to try to restore them afterward? I suspect that ignoring is the right approach here, but I'm not sure.

Thanks,

petterreinholdtsen commented 2 years ago

[Martin Quinson]

> Hello, thanks for this report.
>
> What would you advise? To simply ignore these markers, or to try to restore them afterward? I suspect that ignoring is the right approach here, but I'm not sure.

I understand byte order markers to be there to identify the input byte order, and would ignore them on output as long as the output is UTF-8 or another byte-based charset. It could be useful to include them for UTF-16 output, but in the UTF-8 case they seem to be simply fluff. I saw someone argue that a BOM can be used to identify UTF-8, but I am not sure I buy that argument.
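
Until po4a handles this itself, ignoring the marker can be approximated outside po4a by stripping it from the input first. The following is a minimal sketch of that stopgap, not a po4a feature: `strip_bom` and the `a_es.nobom.adoc` file name are made up here, and it only handles a UTF-8 BOM at the very start of the file.

```shell
#!/bin/sh
# strip_bom FILE: print FILE to stdout without a leading UTF-8 BOM.
# Files that do not start with EF BB BF are passed through unchanged.
strip_bom() {
    # LC_ALL=C makes sed match raw bytes instead of decoded characters.
    # The substitution is restricted to line 1 and anchored with ^,
    # so only a BOM at the very beginning of the file is removed.
    LC_ALL=C sed "1s/^$(printf '\357\273\277')//" "$1"
}

# Possible usage with the files from the report (not verified here):
#   strip_bom a_es.adoc > a_es.nobom.adoc
#   po4a-gettextize -f AsciiDoc -M UTF-8 -m a_en.adoc -l a_es.nobom.adoc
```

Writing to a new file rather than editing in place keeps the original translation untouched, which seems safer while it is still unclear whether the marker should be restored afterward.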

-- Happy hacking Petter Reinholdtsen

mquinson commented 2 years ago

I had another look at the source code, and I get the feeling that we are handling file encoding wrongly in po4a. I am considering using File::BOM all over the place to fix it, but that's quite an intrusive change that requires time.

It would probably also be possible to hack around it by adding something along these lines in Transtractor::read, but I'm not 100% confident that this will be enough. It somehow feels like hiding the issue instead of fixing it...

mquinson / po4a

Confused po4a-gettextize by byte order markers and asciidoc #333