Multiline msgids aren't wrapped the same way that xgettext wraps them

verhovsky commented 3 years ago

xgettext -c somefile.c can produce a po file with an entry like this:

msgid ""
"this is a long piece of text that should wrap on to multiple lines this is a "
"long piece of text that should wrap on to multiple lines"
msgstr ""

but then if you just re-save it using from polib import pofile; pofile("messages.po", encoding="utf-8").save(), it will wrap it differently:

msgid ""
"this is a long piece of text that should wrap on to multiple lines this is a"
" long piece of text that should wrap on to multiple lines"
msgstr ""

polib (really Python's textwrap.wrap() function) puts the space at the beginning of the second line instead of at the end of the first.
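The behaviour can be reproduced with textwrap alone. This is a minimal sketch, not polib's actual code: the width of 76 (78 columns minus the two surrounding quotes) and the drop_whitespace=False / break_long_words=False arguments are assumptions about how polib calls the wrapper.

```python
import textwrap

text = (
    "this is a long piece of text that should wrap on to multiple lines "
    "this is a long piece of text that should wrap on to multiple lines"
)

# Assumption: 78 columns minus the two surrounding quotes leaves 76 characters.
# drop_whitespace=False keeps the separating space, and textwrap carries it
# over to the start of the next line when it no longer fits on the current one.
lines = textwrap.wrap(text, 76, drop_whitespace=False, break_long_words=False)

for line in lines:
    print('"%s"' % line)
```

The first line is exactly 76 characters long, so the following space does not fit and becomes the first character of the second line.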

This is an issue because using a command line tool that uses polib on a Django project would cause churn in the git history as it shuffles spaces between lines.

I don't know why it's starting msgid with an empty string.

Here's a bash session showing the issue with a code sample that causes xgettext to produce an entry that starts with an empty string:

$ mkdir /tmp/polib_test
$ cd /tmp/polib_test
$ cat > test.c
main( ) {
    printf(gettext("this is a long piece of text that should wrap on to multiple lines this is a long piece of text that should wrap on to multiple lines"))
}
$ xgettext -c test.c 
$ ls
messages.po  test.c
$ cat messages.po 
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2020-11-12 09:24-0500\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"

#: test.c:2
#, c-format
msgid ""
"this is a long piece of text that should wrap on to multiple lines this is a "
"long piece of text that should wrap on to multiple lines"
msgstr ""
$ cp messages.po original.po
$ python3
Python 3.9.0+ (default, Oct 19 2020, 09:51:18) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from polib import pofile
>>> po = pofile("messages.po", encoding="utf-8")
>>> po.save()
>>> 
$ diff original.po messages.po 
23,24c23,24
< "this is a long piece of text that should wrap on to multiple lines this is a "
< "long piece of text that should wrap on to multiple lines"
---
> "this is a long piece of text that should wrap on to multiple lines this is a"
> " long piece of text that should wrap on to multiple lines"
$ xgettext --version
xgettext (GNU gettext-tools) 0.19.8.1
Copyright (C) 1995-1998, 2000-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Ulrich Drepper.

verhovsky commented 3 years ago

Looking at the docs, gettext starts the msgid with an empty line on purpose for "better alignment":

Each of untranslated-string and translated-string respects the C syntax for a character string, including the surrounding quotes and embedded backslashed escape sequences. When the time comes to write multi-line strings, one should not use escaped newlines. Instead, a closing quote should follow the last character on the line to be continued, and an opening quote should resume the string at the beginning of the following PO file line. For example:
msgid ""
"Here is an example of how one might continue a very long string\n"
"for the common case the string represents multi-line output.\n"
In this example, the empty string is used on the first line, to allow better alignment of the H from the word ‘Here’ over the f from the word ‘for’. In this example, the msgid keyword is followed by three strings, which are meant to be concatenated. Concatenating the empty string does not change the resulting overall string, but it is a way for us to comply with the necessity of msgid to be followed by a string on the same line, while keeping the multi-line presentation left-justified, as we find this to be a cleaner disposition. The empty string could have been omitted, but only if the string starting with ‘Here’ was promoted on the first line, right after msgid. It was not really necessary either to switch between the two last quoted strings immediately after the newline ‘\n’, the switch could have occurred after any other character, we just did it this way because it is neater.

https://www.gnu.org/software/gettext/manual/gettext.html#PO-Files
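Because the quoted strings are simply concatenated, both wrappings decode to the same msgid, and the difference only matters for diffs. A small sketch of that concatenation follows; join_po_strings is a hypothetical helper, and it assumes the lines contain no backslash escapes.

```python
def join_po_strings(quoted_lines):
    """Concatenate the quoted strings of one PO entry.

    Assumes every line is a double-quoted string without backslash escapes.
    """
    return "".join(line.strip()[1:-1] for line in quoted_lines)


xgettext_style = [
    '""',
    '"this is a long piece of text that should wrap on to multiple lines this is a "',
    '"long piece of text that should wrap on to multiple lines"',
]
polib_style = [
    '""',
    '"this is a long piece of text that should wrap on to multiple lines this is a"',
    '" long piece of text that should wrap on to multiple lines"',
]

same = join_po_strings(xgettext_style) == join_po_strings(polib_style)
print(same)
```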

izimobil commented 3 years ago

This is due to the differences between the Python standard library's textwrap module and the gettext wrapper. I don't plan to rewrite a text wrapper from scratch! If someone comes up with a solution for this, please make a pull request.
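One possible direction, short of a full rewrite, is to post-process the output of textwrap and move leading spaces back to the end of the previous line. This is only a heuristic sketch: wrap_trailing_space is a hypothetical name, the width of 76 is an assumption, and it does not implement the Unicode line breaking rules that gettext uses, so it may still differ from xgettext on other inputs.

```python
import textwrap


def wrap_trailing_space(text, width=76):
    """Wrap text, keeping separator spaces at the end of a line.

    Heuristic only: lines can end up longer than `width` by the number of
    spaces that were moved.
    """
    lines = textwrap.wrap(
        text, width, drop_whitespace=False, break_long_words=False
    )
    fixed = []
    for line in lines:
        stripped = line.lstrip(" ")
        moved = line[: len(line) - len(stripped)]
        if fixed and moved:
            # Give the leading spaces back to the previous line.
            fixed[-1] += moved
            line = stripped
        if line:
            fixed.append(line)
    return fixed


text = (
    "this is a long piece of text that should wrap on to multiple lines "
    "this is a long piece of text that should wrap on to multiple lines"
)
result = wrap_trailing_space(text)
for line in result:
    print('"%s"' % line)
```

For the string from this issue, the result matches the xgettext layout, with the space at the end of the first line.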

mondeja commented 3 years ago

This issue of the Python bug tracker seems related.

verhovsky commented 3 years ago

Is there an xgettext command that re-wraps a po file into the same format? That would work for my use case: I could just call that command every time I save a file with polib.

izimobil commented 3 years ago

@verhovsky:

$ msgcat input.po -o output.po or $ msgcat input.po -w78 -o output.po

should do the trick.

verhovsky commented 3 years ago

For reference, it looks like in the gettext source code, the width is set here

https://git.savannah.gnu.org/cgit/gettext.git/tree/gettext-tools/src/write-po.c?id=cd861ce28d9c2bb98c05ff8b5580bec2c805d4bf#n1007

and then gets passed to ulc_width_linebreaks

https://git.savannah.gnu.org/cgit/gettext.git/tree/gettext-tools/src/write-po.c?id=cd861ce28d9c2bb98c05ff8b5580bec2c805d4bf#n1035

and then the code does some stuff with the result.

ulc_width_linebreaks is defined in gnulib here

https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/unilbrk/ulc-width-linebreaks.c;hb=HEAD

documented here

https://www.gnu.org/software/libunistring/manual/html_node/unilbrk_002eh.html

Ideally, someone would make Python bindings for unilbrk and then re-implement the rest of the code in write-po.c in Python, but I can confirm that using msgcat after every po.save() works just as well. The only caveat is that I had to set the CHARSET of the .po file (otherwise msgcat errors), which I did like this:

sed -i 's/charset=CHARSET/charset=UTF-8/' messages.po

Then you can just do this:

import shutil
import subprocess
from polib import pofile

filename = "messages.po"

# Check that the msgcat command is available before touching the file.
if shutil.which("msgcat") is None:
    raise RuntimeError("msgcat not found; install GNU gettext")

po = pofile(filename, encoding="utf-8")
po.save()
# Let msgcat re-wrap the saved file in place.
subprocess.run(["msgcat", filename, "-o", filename], check=True)

PS. -w78 is not correct: I got different results from the original. I think -w79 is the right one, but not passing it at all works as well.
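The sed step can also be done from Python, so the whole workaround lives in one script. A minimal sketch follows; set_po_charset is a hypothetical helper, and unlike the sed command it only replaces the first occurrence in the file, on the assumption that the placeholder appears only in the header.

```python
from pathlib import Path


def set_po_charset(path, charset="UTF-8"):
    """Replace the xgettext placeholder charset=CHARSET in a PO header.

    Returns True if the file was changed, False otherwise.
    """
    po_path = Path(path)
    text = po_path.read_text(encoding="utf-8")
    # Only the first occurrence is replaced; it is assumed to be the header.
    fixed = text.replace("charset=CHARSET", "charset=" + charset, 1)
    if fixed == text:
        return False
    po_path.write_text(fixed, encoding="utf-8")
    return True
```

Calling set_po_charset("messages.po") before running msgcat should avoid the charset error described above.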

mondeja commented 1 year ago

I think that the easiest solution to this problem is to generate Python bindings for the Rust crate textwrap which offers multiplatform Unicode Line Breaking wrapping.

tosky commented 1 year ago

> I think that the easiest solution to this problem is to generate Python bindings for the Rust crate textwrap which offers multiplatform Unicode Line Breaking wrapping.

As a user of a pure-Python library, I disagree with this. Adding a dependency on a library which requires recompilation is not exactly the easiest solution.

mondeja commented 1 year ago

> Adding a dependency on a library which requires recompilation is not exactly the easiest solution.

You can serve wheels for a lot of platforms; it's very easy. In fact, I'm thinking of rewriting polib entirely in Rust, which would also make the library faster when used from Python.

izimobil commented 1 year ago

>> Adding a dependency on a library which requires recompilation is not exactly the easiest solution.

> You can serve wheels for a lot of platforms; it's very easy. In fact, I'm thinking of rewriting polib entirely in Rust, which would also make the library faster when used from Python.

I'm not sure I understand your point. You're talking about a Rust rewrite; how on earth can this solve this particular issue?!

mondeja commented 1 year ago

> I'm not sure I understand your point. You're talking about a Rust rewrite; how on earth can this solve this particular issue?!

I just suggested writing Python bindings for the Rust crate textwrap and using them in polib to solve this problem. It is a very easy solution that does not involve compilation at installation time.

I understand that some of you are not receptive to the idea, so, since I use polib in several of my projects, I'm thinking of rewriting it in Rust with bindings for Python, which would give me, in addition to solving this problem, a considerable performance improvement.

izimobil / polib

Multiline msgids aren't wrapped the same way that xgettext wraps them #96