Open verhovsky opened 3 years ago
Looking at the docs, gettext starts the msgid with an empty line on purpose for "better alignment":
Each of untranslated-string and translated-string respects the C syntax for a character string, including the surrounding quotes and embedded backslashed escape sequences. When the time comes to write multi-line strings, one should not use escaped newlines. Instead, a closing quote should follow the last character on the line to be continued, and an opening quote should resume the string at the beginning of the following PO file line. For example:
msgid ""
"Here is an example of how one might continue a very long string\n"
"for the common case the string represents multi-line output.\n"
In this example, the empty string is used on the first line, to allow better alignment of the H from the word ‘Here’ over the f from the word ‘for’. In this example, the msgid keyword is followed by three strings, which are meant to be concatenated. Concatenating the empty string does not change the resulting overall string, but it is a way for us to comply with the necessity of msgid to be followed by a string on the same line, while keeping the multi-line presentation left-justified, as we find this to be a cleaner disposition. The empty string could have been omitted, but only if the string starting with ‘Here’ was promoted on the first line, right after msgid. It was not really necessary either to switch between the two last quoted strings immediately after the newline ‘\n’, the switch could have occurred after any other character, we just did it this way because it is neater.
https://www.gnu.org/software/gettext/manual/gettext.html#PO-Files
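To make the concatenation rule from the quoted manual concrete, here is a minimal Python sketch. The helper name join_po_strings is made up for illustration, and it assumes the segments only use escapes that mean the same thing in C and in Python (such as \n), so it is not a full PO parser.

```python
import ast


def join_po_strings(segments):
    """Concatenate quoted PO string segments into one Python string.

    Assumption: each segment is a double-quoted literal whose escapes
    (such as \\n) mean the same thing in C and in Python.
    """
    return "".join(ast.literal_eval(segment) for segment in segments)


# The three strings that follow the msgid keyword in the manual's example.
with_empty = [
    '""',
    '"Here is an example of how one might continue a very long string\\n"',
    '"for the common case the string represents multi-line output.\\n"',
]
# The same entry without the leading empty string.
without_empty = with_empty[1:]

message = join_po_strings(with_empty)
# The leading "" only affects the layout of the PO file, not the message.
print(message == join_po_strings(without_empty))
```

So the leading empty string is purely cosmetic: both layouts decode to the same msgid, which is why a tool that rewraps entries changes the diff but not the translations.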
This is due to differences between the Python standard library textwrap module and the gettext wrapper. I don't plan to rewrite a text wrapper from scratch! If someone comes up with a solution for this, please make a pull request.
This issue of the Python bug tracker seems related.
Is there an xgettext command that re-wraps a po file into the same format? That would work for my use case; I could just call that command every time I save a file with polib.
@verhovsky:
$ msgcat input.po -o output.po
or
$ msgcat input.po -w78 -o output.po
should do the trick.
For reference, it looks like in the gettext source code, the width is set here and then gets passed to ulc_width_linebreaks, and then the code does some stuff with the result. ulc_width_linebreaks is defined in gnulib here and documented here:
https://www.gnu.org/software/libunistring/manual/html_node/unilbrk_002eh.html
Ideally, someone would make Python bindings for unilbrk and then re-implement the rest of the code in write-po.c in Python, but I can confirm that using msgcat after every po.save() works just as well. The only caveat is that I had to set the CHARSET of the .po file (otherwise msgcat errors), which I did like this:
sed -i 's/charset=CHARSET/charset=UTF-8/' messages.po
Then you can just do this:
import subprocess
from polib import pofile
filename = "messages.po"
subprocess.run(["bash", "-c", "command -v msgcat"], check=True)  # check that the msgcat command is available
po = pofile(filename, encoding="utf-8")
po.save()
subprocess.run(["msgcat", filename, "-o", filename], check=True)
PS. -w78 is not correct; I got different results from the original. I think -w79 is the right one, but not passing it at all works as well.
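For anyone who wants the sed step and the msgcat call in one place, here is a rough sketch of the same workaround as a pure-Python helper. The function names are made up for illustration; it assumes msgcat is on PATH and that replacing every charset=CHARSET occurrence in the file is acceptable.

```python
import shutil
import subprocess
from pathlib import Path


def fix_charset_placeholder(text, charset="UTF-8"):
    """Replace the charset=CHARSET placeholder, like the sed command above."""
    return text.replace("charset=CHARSET", f"charset={charset}")


def msgcat_command(path, width=None):
    """Build the msgcat argument list that rewrites `path` in place."""
    cmd = ["msgcat", str(path)]
    if width is not None:
        # Same short option as in the thread, e.g. -w79.
        cmd.append(f"-w{width}")
    cmd += ["-o", str(path)]
    return cmd


def rewrap_with_msgcat(path, width=None):
    """Normalize the wrapping of a .po file by passing it through msgcat."""
    if shutil.which("msgcat") is None:
        raise RuntimeError("msgcat not found on PATH")
    po_path = Path(path)
    text = po_path.read_text(encoding="utf-8")
    po_path.write_text(fix_charset_placeholder(text), encoding="utf-8")
    subprocess.run(msgcat_command(po_path, width), check=True)
```

Calling rewrap_with_msgcat("messages.po") right after po.save() should give the same result as the snippet above; pass width only if the default wrapping does not match your existing files.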
I think that the easiest solution to this problem is to generate Python bindings for the Rust crate textwrap which offers multiplatform Unicode Line Breaking wrapping.
As a user of a pure Python library, I disagree with this. Adding a dependency on a library that requires compilation is not exactly the easiest solution.
You can serve wheels for a lot of platforms; it's very easy. In fact, I'm thinking of rewriting polib entirely in Rust, which would speed up the library when used from Python.
I'm not sure I understand your point. You're talking about a Rust rewrite; how on earth can this solve this particular issue?!
I just suggested writing Python bindings for the Rust crate textwrap and using them in polib to solve this problem. It is a very easy solution that does not involve compilation at installation time.
I understand that some of you are not receptive to the idea, so since I use polib in several of my projects, I'm thinking of rewriting it in Rust with Python bindings, which would give me, in addition to solving this problem, a considerable performance improvement.
xgettext -c somefile.c can produce a po file with an entry like this:

but then if you just re-save it using from polib import pofile; pofile("messages.po", encoding="utf-8").save(), it will wrap it differently:

polib (really Python's textwrap.wrap() function) puts the space at the beginning of the second line instead of at the end of the first.

This is an issue because using a command line tool that uses polib on a Django project would cause churn in the git history as it shuffles spaces between lines.

I don't know why it's starting msgid with an empty string.
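The leading-space behaviour can be reproduced with the standard library alone. The sketch below assumes the wrapper is called with drop_whitespace=False (whether polib passes exactly these options is an assumption), and shift_leading_spaces is a made-up helper that only illustrates the "space at the end of the previous line" layout; it does not reimplement gettext's line breaking.

```python
import textwrap


def shift_leading_spaces(lines):
    """Move leading spaces of each line to the end of the previous line."""
    result = []
    for line in lines:
        stripped = line.lstrip(" ")
        lead = line[: len(line) - len(stripped)]
        if result and lead:
            # Give the spaces back to the line that precedes them.
            result[-1] += lead
            if stripped:
                result.append(stripped)
        else:
            result.append(line)
    return result


# "alpha" fills the width exactly, so the space cannot stay on line one.
wrapped = textwrap.wrap("alpha beta", width=5, drop_whitespace=False)
print(wrapped)  # ['alpha', ' beta']

# Layout with the space kept at the end of the first line instead.
print(shift_leading_spaces(wrapped))  # ['alpha ', 'beta']
```

Both layouts concatenate to the same string, so the difference only shows up as diff noise when a file is written by one tool and rewritten by the other.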
Here's a bash session showing the issue with a code sample that causes xgettext to produce an entry that starts with an empty string: