Open Kingdutch opened 4 years ago
Thank you for your extensive comparison! Always a pleasure to see issues reported this way!
I'm all for matching gettext tools in every possible way.
That said, based on your HTML example, this might go deep.
Maybe we can wrap the C library? I don't know what the recommended approach is these days, so any input would be appreciated. Something like emscripten maybe? Browser support is important.
Cool, yeah I tried searching around to see if I can figure out how this logic is implemented in the gettext tools but was unsuccesful. Perhaps someone else has experience there and can chime in.
I don't think wrapping the C library is an option if you want browser support (and I think it takes away some of the appeal of this library ^^).
One approach to this would be to brute force it. The examples in the start of this post could be added as test cases. Some testing tool could then run the same texts through msgfmt
(I believe that simply formats a .po file right? Otherwise replace msgfmt
in this text with another gettext tool ^^) and parse
/compile
with this library and diff the two.
Alternatively a framework like jest
could be used together with Snapshot testing. The snapshot could be the output from the msgfmt
tool. Then Jest can tell us the difference and we can go from there. This means we don't have to install the gettext tooling in the test environment but can use pure Javascript. As a bonus it would help spot regressions from the formatting made by future changes once the algorithm has improved 😋
One thing that could expose is that maybe changing from 76 to 77 as the default wrapping length already makes a good difference.
EDIT: I see this library already uses chai and mocha ('m a Jest fanboy sorry) which has this package for snapshotting: https://www.npmjs.com/package/mocha-chai-snapshot
If you see low-hanging fruit (any fruit, really ;)), please send in a PR. There are some line-folding tests which you can add yours to.
In general: Testing is a secondary issue I think. The hard part is getting a 1-on-1 match with gettext without reinventing the wheel (e.g. understanding HTML,...). That would need a solution first, if one exists.
I'm not sure, but browsing through the GNU gettext git, it seems the wrap()
method handles line breaking, using the ulc_width_linebreaks()
helper method to find the best breaking position.
In the wrap()
method, page_width
seems to be the maximum number of characters/columns per line and it is used as a starting point. Indentation, opening/closing characters, etc, are subtracted and added to keep track of the current possible room/width on that line. (see here)
wrap()
is used in many print methods, including message_print()
.
Maybe this helps shed a little light on what's happening. :)
In our project we use both tools for different parts of our translation pipeline. Unfortunately the inconsistency creates very noisy commits. Bringing them closer together will make reviewing changes made with tools relying on gettext-parser a lot easier to review.
Examples from a diff Below are examples of a diff. On the left is a .po file created by msginit and manipulated by GNU gettext tools. On the right side is the output of a call to gettext-parser's
po.compile
function.The string
<b><em class=\"placeholder\">@count</em> Members</b> are selected
is 65 characters long, so it's not wrapped. However the entire linemsgid_plural "<b><em class=\"placeholder\">@count</em> Members</b> are selected"
is 80 characters long which seems to cause the GNU tools to wrap the line.The same happens for
<em>Books</em> have a built-in hierarchical navigation. Use for handbooks or
is 77 characters long. The GNU tools seem to allow this with the space being on the first line as the 77th character. gettext-parser will wrap one space earlier.This seems to occur more often than expected. For example in the following lines as well.
It seems the GNU tools have a bit more knowledge of HTML while the gettext-parser treats
href=\"\">[social_mentions:mentioned_user]</a>
as an unbreakable string, the GNU break makes more sense because<a href=\"\">
together is more important.A similar case for this can be found below, where the space is found to be a better breaking point than within a tag.
I also saw the following which actually suggests that the GNU tools allow 77 characters on a line so it may be a better default than 76. I couldn't find the GNU tools' line-break algorithm so I'm not sure whether they special case spaces and dots or just count to 77.
Open Social Branding will be replaced by site name (and slogan if available).
is exactly 77 characters.