Adjust line wrapping algorithm to be closer to the GNU gettext tooling

Kingdutch commented 4 years ago

In our project we use both tools for different parts of our translation pipeline. Unfortunately the inconsistency creates very noisy commits. Bringing them closer together will make reviewing changes made with tools relying on gettext-parser a lot easier to review.

Examples from a diff Below are examples of a diff. On the left is a .po file created by msginit and manipulated by GNU gettext tools. On the right side is the output of a call to gettext-parser's po.compile function.

afbeelding

The string @count Members are selected is 65 characters long, so it's not wrapped. However the entire line msgid_plural "@count Members are selected" is 80 characters long which seems to cause the GNU tools to wrap the line.

The same happens for afbeelding

afbeelding

Books have a built-in hierarchical navigation. Use for handbooks or is 77 characters long. The GNU tools seem to allow this with the space being on the first line as the 77th character. gettext-parser will wrap one space earlier.

This seems to occur more often than expected. For example in the following lines as well. afbeelding

afbeelding It seems the GNU tools have a bit more knowledge of HTML while the gettext-parser treats href=\"\">[social_mentions:mentioned_user]</a> as an unbreakable string, the GNU break makes more sense because <a href=\"\"> together is more important.

A similar case for this can be found below, where the space is found to be a better breaking point than within a tag.

afbeelding

I also saw the following which actually suggests that the GNU tools allow 77 characters on a line so it may be a better default than 76. I couldn't find the GNU tools' line-break algorithm so I'm not sure whether they special case spaces and dots or just count to 77. afbeelding

Open Social Branding will be replaced by site name (and slogan if available). is exactly 77 characters.

smhg commented 4 years ago

Thank you for your extensive comparison! Always a pleasure to see issues reported this way!

I'm all for matching gettext tools in every possible way.

That said, based on your HTML example, this might go deep.

Maybe we can wrap the C library? I don't know what the recommended approach is these days, so any input would be appreciated. Something like emscripten maybe? Browser support is important.

Kingdutch commented 4 years ago

Cool, yeah I tried searching around to see if I can figure out how this logic is implemented in the gettext tools but was unsuccesful. Perhaps someone else has experience there and can chime in.

I don't think wrapping the C library is an option if you want browser support (and I think it takes away some of the appeal of this library ^^).

One approach to this would be to brute force it. The examples in the start of this post could be added as test cases. Some testing tool could then run the same texts through msgfmt (I believe that simply formats a .po file right? Otherwise replace msgfmt in this text with another gettext tool ^^) and parse/compile with this library and diff the two.

Alternatively a framework like jest could be used together with Snapshot testing. The snapshot could be the output from the msgfmt tool. Then Jest can tell us the difference and we can go from there. This means we don't have to install the gettext tooling in the test environment but can use pure Javascript. As a bonus it would help spot regressions from the formatting made by future changes once the algorithm has improved 😋

One thing that could expose is that maybe changing from 76 to 77 as the default wrapping length already makes a good difference.

EDIT: I see this library already uses chai and mocha ('m a Jest fanboy sorry) which has this package for snapshotting: https://www.npmjs.com/package/mocha-chai-snapshot

smhg commented 4 years ago

If you see low-hanging fruit (any fruit, really ;)), please send in a PR. There are some line-folding tests which you can add yours to.

In general: Testing is a secondary issue I think. The hard part is getting a 1-on-1 match with gettext without reinventing the wheel (e.g. understanding HTML,...). That would need a solution first, if one exists.

vHeemstra commented 2 years ago

I'm not sure, but browsing through the GNU gettext git, it seems the wrap() method handles line breaking, using the ulc_width_linebreaks() helper method to find the best breaking position.

In the wrap() method, page_width seems to be the maximum number of characters/columns per line and it is used as a starting point. Indentation, opening/closing characters, etc, are subtracted and added to keep track of the current possible room/width on that line. (see here)

wrap() is used in many print methods, including message_print().

Maybe this helps shed a little light on what's happening. :)

smhg / gettext-parser

Adjust line wrapping algorithm to be closer to the GNU gettext tooling #51