mquinson / po4a

Maintain the translations of your documentation with ease (PO for anything)
http://po4a.org/
GNU General Public License v2.0

would like po4a to support *roff `\c` escape sequence #527

Open g-branden-robinson opened 1 month ago

g-branden-robinson commented 1 month ago

I was reading Locale::Po4a::Man(3pm) today and happened across the following language.

> To summarise this section, keep simple, and don’t try to be clever while authoring your man pages.

Excellent advice!

> A lot of things are possible in nroff, and not supported by this parser. For example, don’t try to mess with \c to interrupt the text processing (like 40 pages on my box do).

Ouch!

As groff maintainer, I would like to petition the po4a project to support \c.

I am aware that historically, approximately no one has been able to clearly explain what \c does (how does it both "interrupt" and "continue", depending on which manual you read?), which may explain why po4a's parser has proven reluctant to apply any interpretation to it.

Here is the full explanation, from groff's Texinfo manual:

 -- Escape sequence: \c
 -- Register: \n[.int]
     '\c' continues an output line.  Nothing after it on the input line
     is formatted.  In contrast to '\<RET>', a line after '\c' remains a
     new input line, so a control character is recognized at its
     beginning.  The visual results depend on whether filling is
     enabled; see *note Manipulating Filling and Adjustment::.

        * If filling is enabled, a word interrupted with '\c' is
          continued with the text on the next input text line, without
          an intervening space.

               This is a te\c
               st.
                   => This is a test.

        * If filling is disabled, the next input text line after '\c' is
          handled as a continuation of the same input text line.

               .nf
               This is a \c
               test.
                   => This is a test.

     An intervening control line that causes a break overrides '\c',
     flushing out the pending output line in the usual way.

     The '.int' register contains a positive value if the most recently
     formatted text was continued with '\c'; this datum is associated
     with the environment (*note Environments::).(2)  (*note Line
     Continuation-Footnote-2::)

In man page applications, its interpretation should be simple:

There are two cases: filling enabled and filling disabled.

If filling is enabled, \c means "don't put a space on the output when you encounter the next newline".

If filling is disabled, \c means "don't put a line break on the output when you encounter the next newline".
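To make the two cases concrete, here is a minimal Python sketch of that interpretation. It is not po4a's or groff's actual code: the function name `apply_continuation` is hypothetical, control lines are not handled at all, and it assumes every `\c` in the input really is the escape sequence (an escaped backslash followed by `c` would be misread).

```python
def apply_continuation(lines, fill=True):
    """Join text lines the way \\c suggests (a simplified model).

    fill=True:  a newline normally becomes a space; after \\c it becomes nothing.
    fill=False: a newline normally stays a line break; after \\c it becomes nothing.
    Anything after \\c on the same input line is dropped, as the manual says.
    """
    pieces = []
    joiner = ""  # what goes between the previous line and the next one
    for line in lines:
        # Assumption: "\c" occurs only as the continuation escape.
        text, found, _ignored = line.partition("\\c")
        pieces.append(joiner + text)
        if found:
            joiner = ""  # \c suppresses the space or the line break
        else:
            joiner = " " if fill else "\n"
    return "".join(pieces)


print(apply_continuation(["This is a te\\c", "st."]))
print(apply_continuation(["This is a \\c", "test."], fill=False))
```

Both calls reproduce the manual's examples, printing "This is a test." each time. A real implementation would also have to let an intervening break-causing control line cancel the pending continuation.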

Since groff 1.22.4 (December 2018), the groff_man(7) page has explicitly advised the use of \c in certain circumstances. (In groff 1.23.0, much of this guidance migrated to the new groff_man_style(7) page.) We can't escape \c; we've tried. The only alternative is introducing a bunch of new macros that mostly do the same things as existing ones, bloating the man macro language, making it harder to learn, and beginning a transition that we can be sure will never actually end due to 45 years of inertia possessed by the existing macros.

I'd like to know how I can help make this happen.

(If you're curious what brought me here, well, (1) Helge Kreutzmann told me I could find a comprehensive list of tags used by pod2man in po4a's documentation, and (2) I stumbled across this procps-ng commit.)

mquinson commented 1 month ago

Hello Branden,

No need to petition us, we're already convinced :) So far, all my attempts to implement sufficient support for \c have failed, and Helge keeps reporting the issues remaining in my several attempts. See e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036826#85 which is my current TODO list on that topic (you already contributed to that bug report on Debian, by the way, thanks for that).

Your new insight is very welcome here. I will try to look again at our code with your text in mind. The thing is that we don't have any notion of filling that could be enabled or disabled. Most of the time, we don't need such a thing, as we simply try to extract the content strings to a PO file, not render the whole file. Of course there is a catch here: we may need to understand some of it, as we try to ease the life of translators by replacing inline formatting (bold, italic) from the *roff syntax with an arguably easier syntax inspired by the POD format (e.g., B<> and I<>).

If you feel like it, you could grep for \c in our implementation: https://github.com/mquinson/po4a/blob/master/lib/Locale/Po4a/Man.pm Please be patient with us; we never pretended to implement a full groff parser, only to extract/inject some sentences from/into an otherwise unmodified source file...

g-branden-robinson commented 1 month ago

Hi Martin,

Good to hear from you, and excellent to hear back so quickly!

> No need to petition us, we're already convinced :) So far, all my attempts to implement sufficient support for \c have failed, and Helge keeps reporting the issues remaining in my several attempts. See e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036826#85 which is my current TODO list on that topic (you already contributed to that bug report on Debian, by the way, thanks for that).

I see! I had forgotten that exchange, except for being rather firm with Bjarni.

> Your new insight is very welcome here. I will try to look again at our code with your text in mind. The thing is that we don't have any notion of filling that could be enabled or disabled.

Well, you might not need such a notion. I'd like to see how far you can get without it.

> Most of the time, we don't need such a thing, as we simply try to extract the content strings to a PO file, not render the whole file.

Right. One thing I'm curious about is how you decide where the boundaries of a translatable string are. When extracting string literals from a programming language, this might not be too hard--look for double quotes, and understand some things like C- and Perl-style backslash-escape conventions, and also those languages' rules for string literal catenation.

In a man(7) document, things may be a bit more interesting.

> Of course there is a catch here: we may need to understand some of it, as we try to ease the life of translators by replacing inline formatting (bold, italic) from the *roff syntax with an arguably easier syntax inspired by the POD format (e.g., B<> and I<>).

Acknowledged.

> If you feel like it, you could grep for \c in our implementation: https://github.com/mquinson/po4a/blob/master/lib/Locale/Po4a/Man.pm

> Please be patient with us; we never pretended to implement a full groff parser,

No worries. I would not ask or expect you to. While in theory, man pages can leverage the full power of the troff typesetting system (and groff extensions thereto), in practice they limit their composition to a small subset of that language a few nines of the time.

For groff 1.22.4, mandoc(1) maintainer Ingo Schwarze and I collaborated on making the "Portability" subsection of the groff_man(7) page a more useful guide for man(7) authors and a sort of mutually agreed minimal set of groff + man features that groff would recommend for composition, so that mandoc, a non-roff formatter, would, like po4a, not have to take on the gigantic task of interpreting the full language.

Ingo is more of a purist than I am; I feel that man pages should exercise formatter features if necessary to achieve satisfactory typesetting, but when we keep in mind that most man page perusal is on a terminal (or in HTML scraped and converted from terminal output!), the exercise of such features should, most of the time, be safely ignorable by non-typesetters.

For example, in groff's man pages I make frequent recourse to br and ne requests within if requests to avoid stranding single lines of paragraphs at the end of a page. But nothing breaks for mandoc's purposes, nor for po4a's, if these are completely ignored. Neither of your projects cares about where page breaks happen. Only typesetters do.

> only to extract/inject some sentences from/into an otherwise unmodified source file...

Right. I've never in my life played with po4a before, but it sounds like I should give it a whirl on a man page or two to see what it makes of them.


I do have one crazy idea that might be good for a Google Summer of Code or similar project:

It's not a very well known fact that groff supports output in more than one format. And I don't mean PostScript, PDF, or HTML--groff handles all of these the same, writing out a document in a page description language that doesn't have a well accepted name but which I call "grout". It's a descendant of the troff output format described by Kernighan in the Bell Labs CSTR documents #97 and #76 (1992 revision), which I similarly call "trout". Programs called output drivers, like DWB troff's dpost, or groff's grodvi, grolbp, grops, gropdf, and grotty, translate that page description language into another file format or byte stream that a (possibly emulated) hardware device is prepared to consume.

But that's not what I'm talking about. As a language compiler, groff builds lists of "nodes", very much like the abstract syntax tree that is taught in computer science classes. Since day one it has supported not one but two output formats: grout, and "approximate output", which is what you see when you run groff -a on a document. For about 25 years it has also supported "suppressed output", which is sort of a hack that was put in place to surmount certain problems with HTML generation that need not concern us here. The important fact is that there is not a tight coupling between nodes and their rendering.

Here's a description of groff -a output. It closely follows a Unix troff feature.

     -a       Generate a plain text approximation of the typeset output.
              The read‐only register .A is set to 1.  This option
              produces a sort of abstract preview of the formatted
              output.

              •  Page breaks are marked by a phrase in angle brackets;
                 for example, “<beginning of page>”.

              •  Lines are broken where they would be in formatted
                 output.

              •  Vertical motion, apart from that implied by a break, is
                 not represented.

              •  A horizontal motion of any size is represented as one
                 space.  Adjacent horizontal motions are not combined.
                 Supplemental inter‐sentence space (configured by the
                 second argument to the .ss request) is not represented.

              •  A special character is rendered as its identifier
                 between angle brackets; for example, a hyphen appears
                 as “<hy>”.

              The above description should not be considered a
              specification; the details of -a output are subject to
              change.

And here's an example of what that looks like:

$ nroff -a -man ~/ncurses-HEAD/share/man/man3/beep.3ncurses 
<beginning of page>
beep(3NCURSES) Library calls beep(3NCURSES)
NAME 
 beep, flash <-> ring the (visual) bell of the terminal with curses
SYNOPSIS 
 #include <ncursesw/curses.h>
 int beep(void);
 int flash(void);
DESCRIPTION 
 beep and flash alert the terminal user: the former by sounding the termi<hy>
 nal's audible alarm, and the latter by visibly attracting attention. Com<hy>
 monly, a terminal implements a visual bell by momentarily reversing the
 character foreground and background colors on the entire display; even a
 monochrome device can do this. These functions each attempt the other
 alert type if the one requested is unavailable. If neither is available,
 curses performs no action. Nearly all terminals have an audible alert
 mechanism such as a bell or piezoelectric buzzer, but only some can flash
 the screen.
RETURN VALUE 
 These functions return OK on success and ERR on failure.
 In ncurses, beep and flash return OK if the terminal type supports the cor<hy>
 responding capability: bell (bel) for beep and flash_screen (flash) for
 flash. Otherwise they return ERR.
EXTENSIONS 
 In ncurses, these functions can return ERR.
PORTABILITY 
 X/Open Curses, Issue 4 describes these functions. It specifies no error
 conditions for them.
 On SVr4 curses, they always return OK, and X/Open Curses specifies them as
 doing so.
HISTORY 
 beep and flash appeared in SVr2 (1984).
SEE ALSO 
 ncurses(3NCURSES), terminfo(5)
ncurses 6.5 2024-07-20 beep(3NCURSES)

You may anticipate where I'm going with this.

One could write a "pod emitter" output class. Like "approximate" (or "ascii" [sic]) output, its tprint member functions for node types that it didn't support (couldn't represent) would be empty. But one thing a node does know is which font is selected to write the current glyph.

It also knows where the sentence boundaries are (assuming the input was not written to conceal this information), so it could start a new output line when encountering one.

So why parse man or try to guess where the font face changes are when you could have groff tell you, with perfect knowledge?
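To illustrate what such an emitter would have to do with the font information, here is a small Python sketch. It is purely hypothetical: groff is written in C++, no such output class exists today, and the `(text, font)` pairs below are an invented stand-in for what the formatter's nodes would report. Text containing `>` would need escaping in real POD, which this sketch ignores.

```python
from itertools import groupby

# Hypothetical mapping from font names to POD-style tags.
# Roman ("R") text gets no tag at all.
POD_TAG = {"B": "B", "I": "I"}


def runs_to_pod(runs):
    """Turn (text, font) pairs into POD-like inline markup.

    Adjacent pairs set in the same font are merged into one tagged span.
    """
    out = []
    for font, group in groupby(runs, key=lambda run: run[1]):
        text = "".join(piece for piece, _font in group)
        tag = POD_TAG.get(font)
        out.append(f"{tag}<{text}>" if tag else text)
    return "".join(out)


runs = [("beep", "B"), (" and ", "R"), ("flash", "B"),
        (" alert the terminal user", "R")]
print(runs_to_pod(runs))
```

For the sample runs this prints `B<beep> and B<flash> alert the terminal user`, which is the kind of string a translator would then see in the PO file.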

Just wanted to put that idea out there. And it would be another motivator for superseding -a with an argument-taking option, -A I would think, that would give us flexibility to support several such output formats in the future.

mquinson commented 1 month ago

Actually, that'd be more than perfect. Some of the formats handled by po4a go this way: we don't parse the input ourselves, but interact with the relevant tool. We certainly prefer when it goes that way.

I'm wondering: how would the string reinjection work in your idea? We'd provide the translated string back to groff, and it'd write the source file back with the translated content?

That would be great, for sure. Way more robust than our current attempt (which, I must say, works surprisingly well for most existing man pages. It started as a joke and turned out to be quite usable...)

Thanks