Asciidoc: Better message splitting needed

mariobl commented 3 years ago

Hello,

currently I'm working on the translation of man pages written in an Asciidoctor-compatible format of Asciidoc. Although Locale::Po4a::AsciiDoc(3pm) says "Tested successfully on simple AsciiDoc files", it works much better than expected, however, with some constraints... I generate the *.pot file with the following command:

po4a-gettextize -f asciidoc -o compat=asciidoctor -m nsenter.1.adoc -p nsenter.1.pot

The source file contains the following paragraph:

*-a*, *--all*::
  Enter all namespaces of the target process by the default _/proc/[pid]/ns/+++*+++_ namespace paths. The default paths to the target process namespaces may be overwritten by namespace specific options (e.g., *--all --mount*=[_path_]). +
  {nbsp} +
  The user namespace will be ignored if the same as the caller's current user namespace. It prevents a caller that has dropped capabilities from regaining those capabilities via a call to setns(). See *setns*(2) for more details.

The {nbsp} + is needed to get an empty line within an indented part, here an option description. I would actually expect that Po4a splits it in its logical parts, but I get this:

#. type: Plain text
#: nsenter.1.adoc:54
msgid ""
"Enter all namespaces of the target process by the default "
"_/proc/[pid]/ns/+++*+++_ namespace paths. The default paths to the target "
"process namespaces may be overwritten by namespace specific options (e.g., "
"*--all --mount*=[_path_]). + {nbsp} + The user namespace will be ignored if "
"the same as the caller's current user namespace. It prevents a caller that "
"has dropped capabilities from regaining those capabilities via a call to "
"setns(). See *setns*(2) for more details."
msgstr ""

This is annoying, because translators could be confused by the "+ {nbsp} +" – and it becomes more difficult to reuse existing translations based on roff sources. After translating, I get this in the resulting roff file:

Wechselt in alle Namensräume des Zielprozesses mittels der Vorgabe\-Namensräumepfade \fI/proc/[PID]/ns/*\fP. Die Vorgabepfade zum Zielprozessnamensraum können mittels namensraumspezifischer Optionen (z.B. \fB\-\-all \-\-mount\fP=[\fIPfad\fP]) überschrieben werden. + \~ + Die Benutzer\-Namensräume werden ignoriert, falls sie mit dem Namensraum des aktuell Aufrufenden identisch sind. Es verhindert einen Aufrufenden, der Capabilities abgegeben hat, diese Capabilities mit einem Aufruf von setns() wiederzuerlangen. Siehe \fBsetns\fP(2) für weitere Details.

The "man" command renders it with an incorrect formatting:

-a, --all
           Wechselt in alle Namensräume des Zielprozesses mittels der
           Vorgabe-Namensräumepfade /proc/[PID]/ns/*. Die Vorgabepfade zum
           Zielprozessnamensraum können mittels namensraumspezifischer
           Optionen (z.B. --all --mount=[Pfad]) überschrieben werden. +   +
           Die Benutzer-Namensräume werden ignoriert, falls sie mit dem
           Namensraum des aktuell Aufrufenden identisch sind. Es verhindert
           einen Aufrufenden, der Capabilities abgegeben hat, diese
           Capabilities mit einem Aufruf von setns() wiederzuerlangen. Siehe
           setns(2) für weitere Details.

The desired empty line has gone, and I get a + + instead. Unfortunately, there is no other way known to me to get an empty line in indented parts of the source code. Any idea how to solve this problem? Maybe using some definition of parts to be literally taken...?

jnavila commented 3 years ago

Hi,

I can't find the rule that allows what you are doing on https://asciidoc.org/userguide.html or on https://asciidoctor.org/docs/asciidoc-writers-guide/

Either a " +" at end of line is supposed to continue on next line (with preservation of indent), which is basically useless in generic text and is only used for macro definitions. Or the "+" sign is on its own line.

It seems that asciidoc and asciidoctor don't even have the same behaviour with the markup that you have proposed. So my first guess would be that you are using a "non defined" behaviour.

Why not just use the + sign more conventionally at the beginning of the line:

*-a*, *--all*::
  Enter all namespaces of the target process by the default _/proc/[pid]/ns/+++*+++_ namespace paths. The default paths to the target process namespaces may be overwritten by namespace specific options (e.g., *--all --mount*=[_path_]).
+
The user namespace will be ignored if the same as the caller's current user namespace. It prevents a caller that has dropped capabilities from regaining those capabilities via a call to setns(). See *setns*(2) for more details.

po4a does not correctly process " +" at the end of lines, but it does concatenate everything correctly.

jnavila commented 3 years ago

Sorry, I missed it: https://docs.asciidoctor.org/asciidoc/latest/blocks/hard-line-breaks/

mariobl commented 3 years ago

Thanks for the explanation. Admittedly, I don't have much experience with Asciidoc(tor). The *.adoc files were imported by Pandoc, and the appearance of the imported files was similar to the asciidoctor.1 man page, so I thought this was a reference... But obviously it isn't.

After some first tests based on your recommendations, this seems to be the right approach. BTW, the {nbsp} + thing comes from here. Although related to HTML output, it worked also in man pages, but only until Po4a did its job...

jnavila commented 3 years ago

There is definitely a bug in po4a. So the proposed markup is a workaround. If you're ok with it, then go on. Otherwise, I'll try to fix it in a hopefully not too distant future.

jnavila commented 3 years ago

That raises the question of how this type of markup should be processed. As a verbatim block? Or as several segments? The intent of the author is not clear...

mariobl commented 3 years ago

There is definitely a bug in po4a. So the proposed markup is a workaround. If you're ok with it, then go on. Otherwise, I'll try to fix it in a hopefully not too distant future.

I wouldn't treat this as a bug. Po4a doesn't need to ship perfect parsers which cover all the special cases in markup languages. I think tomorrow I will finish the changes and tests in my adoc files, then I will write about the experience and whether anything needs to be fixed.

mariobl commented 3 years ago

OK, I'm finished. Replacing the {nbsp} parts with a single + and adjusting the indentation works so far. Thanks for your proposal.

However, in some cases I was forced to make concessions. Let's take the AUTHORS list as an example:

== AUTHORS

mailto:john.doe@example.com[John Doe] +
mailto:mary.doe@example.com[Mary Doe]

Po4a doesn't recognize the +, puts everything into one gettext message and writes the translated (in this case copied) content back to the Asciidoctor files in one line with a + between the mail addresses. Actually it should split the mail addresses into two separate gettext messages and treat the + as a kind of CR/LF character. As a workaround I've written it as follows:

== AUTHORS

mailto:john.doe@example.com[John Doe],
mailto:mary.doe@example.com[Mary Doe]

This way I get multiple mail addresses in one line, separated by commas, and all in a single gettext message. This is not what I want, but it is acceptable for the time being. Another option would be to add an empty line instead, but it would make such lists longer. That is no problem with a few mail addresses, but it is not very helpful, for example, in tabbed views, which should not contain empty lines for better readability.

Let me define a possible parser rule:

If a line ends with "whitespace, plus, line break", then the content needs to be split at this point. The space and plus shouldn't appear in the gettext message.
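A minimal sketch of that rule in Python, just to make the intended behavior concrete. This is not po4a code; the function name `split_on_hard_breaks` and the exact regular expression are made up for illustration.

```python
import re

# Hard line break as described in the rule above:
# whitespace, a plus sign, then the end of the line.
HARD_BREAK = re.compile(r"[ \t]+\+\n")


def split_on_hard_breaks(paragraph):
    """Split one paragraph into separate messages at each hard line break.

    The space and the plus sign are dropped, so they never reach the
    gettext message. Ordinary (soft) line wraps inside a segment are
    joined with a single space.
    """
    segments = HARD_BREAK.split(paragraph)
    return [" ".join(seg.split()) for seg in segments if seg.strip()]


authors = (
    "mailto:john.doe@example.com[John Doe] +\n"
    "mailto:mary.doe@example.com[Mary Doe]\n"
)
for msgid in split_on_hard_breaks(authors):
    print(msgid)
```

With this rule each mail address would end up in its own gettext message, and a "+" that is not preceded by whitespace would be left alone.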

jnavila commented 3 years ago

I have a problem with this markup. Sometimes, you want to split the lines (just like you did), but sometimes, you don't (for instance when used for verse). Putting the burden on the translators' shoulders is not fair. So, it's quite a catch, without further context...

mariobl commented 3 years ago

After importing the *roff file using Pandoc I got, for example, the following:

*--option* _value_::
  First description line.

The description is indented by two characters. Comparing with asciidoctor.adoc (treating it as reference), it seemed to be OK. But when I tried to add a second description line ...

*--option* _value_::
  First description line.
  +
  Second description line.

... Po4a was confused by the + and recognized it as a normal character, not a special one. So the whole description was treated as one gettext message, the + was written back to the text as is, and I lost the desired empty line. Removing the indentation solved the problem.

But let's go back to the desired behavior in cases where we don't have the empty line. As described above, I would like to have a list of mail addresses without empty lines:

mailto:john.doe@example.com[John Doe] +
mailto:mary.doe@example.com[Mary Doe]

This + after a whitespace at the end of the line causes a normal line break, so Asciidoctor produces the correct *roff output. But with Po4a, I get:

mailto:john.doe@example.com[John Doe] + mailto:mary.doe@example.com[Mary Doe]

in the .pot file and

John Doe <john.doe@example.com> + Mary Doe <mary.doe@example.com>

in the *roff output. This leads me to assume that the Asciidoctor parser of Po4a doesn't recognize the + as a line break. This applies only to line endings, not to a single + at the beginning of an otherwise empty line, which is correctly parsed.

jnavila commented 3 years ago

The correct indenting is:

*--option* _value_::
  First description line.
+
Second description line.

The + line and following paragraph are not indented.

mariobl commented 3 years ago

The correct indenting is:
*--option* _value_::
  First description line.
+
Second description line.
The + line and following paragraph are not indented.

Yes, but the effect is the same as without indentation. The + causes a line break and an empty line, as desired. But my problem is the + characters at the end of a line, which are not correctly parsed by po4a.

mariobl commented 3 years ago

I've tried another version:

----
mailto:john.doe@example.com[John Doe]
mailto:mary.doe@example.com[Mary Doe]
----

This will be correctly written into the .pot file, but in the resulting man page it renders to:

mailto:john.doe@example.com[John Doe]
mailto:mary.doe@example.com[Mary Doe]

Similarly with .... instead of ----. These blocks are useful for code examples, but not for parts containing special formatting which needs to be parsed by Asciidoctor. And even if it worked, all the content would be written into one single gettext message. This is not very helpful for translators, especially in longer lists. Once such a gettext message changes and needs to be updated, it gets harder the longer the list gets.

Putting the burden on the translators shoulders is not fair.

Don't understand what you mean. On the contrary, I try to make it easier for translators by attempting to divide long sections into multiple gettext messages.

jnavila commented 3 years ago

Splitting on each line is not an option if the markup is used to separate lines of a single sentence as described here. In fact, in this latter case, I would expect po4a to create a single segment with formal line-wrapping handled:

Rubies are red,
Topazes are blue.

With plus signs removed but line breaks preserved. When putting it back into the adoc file, the trailing " +" is added at the end of all but the last line.

This could also be easily replaced in the source asciidoc by a verse block.

Obviously, this use case would not fit with yours.
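For the verse case, the round trip could look roughly like the following Python sketch. It is only an illustration of the idea, not what po4a does; the helper names `to_segment` and `to_asciidoc` are hypothetical.

```python
import re


def to_segment(paragraph):
    """AsciiDoc paragraph -> single segment shown to the translator.

    Hard breaks (' +' at end of line) become plain newlines,
    soft line wraps become spaces.
    """
    marker = "\x00"  # temporary placeholder for hard breaks
    text = re.sub(r" \+\n", marker, paragraph.rstrip("\n"))
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.replace(marker, "\n")


def to_asciidoc(segment):
    """Translated segment -> AsciiDoc, with ' +' on all but the last line."""
    return " +\n".join(segment.split("\n")) + "\n"


verse = "Rubies are red, +\nTopazes are blue.\n"
segment = to_segment(verse)
print(segment)
print(to_asciidoc(segment), end="")
```

The translator sees two lines without any plus sign, and writing the segment back restores the original markup.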

eevan78 commented 3 years ago

It would be nice to implement that. I also have issues with translating weechat documentation:

[NOTE]
^(1)^ Name comes from the Debian GNU/Linux distribution, versions and package names may be different in different distributions and versions. +
^(2)^ It is recommended to compile with libncursesw5-dev (the w is important). WeeChat can compile with libncurses5-dev, but it is NOT recommended: you may experience display bugs with wide chars. +
^(3)^ GnuTLS ≥ 3.0.21 is required for IRC SASL authentication with mechanism ECDSA-NIST256P-CHALLENGE.

The following table shows...

It is from the source .adoc file. After po4a processes it, everything ends up in one paragraph, without line breaks after the +.

jnavila commented 3 years ago

As I said the '+' format is difficult to process for po4a because some semantics are missing.

In your case, I think you resorted to some poor man's formatting instead of using asciidoc features. What you want is a numbered list in a note. This can be done with an annotated text block with an ordered list:

[NOTE]
====
. Name comes from the Debian GNU/Linux distribution, versions and package
names may be different in different distributions and versions.
. It is recommended to compile with libncursesw5-dev (the w is
important). WeeChat can compile with libncurses5-dev, but it is NOT recommended:
you may experience display bugs with wide chars.
. GnuTLS ≥ 3.0.21 is required for IRC SASL authentication with mechanism
ECDSA-NIST256P-CHALLENGE.
====

This is semantically more accurate.

eevan78 commented 3 years ago

I understand what you're saying. I'm just translating the source. Unfortunately, I'm not in control of it. I generate the translation and then use Vim's substitute to break the lines after + where needed. Clumsy, but works for these sources. They should be improved for sure.

jnavila commented 3 years ago

You may upstream these changes.

MattBlissett commented 3 years ago

I have an example of + hard line breaks here, where it's not clear what the semantic alternative would be – if there even is one. (AsciiDoctor source, PO4A-made POT, broken result.)

Plenty of other AsciiDoctor syntax makes its way through to translation, so I don't see a problem keeping the + — it just needs the line break that follows to be preserved. That would also be appropriate when it's used for formatting kludges.

jnavila commented 3 years ago

I tend to split markup signs between block level and inline (think "div" and "span" in HTML). The policy applied by po4a is to break up text at block level, but keep the inline markup in the segments. It is up to the translation tool to analyse the inline markup and flag/protect it.

The problem is that the "+" sign is a "linebreak" which is in neither category, being merely a rendering work around and this makes it multipurpose and not manageable with existing logic.

So, your proposal would be: if a "+\n" appears in a paragraph which was not verbatim until now, then

switch the paragraph to verbatim, do not sub-segment it
remove the lonely "\n"s but keep the "+\n"s
let the translator manage the line breaks (+) in the segment.

Would it match most usages?
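To make the proposal concrete, here is a rough Python sketch of the "keep the +\n, drop the lonely \n" step. It is an illustration under the assumptions stated in the list above, not actual po4a code, and `to_verbatim_segment` is a made-up name.

```python
import re


def to_verbatim_segment(paragraph):
    """Join soft-wrapped lines but keep every '+\\n' hard break as is."""
    body = paragraph.rstrip("\n")
    # A newline is "lonely" when the character before it is not a plus sign.
    return re.sub(r"(?<!\+)\n", " ", body)


source = "This text should be +\non two\nlines.\n"
print(to_verbatim_segment(source))
```

The translator would then see the "+" followed by a real line break in the segment and would be responsible for keeping it in the translation.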

MattBlissett commented 3 years ago

Thanks for the quick response.

What you describe was exactly my suggestion, although maybe it's not necessary to have +\n but only \n in the PO output. \n is then changed to +\n when generating the translated AsciiDoc.

I don't have a strong opinion either way.

smoe commented 2 years ago

I just also ran into this issue while translating the LinuxCNC asciidoc documentation with po4a version 0.66. To me it boils down to the question of whether a forced line break (" +\n" in asciidoc) should be allowed in a text or not. I think it should be. I also agree that authors tend not to use the semantics that asciidoc provides for what they then solve with forced newlines, and certainly these issues will be addressed over time as asciidoc skills evolve in the community, but that will take more time than we want the translations to take.

The text

This text should be +
on two lines.

will be presented by po4a as

#. type: Plain text
#: a.adoc:2
msgid "This text should be + on two lines."
msgstr ""

which not only misses the line break but introduces an ugly "+" in the text:

This text should be + on two lines.
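The effect is the same as if the paragraph were treated as ordinary wrappable text, where every run of whitespace (including the newline) collapses into one space. A small Python sketch of that effect, which only mimics the observed result and is not taken from the po4a sources (`naive_wrap_join` is a hypothetical name):

```python
def naive_wrap_join(paragraph):
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(paragraph.split())


print(naive_wrap_join("This text should be +\non two lines.\n"))
# prints: This text should be + on two lines.
```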

Many thanks!

silopolis commented 2 years ago

Did a small round trip test too and it appears to make forced line breaks (' +' at end of line) unusable in AsciiDoc 😭

test.adoc `This sentence is separated from the following by a '+' line break. + This sentence is separated from the preceding by a '+' line break.

This long forced wrapped sentence is separated from the following by a '+' line break. + This long forced wrapped sentence is separated from the preceding by a '+' line break.`

test_fr.adoc `Cette phrase est séparée de la suivante par un retour à la ligne forcé "+". + Cette phrase est séparée de la précédente par un retour à la ligne forcé "+".

Cette longue phrase à largeur fixe est séparée de la suivante par un retour à la ligne forcé "+". + Cette longue phrase à largeur fixe est séparée de la précédente par un retour à la ligne forcé "+".`

Using the --wrap-po newlines option, the POT file looks just like the source file, which is nice. But the PO files remain wrapped at col 77, which looks like a bug to me !? Having this fixed could make it usable. Best would be that po4a honors newline character after ' +' for AsciiDoc documents, whatever the choice for the '--wrap-po' option.

jnavila commented 2 years ago

Hi everyone,

I (finally) created PR #362 which aims at making the asciidoc work with newlines. If it is still useful to you, can you try this branch on your docs and report possible failures?

Thanks in advance.

silopolis commented 2 years ago

Hello Jean-Noël,

On Tue, 28 June 2022 at 18:49, Jean-Noël Avila @.***> wrote:

Hi everyone,

I (finally) created PR #362 https://github.com/mquinson/po4a/pull/362 which aims at making the asciidoc work with newlines. If it is still useful to you, can you try this branch on your docs and reports possible failures?

This is very good news! LinuxCNC docs have a bunch of files that should be good for testing this quite thoroughly... Can't tell exactly when I'll be able to test this but we will for sure.

Thanks a ton for working on this 🙏👍

Have a nice evening

All the best, Jé

petterreinholdtsen commented 2 years ago

I tested the new patch in https://github.com/mquinson/po4a/commit/f851b918fa4ecac714e0471b39ad56bbc93b0334 on the linuxcnc asciidoc collection, and it improves the situation a lot. But I wonder, is the string matching correct? The space in front of the + character is not included in the part replaced by a newline in the POT file.

I would expect "something +\nsomething else" to be presented as "something\nsomething else" in the POT file, while at the moment it is presented as "something \nsomething else". Note the space preceding the newline.
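The difference can be reproduced outside po4a with two plain substitutions. This is a Python sketch for illustration only; the two patterns are my reading of the behavior described above, not a copy of the expressions in AsciiDoc.pm.

```python
import re

source = "something +\nsomething else"

# Matching only '+\n' leaves the space that preceded the plus sign.
without_space = re.sub(r"\+\n", "\n", source)

# Matching ' +\n' removes the space together with the plus sign.
with_space = re.sub(r" \+\n", "\n", source)

print(repr(without_space))  # 'something \nsomething else'
print(repr(with_space))     # 'something\nsomething else'
```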

jnavila commented 2 years ago

Ah, you're right, the correct syntax as noted here is "a space followed by + and a carriage return". Need another change...

petterreinholdtsen commented 2 years ago

[Jean-Noël Avila]

Ah, you're right, the correct syntax as noted here is "a space followed by + and a carriage return". Need another change...

Perhaps something like this?

diff --git a/lib/Locale/Po4a/AsciiDoc.pm b/lib/Locale/Po4a/AsciiDoc.pm
index ba4e61b2..b4139762 100644
--- a/lib/Locale/Po4a/AsciiDoc.pm
+++ b/lib/Locale/Po4a/AsciiDoc.pm
@@ -376,12 +376,12 @@ BEGIN {
 sub translate {
     my ( $self, $str, $ref, $type ) = @_;
     my (%options) = @_;
-    if ( ($options{'wrap'}==1) && ($str =~ /\+\n/) ) {
+    if ( ($options{'wrap'}==1) && ($str =~ / \+\n/) ) {
         $options{'wrap'} = 0;
-        $str =~ s/([^+])\n/$1 /g;
-       $str =~ s/\+\n/\n/g;
+        $str =~ s/([^ ])\+\n/$1 /g;
+       $str =~ s/ \+\n/\n/g;
         $str = $self->SUPER::translate( $str, $ref, $type, %options);
-       $str =~ s/\n/+\n/g;
+       $str =~ s/ \n/+\n/g;
        $options{'wrap'} = 1;
     } else {
         $str = $self->SUPER::translate( $str, $ref, $type, %options );
diff --git a/t/fmt/asciidoc/LineBreak.adoc b/t/fmt/asciidoc/LineBreak.adoc
index 33541200..d4dc0c14 100644
--- a/t/fmt/asciidoc/LineBreak.adoc
+++ b/t/fmt/asciidoc/LineBreak.adoc
@@ -1,22 +1,22 @@
 = Linebreaks

 This is a paragraph with a linebreak,
-where the linebreak occurs on second line+
+where the linebreak occurs on second line +
 The second part of the paragraph is on
 a second line.

-* We can do the same with a list item+
+* We can do the same with a list item +
 After the linebreak, the bullet point is continued.
 * In this bullet point, the list item
-is continued on the next line.+
+is continued on the next line. +
 Then the line is broken in a second part.

 ..........................
-This is a Literal block.+
+This is a Literal block. +
 This is a Literal block.
 ..........................

 __________________________
-This is a Quote block.+
+This is a Quote block. +
 This is a Quote block.
 __________________________
diff --git a/t/fmt/asciidoc/LineBreak.norm b/t/fmt/asciidoc/LineBreak.norm
index 3af9cbde..8c0c168c 100644
--- a/t/fmt/asciidoc/LineBreak.norm
+++ b/t/fmt/asciidoc/LineBreak.norm
@@ -1,19 +1,19 @@
 = Linebreaks

-This is a paragraph with a linebreak, where the linebreak occurs on second line+
+This is a paragraph with a linebreak, where the linebreak occurs on second line +
 The second part of the paragraph is on a second line.

-* We can do the same with a list item+
+* We can do the same with a list item +
   After the linebreak, the bullet point is continued.
-* In this bullet point, the list item is continued on the next line.+
+* In this bullet point, the list item is continued on the next line. +
   Then the line is broken in a second part.

 ..........................
-This is a Literal block.+
+This is a Literal block. +
 This is a Literal block.
 ..........................

 __________________________
-This is a Quote block.+
+This is a Quote block. +
 This is a Quote block.
 __________________________
diff --git a/t/fmt/asciidoc/LineBreak.trans b/t/fmt/asciidoc/LineBreak.trans
index 734758ff..aa0608ca 100644
--- a/t/fmt/asciidoc/LineBreak.trans
+++ b/t/fmt/asciidoc/LineBreak.trans
@@ -1,19 +1,19 @@
 = LINEBREAKS

-THIS IS A PARAGRAPH WITH A LINEBREAK, WHERE THE LINEBREAK OCCURS ON SECOND LINE+
+THIS IS A PARAGRAPH WITH A LINEBREAK, WHERE THE LINEBREAK OCCURS ON SECOND LINE +
 THE SECOND PART OF THE PARAGRAPH IS ON A SECOND LINE.

-* WE CAN DO THE SAME WITH A LIST ITEM+
+* WE CAN DO THE SAME WITH A LIST ITEM +
   AFTER THE LINEBREAK, THE BULLET POINT IS CONTINUED.
-* IN THIS BULLET POINT, THE LIST ITEM IS CONTINUED ON THE NEXT LINE.+
+* IN THIS BULLET POINT, THE LIST ITEM IS CONTINUED ON THE NEXT LINE. +
   THEN THE LINE IS BROKEN IN A SECOND PART.

 ..........................
-THIS IS A LITERAL BLOCK.+
+THIS IS A LITERAL BLOCK. +
 THIS IS A LITERAL BLOCK.
 ..........................

 __________________________
-THIS IS A QUOTE BLOCK.+
+THIS IS A QUOTE BLOCK. +
 THIS IS A QUOTE BLOCK.
 __________________________

-- Happy hacking Petter Reinholdtsen

mquinson commented 2 years ago

@petterreinholdtsen could you please provide the patch as a separate file? I seem to remember that you need to name the file "something.txt" for github to accept it, or something. Also, maybe we need to reopen the bug if it's not done ?

jnavila commented 2 years ago

@petterreinholdtsen @mquinson the fix 12ddfd7b8f902782acb13d8063b06f4174b5f1d2 was already merged.

mquinson / po4a

Asciidoc: Better message splitting needed #299