mquinson / po4a

Maintain the translations of your documentation with ease (PO for anything)
http://po4a.org/
GNU General Public License v2.0

Fenced divs should be either "verbatim" (as now) or "translate" (new behavior) #381

Open joelnitta opened 1 year ago

joelnitta commented 1 year ago

There seem to be some possibly related issues (#291, #357, #359), but I couldn't find anything describing exactly what I'm encountering, so I am filing a new one.

I am translating markdown with input like this (let's call this file test-long-line.md):

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor

Inline instructor notes

Note that the line with many colons ending with instructor is a pandoc fenced div. It needs to remain on one line and should not be translated.

I generate the PO file with po4a-updatepo -f text -m test-long-line.md -p test-long-line.po -o markdown --wrap-po newlines, then edit it to look as follows (call this test-long-line.po):

# SOME DESCRIPTIVE TITLE
# Copyright (C) YEAR Free Software Foundation, Inc.
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2022-08-24 02:44+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

#. type: Plain text
#: test-long-line.md:2
#, markdown-text
msgid ":::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: instructor"
msgstr ""

#. type: Plain text
#: test-long-line.md:3
#, markdown-text
msgid "Inline instructor notes"
msgstr "インストラクター用メモ"

When I translate from the PO file, the instructor part gets put on a new line, even though I want to avoid this behavior.

Command:

po4a-translate -f text -m test-long-line.md -p test-long-line.po -l test-long-line.ja.md -o markdown -k 0 --width 1000 --wrap-po newlines

Output:

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
instructor

インストラクター用メモ

I also tried with nobullets as suggested in #359, but that did not work.

Thanks!

po4a dev version (4cc0afd96fbb4f2d6674f8259a7a9f7e900942d8) running in docker container joelnitta/po4a:latest

mquinson commented 1 year ago

I didn't know that fenced code blocks could also be written with colons (:). So far, only backquotes (`) and tildes (~) were supported. I think the following patch solves your issue, but I'd like to have a full example to integrate into the test suite before I commit it to our git.

--- a/lib/Locale/Po4a/Text.pm
+++ b/lib/Locale/Po4a/Text.pm
@@ -714,7 +714,7 @@ sub parse_markdown {
         $self->pushline( $line . "\n" );
         $paragraph        = "";
         $end_of_paragraph = 1;
-    } elsif ( $line =~ /^([ ]{0,3})(([~`])\3{2,})(\s*)([^`]*)\s*$/ ) {
+    } elsif ( $line =~ /^([ ]{0,3})(([~`:])\3{2,})(\s*)([^`]*)\s*$/ ) {
         my $fence_space_before  = $1;
         my $fence               = $2;
         my $fencechar           = $3;
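
The effect of the changed character class can be checked outside po4a. The following is only a minimal sketch, written in Python rather than Perl: the two pattern strings are copied from the diff above, while the names `OLD_FENCE`, `NEW_FENCE` and `describe` are made up for this illustration and do not exist in po4a.

```python
import re

# Patterns copied from the diff; only the character class differs.
OLD_FENCE = re.compile(r"^([ ]{0,3})(([~`])\3{2,})(\s*)([^`]*)\s*$")
NEW_FENCE = re.compile(r"^([ ]{0,3})(([~`:])\3{2,})(\s*)([^`]*)\s*$")


def describe(pattern, line):
    """Return (fence, info string) if the line looks like a fence, else None."""
    m = pattern.match(line)
    if m is None:
        return None
    return m.group(2), m.group(5)


# A colon fence similar to the one in the report (the exact length is arbitrary).
line = ":" * 20 + " instructor"

print(describe(OLD_FENCE, line))  # None: colons are not a fence character
print(describe(NEW_FENCE, line))  # the colon fence plus the info string "instructor"
```

Note that the pattern only recognizes a single fence line. It says nothing about how an opening fence is paired with its closing fence.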

mquinson commented 1 year ago

@joelnitta it would be really good if you could propose an extension to https://raw.githubusercontent.com/mquinson/po4a/master/t/fmt/txt-markdown/PandocFencedCodeBlocks.md that tests "your" variant of fenced blocks, please. Just tell me what text chunk should be added, and I'll integrate it properly in our test suite.

joelnitta commented 1 year ago

Thanks @mquinson! I hadn't thought of this as a fenced code block, but rather as a markdown version of HTML divs (as described in the pandoc manual). But I suppose they are similar. The one thing that may differ is that pandoc fenced_divs can be nested, and I don't know if that applies to code blocks. So po4a would need to be able to account for that (again, my work-around was going to be to just not translate them, but if they were actually recognized and handled appropriately that would be even better).

I think borrowing from the pandoc manual should be fine for testing. Here are two examples.

First one is non-nested.

::::: {#special .sidebar}
Here is a paragraph.

And another.
:::::

Second one is nested.

::: Warning ::::::
This is a warning.

::: Danger
This is a warning within a warning.
:::
::::::::::::::::::
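
The nesting in the second example can be tracked with a simple stack. The sketch below assumes the rule given in the pandoc manual: an opening fence is three or more colons followed by attributes (optionally trailed by more colons), a fence without attributes is always a closing fence, and the number of colons in the closing fence need not match the opening one. The names `OPEN_RE`, `CLOSE_RE` and `div_depths` are hypothetical and are not taken from po4a or pandoc.

```python
import re

# Opening fence: colons, then attributes that do not start with a colon,
# optionally followed by another run of colons.
OPEN_RE = re.compile(r"^:{3,}\s*([^\s:].*?)\s*:*\s*$")
# Closing fence: a line made of three or more colons and nothing else.
CLOSE_RE = re.compile(r"^:{3,}\s*$")


def div_depths(lines):
    """Return a list of (depth, kind, line); kind is 'open', 'close' or 'text'."""
    stack = []  # attributes of the divs that are currently open
    out = []
    for line in lines:
        if CLOSE_RE.match(line) and stack:
            stack.pop()
            out.append((len(stack), "close", line))
            continue
        m = OPEN_RE.match(line)
        if m:
            out.append((len(stack), "open", line))
            stack.append(m.group(1))
        else:
            out.append((len(stack), "text", line))
    return out


EXAMPLE = [
    "::: Warning ::::::",
    "This is a warning.",
    "",
    "::: Danger",
    "This is a warning within a warning.",
    ":::",
    "::::::::::::::::::",
]

for depth, kind, line in div_depths(EXAMPLE):
    print(depth, kind, repr(line))
```

Because the closing fence is not required to match the opening fence in length, pairing fences by length (as is done for fenced code blocks) is not enough here. Each closing fence simply closes the innermost open div.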

mquinson commented 1 year ago

Ok, I think it's fixed now. The fact that it can be nested made the patch more complex than I thought. Thanks for reporting.

joelnitta commented 1 year ago

Thanks @mquinson for your help with this.

Sorry to make this request after you have already closed the issue, but I hope you might consider some other ways to handle this situation.

The problem with this approach IMHO is that if there is a large amount of content within a fenced div, it all shows up as a single msgid. I think smaller msgids (generally one markdown paragraph at a time) are preferable. Also, this means that the translator may have to deal with more raw code (e.g., linebreaks (\n)) that would otherwise not show up in the PO file.

For my project I plan to crowdsource the translation part (i.e. the localization), so I want translators to be exposed to a minimum amount of code.
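
The difference between the two ways of cutting a fenced div into translation units can be illustrated with a small sketch. It is not how po4a works internally: the function `segment`, its `mode` values (named after the "verbatim" and "translate" behaviors in the issue title) and the sample text are assumptions made for this illustration, and the sketch ignores headings and code blocks.

```python
import re

FENCE_RE = re.compile(r"^:{3,}")  # any line starting with three or more colons


def segment(text, mode):
    """Split markdown containing one outer fenced div into translation units.

    mode="verbatim":  everything inside the outermost div is a single unit.
    mode="translate": fence lines are left out; each paragraph is one unit.
    """
    lines = text.splitlines()
    if mode == "verbatim":
        fences = [i for i, line in enumerate(lines) if FENCE_RE.match(line)]
        inner = lines[fences[0] + 1 : fences[-1]]
        return ["\n".join(inner) + "\n"]
    units, para = [], []
    for line in lines:
        if FENCE_RE.match(line) or not line.strip():
            # A fence or a blank line ends the current paragraph.
            if para:
                units.append(" ".join(para))
                para = []
        else:
            para.append(line)
    if para:
        units.append(" ".join(para))
    return units


SAMPLE = """\
::::::::: challenge

What is the output of this command?

:::::: solution

You can add a line with at least three colons and a solution tag.

::::::

:::::::::
"""

print(segment(SAMPLE, "verbatim"))   # one long unit with embedded newlines and fences
print(segment(SAMPLE, "translate"))  # two short units, one per paragraph
```

In the "translate" variant the translator only sees plain sentences. In the "verbatim" variant the single unit carries the nested fence lines and every line break.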

This is an example of what happens using the current approach.

Original text:

::::::::::::::::::::::::::::::::::::: challenge 

## Challenge 1: Can you do it?

What is the output of this command?

```r
paste("This", "new", "lesson", "looks", "good")
```

:::::::::::::::::::::::: solution

## Output

```output
[1] "This new lesson looks good"
```

:::::::::::::::::::::::::::::::::

## Challenge 2: how do you nest solutions within challenge blocks?

:::::::::::::::::::::::: solution

You can add a line with at least three colons and a solution tag.

:::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::


PO file (header excluded):

#. type: Fenced div block (challenge )
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:29
#, no-wrap
msgid ""
"\n"
"## Challenge 1: Can you do it?\n"
"\n"
"What is the output of this command?\n"
"\n"
"```r\n"
"paste(\"This\", \"new\", \"lesson\", \"looks\", \"good\")\n"
"```\n"
"\n"
":::::::::::::::::::::::: solution \n"
"\n"
"## Output\n"
" \n"
"```output\n"
"[1] \"This new lesson looks good\"\n"
"```\n"
"\n"
":::::::::::::::::::::::::::::::::\n"
"\n"
"## Challenge 2: how do you nest solutions within challenge blocks?\n"
"\n"
":::::::::::::::::::::::: solution \n"
"\n"
"You can add a line with at least three colons and a solution tag.\n"
"\n"
":::::::::::::::::::::::::::::::::\n"
"\n"
msgstr ""


For comparison, this is the PO file generated before the patch:

#. type: Plain text
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:30
#, markdown-text
msgid "::::::::::::::::::::::::::::::::::::: challenge"
msgstr ""

#. type: Title
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:31
#, markdown-text, no-wrap
msgid "Challenge 1: Can you do it?"
msgstr ""

#. type: Plain text
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:34
#, markdown-text
msgid "What is the output of this command?"
msgstr ""

#. type: Fenced code block (r)
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:35
#, no-wrap
msgid "paste(\"This\", \"new\", \"lesson\", \"looks\", \"good\")\n"
msgstr ""

#. type: Plain text
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:40
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:52
#, markdown-text
msgid ":::::::::::::::::::::::: solution"
msgstr ""

#. type: Title
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:41
#, markdown-text, no-wrap
msgid "Output"
msgstr ""

#. type: Fenced code block (output)
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:43
#, no-wrap
msgid "[1] \"This new lesson looks good\"\n"
msgstr ""

#. type: Plain text
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:48
#: /409952fb95fbb825992593fca10961ea_1/test.Rmd:56
#, markdown-text
msgid ":::::::::::::::::::::::::::::::::"
msgstr ""



I think having more `msgid` blocks will be significantly easier for translators.

mquinson commented 1 year ago

Ok, then. Let's reopen this bug. What we need is an option to switch between fenced-div=verbatim (as I did) and fenced-div=translate (as you propose).

I still think that we need both, because the translate behavior may lead to some subtle difficulties when a nested div is inlined. In that case, the translators may want to change the location of the nested div in the enclosing sentence.

joelnitta commented 1 year ago

Thanks!

A few ideas... in the latter case (fenced-div=translate), if the fenced div line will show up as a msgid, perhaps include a translator note saying that it does not need to be translated? Another option may be my original work-around of not including fenced divs in the PO file at all (possibly related to #77).

joelnitta commented 1 year ago

@mquinson just checking in... is there anything I can do to help with this? (without knowing Perl... sorry...)

This would be a great feature to have, especially because of the heavy use of fenced divs by Quarto, which is rapidly gaining popularity as a cross-language authoring system.

joelnitta commented 1 year ago

Hi @mquinson, unless I'm missing something obvious, I think this should be re-opened because the current implementation does not provide an option to treat fenced divs as either "verbatim" or "translate".

As mentioned above, the current implementation results in unnecessary markdown formatting (especially line breaks, \n) showing up in the PO file.

Thanks!

mquinson commented 1 year ago

I forgot everything about this issue since then, sorry. Feel free to reopen it if it's appropriate, then.

joelnitta commented 1 year ago

Thanks for the re-open. Please let me know if there's anything I can clarify.

joelnitta commented 1 year ago

Actually, I'll go ahead and clarify a bit now:

The ideal behavior would be for the parsed text in the PO file to account for all markdown formatting between fenced divs (detection of type: Title ##, etc.) as well as the divs themselves. But if that is too difficult, an option to ignore fenced divs, as a work-around so that any markdown formatting between them still gets properly detected, would be OK too.