mquinson / po4a

Maintain the translations of your documentation with ease (PO for anything)
http://po4a.org/
GNU General Public License v2.0

AsciiDoc exclude cross-references and anchors ids from the *.po(t) translation file #400

Open benoitrolland opened 1 year ago

benoitrolland commented 1 year ago

First, thank you so much for the great job of including AsciiDoc in po4a, which allows translation from AsciiDoc files. Reading the Locale::Po4a::AsciiDoc documentation, I do not see how to exclude anchor IDs, cross-reference IDs and custom IDs from the generated *.po(t) translation file. I believe this might prevent effective automatic translation of AsciiDoc files using po4a. Please find the details here

mquinson commented 1 year ago

Hello @benoitrolland, thanks for the feedback.

Could you please provide me with an example file showing what you want to exclude? I'm not an expert in all the formats handled by po4a and often struggle to generate such test cases myself.

Thanks,

jnavila commented 1 year ago

Hi @benoitrolland

Referring to your points, first please understand that cross-references are inline formatting and, as such, they are not modified in the string to translate. Take for example the following text:

See the <<URLS,GIT URLS>> section below for more information on specifying repositories.

We want to keep the cross-reference and how it is formatted. The PO format is not powerful enough (and was not designed) to handle attributes in inline text. It is up to the translator to understand this formatting and make the translation follow the same pattern. In some languages the translation may completely change how the sentence is written and where the cross-reference sits.

As for your points:

Hope this clarifies your issues.

benoitrolland commented 1 year ago

@mquinson wrote:

Hello @benoitrolland, thanks for the feedback.

Could you please provide me with an example file showing what you want to exclude? I'm not an expert in all the formats handled by po4a and often struggle to generate such test cases myself.

Thanks,

From an AsciiDoc file containing:

    [[ill-sketches-intro,Sketches]]
    [NOTE,icon=texte-introduction.svg]
    .Sketches
    ====
    . <<img-Things-various>>
    ====

(...)

    [[img-Things-various, Various things]]
    .<<ill-sketches-intro,Sketches: >><<img-Things-various>>
    [caption=""]
    image::intro/07ThingsVarious.jpg[img-Things-various,180,100,float="left",align="center"]

Using po4a version 0.69,

the generated *.pot file content is:

    #. type: Block title
    #: bookname.adoc.pp:343
    #, no-wrap
    msgid "<<ill-sketches-intro,Sketches: >><<img-Things-various>>"
    msgstr ""

    #. type: Positional ($1) AttributeList argument for macro 'image'
    #: bookname.adoc.pp:345
    #, no-wrap
    msgid "img-Things-various"
    msgstr ""

    #. type: Target for macro image
    #: bookname.adoc.pp:345
    #, no-wrap
    msgid "intro/07ThingsVarious.jpg"
    msgstr ""

    #. type: Block title
    #: bookname.adoc.pp:2469
    #: bookname.adoc.pp:2483
    #: bookname.adoc.pp:2492
    #: bookname.adoc.pp:2517
    #: bookname.adoc.pp:2545
    #: bookname.adoc.pp:2576
    #, no-wrap
    msgid "Sketches"
    msgstr ""

    #. type: delimited block =
    #: bookname.adoc.pp:2472
    msgid "<<img-Things-various>>"
    msgstr ""

Following the available Locale::Po4a::AsciiDoc.3pm documentation, I could declare my image macro in the AsciiDoc source so that it is not translated, like this: //po4a: macro image[]

The po4a generated *.pot file now contains:

    #. type: Block title
    #: bookname.adoc.pp:343
    #, no-wrap
    msgid "<<ill-sketches-intro,Sketches: >><<img-Things-various>>"
    msgstr ""

    #. type: Block title
    #: bookname.adoc.pp:2469
    #: bookname.adoc.pp:2483
    #: bookname.adoc.pp:2492
    #: bookname.adoc.pp:2517
    #: bookname.adoc.pp:2545
    #: bookname.adoc.pp:2576
    #, no-wrap
    msgid "Sketches"
    msgstr ""

    #. type: delimited block =
    #: bookname.adoc.pp:2472
    msgid "<<img-Things-various>>"
    msgstr ""

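But would you know how to make the *.po translation file:

  • take into account "Various things", since only the reference key "img-Things-various" seems to be a candidate for translation (ref: [[img-Things-various, Various things]]);
  • only take into account the string value ("Sketches: ") of <<ill-sketches-intro,Sketches: >>;
  • ignore key-only references like <<img-Things-various>>?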

Simply put: how can cross-references, anchor IDs and custom IDs be excluded from the *.po(t) translation file that po4a generates?
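As an illustration of the request, here is a minimal Python sketch of a pre-filter that could run on the extracted *.pot text before it is handed to a machine-translation engine. It is not a po4a feature: the function names `is_key_only` and `drop_key_only_entries` are hypothetical, only single-line msgids are recognized, and what po4a does later with entries that were never translated is outside the scope of the sketch.

```python
import re

# An AsciiDoc cross-reference: <<id>> or <<id,label>>.
XREF = re.compile(r"<<([^<>]*)>>")
# A single-line msgid as written in the .pot excerpts above.
MSGID = re.compile(r'^\s*msgid "(.*)"\s*$', re.MULTILINE)


def is_key_only(msgid):
    """True when msgid holds nothing but label-less cross-references."""
    refs = XREF.findall(msgid)
    leftover = XREF.sub("", msgid).strip()
    return bool(refs) and leftover == "" and all("," not in r for r in refs)


def drop_key_only_entries(pot_text):
    """Return pot_text without the entries that contain only key-only references."""
    entries = re.split(r"\n\s*\n", pot_text.strip())
    kept = []
    for entry in entries:
        match = MSGID.search(entry)
        if match and is_key_only(match.group(1)):
            continue  # nothing translatable: only <<id>> references
        kept.append(entry)
    return "\n\n".join(kept) + "\n"
```

With the entries shown above, `<<img-Things-various>>` would be dropped, while `Sketches` and `<<ill-sketches-intro,Sketches: >><<img-Things-various>>` would be kept, because the latter still carries the translatable label "Sketches: ".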

benoitrolland commented 1 year ago

@jnavila wrote:

Hi @benoitrolland

Referring to your points, first please understand that cross-references are inline formatting and, as such, they are not modified in the string to translate. Take for example the following text:

See the <<URLS,GIT URLS>> section below for more information on specifying repositories.

We want to keep the cross-reference and how it is formatted. The PO format is not powerful enough (and was not designed) to handle attributes in inline text. It is up to the translator to understand this formatting and make the translation follow the same pattern. In some languages the translation may completely change how the sentence is written and where the cross-reference sits.

As for your points:

  • take into account "Various things", since only the reference key "img-Things-various" seems to be a candidate for translation. ref: [[img-Things-various, Various things]]
    Anchor names are not translated by default. When internationalizing an AsciiDoc source, the anchors in the text should be made formal and should not rely on strings that get translated.
  • only take into account the string value ("Sketches: ") of <<ill-sketches-intro,Sketches: >>
    This is a corner case of the more general case where the cross-reference appears in the middle of a sentence and requires context to be translated.
  • ignore key-only references like: <<img-Things-various>>
    I don't think the proposed string refers to the key-only reference; rather, it is the string of the picture description. As already said, cross-references are passed "as is".

Hope this clarifies your issues.

Thank you for reading my case. I understand, but this seems to disqualify po4a/AsciiDoc as a candidate for fully automated translation ...

jnavila commented 1 year ago

Sorry to read that.

One point I'd like to mention though: you won't find any translation tool that can "exclude cross-references and anchors ids as well as custom-ids", because the translation tool needs this information to correctly guide the translator in converting the original anchors and cross-references into the translated ones.
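To make the point concrete, here is a small Python sketch of the kind of check this information enables: verifying that a translated string still carries the same cross-reference targets as the original, wherever the translator moved them. The helper names `xref_targets` and `same_xrefs` are hypothetical and this is not how po4a itself validates translations.

```python
import re
from collections import Counter

# Captures the target id of <<id>> or <<id,label>>; the label is ignored.
XREF = re.compile(r"<<([^<>,]+)(?:,[^<>]*)?>>")


def xref_targets(text):
    """Count each cross-reference target found in text."""
    return Counter(target.strip() for target in XREF.findall(text))


def same_xrefs(msgid, msgstr):
    """True when both strings reference the same targets the same number of times."""
    return xref_targets(msgid) == xref_targets(msgstr)
```

For example, a French translation that keeps `<<URLS,...>>` with a translated label passes the check, while one that renames the target to `<<URL,...>>` does not.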

benoitrolland commented 1 year ago

Yes, but maybe a scenario where isolated text is first translated automatically and then reviewed would help AsciiDoc/po4a adapt to new challenges in automation, not to mention that text references are often used for isolated text. Besides, some tools like deepl.com are smart enough not to translate elements like <<img-Things-various>>. Maybe they could evolve to translate "Sketches: " when it appears within Asciidoctor elements like <<ill-sketches-intro,Sketches: >><<img-Things-various>>.
The remaining problem in that case is when a string like "Various things" is not reported at all in the .pot/.po file (as in [[img-Things-various, Various things]]).

jnavila commented 1 year ago

@benoitrolland wrote:

The remaining problem in that case is when string like "Various Things" is not reported at all in the .pot/.po file (like in [[img-Things-various, Various things]])

This seems to be a bug, indeed. I'll look into it.

mquinson commented 1 year ago

Hello,

I'm a bit late to the party here, but I wanted to mention that the XML module of po4a has a notion of placeholder, where a specific tag and all its content can be hidden by po4a and replaced with <place attr="thetagtoprotect" id="0"> to ensure that (1) it won't bother the translators, (2) the translators will not try to translate it when they should not, and (3) they won't break the content formatting. Indeed, po4a then checks that the translated string still contains the placeholders it expects when reinjecting the content.

Maybe something similar could be done here. For example, some text <<ill-sketches-intro,Sketches: >> blah <<img-Things-various>> could result in the following PO chunk:

    msgid "some text [PLACEHOLDER 1] blah [PLACEHOLDER 2]"
    msgstr ""

    msgid "Sketches"
    msgstr ""

Also, without the surrounding text (i.e. for the content <<ill-sketches-intro,Sketches: >><<img-Things-various>>), we could avoid generating a msgid containing only placeholders and leave it out of the PO file.

I did not dig into the code, but I think that all this could be possible. If it does not make the code too ugly, that'd probably be a good set of improvements, don't you think? But my main concern here is that @benoitrolland was speaking of a fully automated translation process. If some robot changes PLACEHOLDER to e.g. ESPACE RÉSERVÉ (the French for placeholder), then the whole process would fail. I'm not sure how to "collaborate" with a fully automated translation system here.

Btw, if [...] does not sound very asciidoc-ish, do not hesitate to use another markup.
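The placeholder idea described in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not po4a code: the `[PLACEHOLDER n]` token follows the example above, the helper names (`protect`, `placeholders_intact`, `restore`, `only_placeholders`) are hypothetical, and the sketch does not extract the label "Sketches: " into a msgid of its own.

```python
import re

# An AsciiDoc cross-reference: <<id>> or <<id,label>>.
XREF = re.compile(r"<<[^<>]*>>")
TOKEN = "[PLACEHOLDER {}]"


def protect(text):
    """Mask each cross-reference; return the masked text and the originals."""
    originals = []

    def mask(match):
        originals.append(match.group(0))
        return TOKEN.format(len(originals))

    return XREF.sub(mask, text), originals


def placeholders_intact(translated, count):
    """True when placeholders 1..count each appear exactly once."""
    return all(translated.count(TOKEN.format(i)) == 1 for i in range(1, count + 1))


def restore(translated, originals):
    """Put the cross-references back, refusing strings with damaged placeholders."""
    if not placeholders_intact(translated, len(originals)):
        raise ValueError("a placeholder was lost, duplicated, or altered")
    for i, original in enumerate(originals, start=1):
        translated = translated.replace(TOKEN.format(i), original)
    return translated


def only_placeholders(masked):
    """True when the masked string holds nothing but placeholders and whitespace."""
    return re.sub(r"\[PLACEHOLDER \d+\]", "", masked).strip() == ""
```

A string for which `only_placeholders` is true could be skipped instead of being written to the PO file, and `placeholders_intact` would catch the case where a robot turned PLACEHOLDER into ESPACE RÉSERVÉ.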

jnavila commented 1 year ago

Replacing AsciiDoc markup with another markup that no translation tool recognizes is useless. The replacement markup must be natively handled by PO. To this end, we can use placeholders that belong to the programming languages natively handled by gettext and that do not interfere with AsciiDoc's own markup. For instance:

I haven't had a look at how common translation applications and automatic translation tools handle these tags to select the most supported tag system.

mquinson commented 1 year ago

You are perfectly right. I personally tend to prefer the <placeholder id="1"> version, because I find it more explicit, but both of your proposals could do the trick. We could also add a comment to the msgid telling manual translators not to change these strings.

silopolis commented 1 year ago

As far as I know, placeholder syntax is configurable in TMSs. At least, this is the case in Weblate, Crowdin and Transifex. So having a clear placeholder syntax that is easily matched by a pattern in these tools would be an awesome improvement! 🤩
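As a rough idea of what "easily matched by a pattern" could mean, the Python sketch below matches the <placeholder id="N"> form mentioned earlier in the thread with one regular expression. The syntax is only the one proposed in this discussion, not something po4a's AsciiDoc module emits, the function names are hypothetical, and how each TMS expects such a pattern to be declared is not shown.

```python
import re

# Pattern for the <placeholder id="N"> form discussed above (an assumption).
PLACEHOLDER = re.compile(r'<placeholder id="(\d+)">')


def placeholder_ids(text):
    """Return the sorted numeric ids of all placeholders found in text."""
    return sorted(int(n) for n in PLACEHOLDER.findall(text))


def same_placeholders(source, target):
    """True when source and target contain the same placeholder ids."""
    return placeholder_ids(source) == placeholder_ids(target)
```

Because the ids are compared as a sorted list, a translation may reorder the placeholders but may not drop, duplicate, or rewrite them.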

mquinson commented 1 year ago

@silopolis do you have a link or two to describe what's existing in these TMS? Thx

silopolis commented 1 year ago

On Sat, Apr 15, 2023 at 22:16, @mquinson wrote:

do you have a link or two to describe what's existing in these TMS? Thx

Sure :)

For Crowdin:

For Transifex:

For Weblate:

Hope that helps

TY J