po4a-gettextize when titles and bodies have the same string

petterreinholdtsen commented 2 years ago

When identical strings appear in several places in a document, in different contexts, po4a-gettextize objects and claims something is wrong. Here is an example:

% cat > simple.adoc
= text1
test1
% cat > simple_es.adoc
= trans1
trans1
% po4a-gettextize -f AsciiDoc -M UTF-8 -L UTF-8 -m simple.adoc -l simple_es.adoc 
po4a gettextization: Structure disparity between original and translated files:
msgid (at simple.adoc:1) is of type 'Title =' while
msgstr (at simple_es.adoc:1 simple_es.adoc:2) is of type 'Plain text'.
Original text: text1
Translated text: trans1
(result so far dumped to gettextization.failed.po)
The gettextization failed (once again). Don't give up, gettextizing is a subtle art, 
but this is only needed once to convert a project to the gorgeous luxus offered by 
po4a to translators.
Please refer to the po4a(7) documentation, the section "HOWTO convert a pre-existing 
translation to po4a?" contains several hints to help you in your task
%

This can happen, for example, if a section title and an image caption contain the same string. Is there a good way to handle this during migration to po4a?

petterreinholdtsen commented 2 years ago

[Marco Ciampa]

You just inserted a \n here... if you want it to be on multiple lines you have to escape newlines

po4a-gettextize -f AsciiDoc -M UTF-8 -L UTF-8 -m a_en.adoc -l \ a_es.adoc

This was a cut-and-paste error, since removed from the GitHub issue. The command line below that one is the problematic one and the one relevant to this issue.

See <URL: https://github.com/mquinson/po4a/issues/334 > for the updated description. -- Happy hacking Petter Reinholdtsen

mquinson commented 2 years ago

What I did when such an error occurred to me was to remove the offending parts in both the master and local doc. That's suboptimal but it works.

petterreinholdtsen commented 2 years ago

[Martin Quinson]

What I did when such an error occurred to me was to remove the offending parts in both the master and local doc. That's suboptimal but it works.

Right. Could po4a-gettextize be taught to accept a list of contexts for a given string, if the string is used in several contexts? It is the only way I can think of to handle such a case. -- Happy hacking Petter Reinholdtsen

mquinson commented 2 years ago

[Petter Reinholdtsen]

Right. Could po4a-gettextize be taught to accept a list of contexts for a given string, if the string is used in several contexts? It is the only way I can think of to handle such a case.

This is exactly what should be done in the future, and that's why I didn't close the issue while providing the crude workaround that I have.

jnavila commented 2 years ago

Right. Could po4a-gettextize be taught to accept a list of contexts for a given string, if the string is used in several contexts? It is the only way I can think of to handle such a case.

Would you expect the same string to be translated differently depending on its type (hence needing the type as context), or would you prefer that the occurrences be considered identical when running po4a-gettextize? Or should po4a-gettextize be smarter and check whether the strings are already translated differently?

petterreinholdtsen commented 2 years ago

[Jean-Noël Avila]

Would you expect the same string to be translated differently depending on its type (hence needing the type as context), or would you prefer that the occurrences be considered identical when running po4a-gettextize? Or should po4a-gettextize be smarter and check whether the strings are already translated differently?

Given that po4a-gettextize ends up with only fuzzy strings that need to be manually checked anyway, I would be perfectly OK with them being considered identical.

-- Happy hacking Petter Reinholdtsen

mquinson commented 2 years ago

Hello,

I think that this shows a flaw in the logic of gettextization, and I added an option --keep-temps to the script to help understand it.

The main algorithm is to build a POT from the master doc, another POT from the localized doc, and then iterate over both POT files in sequence, matching the msgids together while checking that the type of each entry matches. It works very well when each string appears in exactly one msgid, but it fails in case of duplicates.

Assume this master file:

# Hello

## Hello

sample paragraph.

Localized file:

# HELLO

## HELLO VERSION 2

SAMPLE PARAGRAPH

This gives this master POT file (notice how the first entry represents 2 strings of the document):

#. type: Title ##
#: A.md:1 A.md:3
#, markdown-text, no-wrap
msgid "Hello"
msgstr ""

#. type: Plain text
#: A.md:5
#, markdown-text
msgid "sample paragraph."
msgstr ""

and that localized POT file:

#. type: Title #
#: B.md:1
#, markdown-text, no-wrap
msgid "HELLO"
msgstr ""

#. type: Title ##
#: B.md:3
#, markdown-text, no-wrap
msgid "HELLO VERSION 2"
msgstr ""

#. type: Plain text
#: B.md:5
#, markdown-text
msgid "SAMPLE PARAGRAPH"
msgstr ""

That gettextization naturally fails, with the following message:

po4a gettextization: Structure disparity between original and translated files:
msgid (at A.md:1 A.md:3) is of type 'Title ##' while
msgstr (at B.md:1) is of type 'Title #'.
Original text: Hello
Translated text: HELLO
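
The failure can be reproduced with a toy model of the pairing described above. This is only a sketch, not po4a's code: `build_pot` and `pair_entries` are hypothetical names, and the merged entry keeps the type seen last purely so that the toy matches the master POT shown above.

```python
def build_pot(strings):
    """Collect (text, type, ref) triples into entries, merging identical texts.

    Assumption: a merged entry keeps the type seen last, only so that this
    toy reproduces the master POT from the example.
    """
    entries = {}  # dicts preserve insertion order
    for text, typ, ref in strings:
        if text in entries:
            entries[text]["refs"].append(ref)
            entries[text]["type"] = typ
        else:
            entries[text] = {"text": text, "type": typ, "refs": [ref]}
    return list(entries.values())


def pair_entries(master, localized):
    """Walk both entry lists in sequence, checking that the types agree."""
    pairs = []
    for orig, trans in zip(master, localized):
        if orig["type"] != trans["type"]:
            raise ValueError(
                "msgid (at %s) is of type '%s' while msgstr (at %s) is of type '%s'"
                % (" ".join(orig["refs"]), orig["type"],
                   " ".join(trans["refs"]), trans["type"])
            )
        pairs.append((orig["text"], trans["text"]))
    return pairs


MASTER = [
    ("Hello", "Title #", "A.md:1"),
    ("Hello", "Title ##", "A.md:3"),
    ("sample paragraph.", "Plain text", "A.md:5"),
]
LOCALIZED = [
    ("HELLO", "Title #", "B.md:1"),
    ("HELLO VERSION 2", "Title ##", "B.md:3"),
    ("SAMPLE PARAGRAPH", "Plain text", "B.md:5"),
]

try:
    pair_entries(build_pot(MASTER), build_pot(LOCALIZED))
except ValueError as err:
    print("Structure disparity:", err)
```

Because the two "Hello" titles collapse into one master entry, the master list has two entries while the localized list has three, so the very first pair already disagrees on its type.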

mquinson commented 2 years ago

The solution would be to not merge the entries in the POT, even if it creates invalid POT files.

The first step in the code in that regard is https://github.com/mquinson/po4a/blob/master/lib/Locale/Po4a/TransTractor.pm#L1012 where TransTractor.pm pushes a new string to the output POT file because the format module called translate() on that string.

This then goes to https://github.com/mquinson/po4a/blob/master/lib/Locale/Po4a/Po.pm#L1387 where the PO file notices that this entry is defined twice, and tries to react.

We should instruct Po.pm to not merge the entries as it does now, but change the string (maybe adding a "_dup" at its end so that the hash table still works, and so that the fuzzying thing of gettext will merge the strings afterward) and push it with no further modification.

An extra difficulty will be to pass to Po.pm the parameter "hey, we are doing a gettextization, not a usual thing, so please do not merge the PO entries" along the path since that call stack goes through the formats calling translate(). Maybe that option could be a global option of the PO file, set somewhere around https://github.com/mquinson/po4a/blob/master/po4a-gettextize#L398
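
To make the idea concrete, here is a rough sketch of the "_dup" approach on a toy catalog. It is written in Python rather than Perl and does not reflect the actual Po.pm code: `push_entry` and its `gettextizing` flag are hypothetical stand-ins for the push routine and for the global option mentioned above.

```python
def push_entry(catalog, msgid, typ, ref, gettextizing=False):
    """Add one string to a toy catalog keyed by msgid and return the key used.

    In normal mode a repeated msgid is merged into the existing entry.
    With the hypothetical gettextizing flag, "_dup" is appended to the
    string until the key is unused, so that no entries are merged.
    """
    key = msgid
    if gettextizing:
        while key in catalog:
            key += "_dup"
    if key in catalog:
        # Usual behaviour: same msgid, one entry, several references.
        catalog[key]["refs"].append(ref)
    else:
        catalog[key] = {"type": typ, "refs": [ref]}
    return key


catalog = {}
push_entry(catalog, "Hello", "Title #", "A.md:1", gettextizing=True)
push_entry(catalog, "Hello", "Title ##", "A.md:3", gettextizing=True)
print(list(catalog))
```

With the flag set, each occurrence keeps its own type and reference, so a sequential pairing against the localized entries would see one entry per string of the document.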

If someone wants to dig into it, please be my guest, I won't have any time for it in the near future.

Thanks in advance,

smoe commented 2 years ago

We observed this frequently in the asciidoc documentation when this involved images. Those appear in some section with some title, may have a caption, and an alt text. A longer document may also reuse the same wording as a subsubsection header a couple of hundred lines below to present some more details on what was already outlined above.

silopolis commented 2 years ago

On Tue, 10 May 2022 at 12:28, Jérémie Tarot @.***> wrote:

Will try to test this and report...

I don't know exactly how conclusive this is, but the result is that the following files do not raise any issue unless you uncomment the "BUG" lines.

test.adoc

= This is a text

.This is a text(((This is a text)))
image::common/images/emc2-intro.png["This is a text"]

== This is a text

This is a text::
// BUG  This is a text

This is a text

* This is a text
**  This is a text

test_es.adoc

= Este es un texto

.Este es un texto(((Este es un texto)))
image::common/images/emc2-intro.png["Este es un texto"]

== Este es un texto

Este es un texto::
// BUG  Este es un texto

Este es un texto

* Este es un texto
**  Este es un texto

test_fr.adoc

= Ceci est un texte

.Ceci est un texte(((Ceci est un texte)))
image::common/images/emc2-intro.png["Ceci est un texte"]

== Ceci est un texte

Ceci est un texte::
// BUG  Ceci est un texte

Ceci est un texte

* Ceci est un texte
**  Ceci est un texte

Actually this was a test, esto fue una prueba, ceci était un test ! ;-)

Hope it helps

silopolis commented 1 year ago

Hi,

On Tue, 10 May 2022 at 11:15, Steffen Möller @.***> wrote:

We observed this frequently in the asciidoc documentation when this involved images. Those appear in some section with some title, may have a caption, and an alt text. A longer document may also reuse the same wording as a subsubsection header a couple of hundred lines below to present some more details on what was already outlined above.

To add to Steffen's comment (we work on the same project), it seems to me that po4a:

- is veeery picky (should be promoted as an AsciiDoc linter ;-) ), so strings should be perfectly identical, case included
- doesn't like alt parameters in image block macros
- doesn't like the use of the inline image: macro where the block image:: one should be used (our fault, admittedly)

When all the above is OK, it does seem to do a good job, and I could re-enable all the block titles Steffen had to comment out before and use the same strings in block titles, image alt text (but only using the image::<URL>["<ALT_TEXT>",...] form), and index entries.

The only case that I think has still bitten me a couple of times is when the string is also used in a section title. Will try to test this and report...

TY J
