Implement suppression POT files to remove strings from translations (like pot_in, but more user-friendly)

mquinson commented 6 years ago

Initially reported on Alioth by Michael Terry (03/05/2009):

It would be nice if I could mark some paragraphs as untranslatable.

For example, say I have a docbook file that lists a number of file paths
as part of the documentation. These paths (like ~/.cache) show up to 
be translated, but I don't want them to be.

Maybe some format-specific comment that po4a understands? 
In the docbook example, the paragraph being preceded by 
<!-- po4a: ignore --> or something.

Comment by Denis Barbier (29/07/2010):

You are right, that would be nice. The parser can cope with 
comments only if they are embedded within a paragraph, 
which means that in XML one has to write 
<para><!-- Hey translator -->Foo bar</para> 
to have this comment printed into PO files.
One could indeed define special comments, e.g. 
<para><!-- po4a: ignore -->Foo bar</para> to have a paragraph 
copied verbatim into translated documents. Note that these comments 
will not be removed from translated documents, I believe that this is fine.

You can test by editing Locale/Po4a/Xml.pm like this: replace
if ($translate ne "")
by
if ($translate ne "" && !(defined($comments) && $comments =~ m/^\s*po4a:\s*ignore/s))
I will commit this change unless you report problems with it.
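
For readers who do not use Perl, the condition above can be restated as a small standalone sketch. This is only an illustration of the check, not po4a code: the function name `should_translate` and its arguments are made up for the example, and only the regular expression is taken from the comment above.

```python
import re

# Same pattern as in the Perl condition: optional leading whitespace,
# "po4a:", optional whitespace, then "ignore", anchored at the start
# of the comment text.
IGNORE_MARKER = re.compile(r"^\s*po4a:\s*ignore", re.DOTALL)


def should_translate(text, comments=None):
    """Return True when a paragraph should be sent to the PO file.

    Hypothetical helper: `text` is the paragraph content and `comments`
    is the text of the comment embedded in the paragraph, if any.
    """
    if text == "":
        return False
    if comments is not None and IGNORE_MARKER.search(comments):
        return False
    return True


print(should_translate("Foo bar", " Hey translator "))  # regular comment
print(should_translate("Foo bar", " po4a: ignore "))    # ignore marker
```

With this check, a paragraph carrying an ordinary translator comment is still extracted, while one whose comment starts with `po4a: ignore` is copied verbatim.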

Comment by Michael Terry (29/07/2010):

That would be fine. I'd also be interested in similar syntax for man pages too.

Comment by me (30/03/2017):

Just for the record, this feature is now implemented in the man module 
(see section "Hiding text from po4a" from Locale::Po4a::Man(3pm))

Comment by me (30/03/2017):

Just to be sure: in which modules would you need this feature, Michael?

Implementing it in Sgml may be particularly challenging so I'd prefer 
not to do that if you don't really need it. Is it enough if it works in Man?

Thanks for your patience,

osamuaoki commented 6 years ago

I agree that adding a specific solution with a narrow scope may complicate the code without much benefit. That's not a good thing. But I think we can do better by adding a generic feature cleanly, without including such translation exclusion code within po4a. (There is such a need; see https://bugs.debian.org/607726 . As written there, the -o option may have some answer for XML, but it wasn't easy for me to implement.)

What po4a should offer is an independent way to specify 2 variants of the original English document in po4a.cfg: one to make the POT file, and another to make the translated text with the help of the PO file.

Both of these should be generated by the external program.

This approach allows us to include a lot of untranslatable content in many places. This is how I manage to include the auto-generated statistical data in Debian Reference, with a convoluted hand-written Makefile. If po4a supported this kind of feature, I could clean up my Makefile :-)

For XML source, we can write an XSLT filter to exclude specific tag contents such as ... for use as the input for the POT. The final translated document can be generated from the PO and the final English document with tag contents such as ... .

For non-XML source, we can use CPP preprocessor directives to enable similar things by pre-processing.

This approach should be non-invasive and clean, I think...

osamuaoki commented 6 years ago

This is follow up to my post yesterday.

As for implementing 2 English base input files, specifying this in the current po4a/po4a.cfg syntax isn't trivial and would be very confusing.

I think the most reasonable approach is to create optional entries in po4a/po4a.cfg to set up custom prefilter programs:

[pot_prefilter]: optional entry to set up a prefilter for the input source text -> source text fed into the "po4a-gettextize -m" option input file (POT generation base file)
[translation_prefilter]: optional entry to set up a prefilter for the input source text -> source text fed into the "po4a-translate -m" option input file (Translation file generation base file)

This approach should be compatible with existing syntax while adding very generic flexibility to po4a infrastructure.

osamuaoki commented 6 years ago

Hmmm.. maybe adding option to po4a command for these prefilters may be even better.

mquinson commented 6 years ago

I like this idea of pre-filtering the input document before extracting the POT file. I think that this is a very appealing approach to solve this problem. Any help (or even better, patch) going in that direction would be really appreciated.

Thanks for the insight.

mquinson commented 4 years ago

Hello there. Actually, there is a preliminary implementation already in po4a :)

If you specify the pot_in for a given document, this is the file used to build the POT and PO files. We have an example in t-02-addendums/book-potin.conf (that I plan to rewrite as I do for all tests currently):

[po4a_langs] ja
[po4a_paths] tmp/book.pot ja:t-02-addendums/book.po.ja

[type:docbook] t-02-addendums/book-auto.xml \
        pot_in:t-02-addendums/book.xml \
        ja:tmp/book-auto.ja.xml \
        add_ja:t-02-addendums/book.addendum1 \
        opt:"-k 0 -o nodefault=\"<bookinfo> <author>\" \
                  -o break=\"<bookinfo> <author>\" \
                  -o untranslated=\"<bookinfo>\" \
                  -o translated=\"<author>\""

We have:

--- t-02-addendums/book-auto.xml        2020-04-09 00:23:24.801047067 +0200
+++ t-02-addendums/book.xml     2020-04-09 00:23:24.801047067 +0200
@@ -59,11 +59,6 @@
   </totalfake>
 </bogustag>
 </chapter>
-<chapter><title>Title: Auto add text</title>
-<para>
-This is to emulate auto added non-translated content.
-</para>
-</chapter>
 <appendix><title>Title: Optional Appendix</title>
 <para>
 Appendixes are optional.

As a result, these strings are not added to the pot, so their translation is not found in the po, so they remain unchanged. So it ... works.
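
The effect described above can be illustrated with a toy model. This is a sketch only and does not reflect po4a internals: the catalog is a plain dictionary, `translate_paragraphs` is a made-up helper, and the translated string is a placeholder rather than a real translation.

```python
def translate_paragraphs(paragraphs, po):
    """Translate each paragraph through the catalog.

    A paragraph with no (or an empty) entry in `po` is kept unchanged,
    which is what happens to content that was filtered out of pot_in.
    """
    return [po.get(p) or p for p in paragraphs]


# Hypothetical catalog built from the filtered (pot_in) document:
# the auto-added chapter never reached the POT, so it has no entry.
po = {"Appendixes are optional.": "<translation of the appendix sentence>"}

# Paragraphs of the full master document (book-auto.xml in the example).
master = [
    "This is to emulate auto added non-translated content.",
    "Appendixes are optional.",
]

for line in translate_paragraphs(master, po):
    print(line)
```

The first paragraph comes out untouched because no msgid exists for it, and the second is replaced by its catalog entry.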

But this is very cumbersome, because one has to implement the filtering externally, which kinda goes against the whole spirit of the po4a binary as opposed to the po4a-* tools.

I'd prefer to have a filter, as @osamuaoki proposed. I still need to think of how to express such a filter in the config file.

osamuaoki commented 4 years ago

Hi

On Sat, Apr 18, 2020 at 02:10:12PM -0700, Martin Quinson wrote:

> Hello there. Actually, there is a preliminary implementation already in po4a :)
>
> If you specify the pot_in for a given document, this is the file used to build the POT and PO files. We have an example in t-02-addendums/book-potin.conf (that I plan to rewrite as I do for all tests currently): ... But this is very cumbersome, because one has to implement the filtering externally, which kinda goes against the whole spirit of the po4a binary as opposed to the po4a-* tools.

I see.

> I'd prefer to have a filter, as @osamuaoki proposed. I still need to think of how to express such a filter in the config file.

It took me a while to understand what exactly you are talking about. I hope I understood it correctly.

I proposed pot_in as a feature enhancement to po4a to allow an equivalent process with po4a alone as documented in POD as below.

Special case with specifying B<pot_in>:

<- source files ->|<--------- build results ----------------->

master document --+--------------------------+
                  :                          |
external          :           filtered       |
filtering ========X..>        master         |
program                       document       |
                                 |           |
                                 V           +--> translations
old PO files ----------+--> updated PO files +
     ^                           |
     |                           V
     +<..........................+
    (the updated PO files are manually
     copied to the source of the next
     release while manually updating
     the translation contents)

This was meant to be the simplest example demonstrating the use of "pot_in".

FYI: Currently, I use po4a-* tools embedded in a Makefile.

Let's consider cases.

Case 1. debian-reference


I haven't migrated to the new po4a yet ;-)  But, my Makefile for
debian-reference does as follows using po4a-*:

|| <---------------------------------------- source files ->|<--------- build results ---------------------------->
||
|| non-XML                                   +-----> master document ------------------+--> English XML --+--> HTML
|| master document template -+    external --+        (master)   XML                   |                  |
||                           +--> merging                                              |                  +--> PDF
|| supplemental data --------+    program  --+                                         |
|| non-XML  ^                    to generate +----> filtered master document           |
||          |                    XML files           (pot_in)    |       XML           V
||          |                                                    |(pot)                +--> translations -+--> HTML
||          |                                                    V                     ^                  |
||    generation script                   old PO files ----------+--> updated PO files +             XML  +--> PDF
||    wget/sed/...                             ^ (po)                 (po) |
||                                             |                           V
||                                             +<..........................+
||                                           (the updated PO files are manually
||                                            copied to the source of the next
||                                            release while manually updating
||                                            the translation contents)

For this case, both (master) and (pot_in) should be generated at the
same time by the external merging program even after migrating to po4a.

With (pot_in) feature, I can migrate to po4a.

Case 2. XML attribute support with XSLT

As discussed in: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=607726#21

The basic idea is to use tags with attribute in the original XML file. ... like:

... ... ...

This provides much finer-grained control over which parts are translated than selecting by tags. If we choose the attribute name from the ones defined by DocBook, maybe we can use "role" as:

... ... ...

Most likely, such an XML file with extra attributes can be used as (master) without modification to generate PDF etc. If you wish to make an XML file without the extra attributes, removing such attributes can be done by a trivial XSLT script run under the xsltproc command.

As for getting the (pot_in) filtered master document XML, removing the tag sections marked with such an attribute from the original XML is again a trivial XSLT script run under the xsltproc command.

Of course, an XSLT script based solution used in conjunction with (master) and (pot_in) is very versatile, though the use of XSLT scripts may be a bit cumbersome for many po4a users.

Case 3. Fine grained control support of NOTRANSLATE attribute in po4a

Instead of selecting by tags, fine-grained control through a NOTRANSLATE attribute can be added to the po4a tool chain as a new feature.

For XML, I think adding such a feature is relatively simple.

For other formats such as POD or MAN, we can implement a similar feature by preceding each translation section with a carefully selected comment line etc.

Unlike

* the tag based approach as you indicated
* the (master) and (pot_in) approach in case 1 and 2 as I mentioned,

this approach should be much easier to figure out for many po4a users.

Regards,

Osamu

mquinson commented 4 years ago

Hello @osamuaoki, thanks for the detailed answer.

I must however confess that I'm a bit lost here. You speak of the pot_in feature as something that would be desirable, but it's already implemented, right? I just pushed some tests to ensure that it will continue to work in the future.

So, maybe you mean that this bug can be closed because the filtering thing that I was suggesting is less useful? If so, I agree. I changed my mind in the meanwhile, and I think that it is much easier to keep the filtering out of the po4a program, that is already rather complex. I don't think that we can find a solution that fits all needs to specify the filtering command line in the po4a.conf, so I take it back: pot_in is sufficient from my point of view, and we could close this issue.

What would be needed from your point of view to close this?

Thanks for your help, Mt.

mquinson commented 4 years ago

Hello @osamuaoki, could you please help me understand what remains to be done before closing this issue?

Thanks in advance,

erciccione commented 3 years ago

I have some paragraphs in a text file (markdown) that I don't want to have translated since they mostly contain code. Ideally I would have a pot file with some paragraphs marked as "not to translate" that would be ignored during conversions, so as to keep them in English in the translated file.

I've been looking for ways to achieve that, but it's hard to find a solution. I now found this issue but it's still not clear to me if it's now possible to mark some paragraphs as not for translation. Is there currently a native way to achieve this? Is there a workaround I'm missing?

mquinson commented 3 years ago

Hello @erciccione, sorry for the delay.

Did you see https://po4a.org/man/man1/po4a.1.php#lbAN in the documentation?

If you've read the doc and it's not sufficient, could you please elaborate on your question? The idea is to produce a filtered file where the content you want to hide is removed. This filtered file should be used as pot_in.

Maybe your question is about how to produce that filtered file removing the content you want to hide? Well, this is not in the field of po4a: you have to filter it on your side, to produce the file that will be used as pot_in in po4a.

I'm not quite sure of how I'd do this for text files. In markdown, I'd use specific markers in comments to indicate the beginning and end of each area to hide, and then I'd come up with a small crude Perl script to do the actual filtering.
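
A minimal sketch of such a marker-based filter is given below, written in Python rather than Perl. The two marker strings are hypothetical: po4a does not know about them, and any pair of comment lines that cannot occur in normal content would do.

```python
BEGIN = "<!-- notranslate:begin -->"  # hypothetical marker, not known to po4a
END = "<!-- notranslate:end -->"      # hypothetical marker, not known to po4a


def filter_hidden(lines):
    """Yield the lines that sit outside BEGIN/END marker pairs.

    The marker lines themselves are dropped as well, so the filtered
    document contains neither the hidden area nor the markers.
    """
    hidden = False
    for line in lines:
        stripped = line.strip()
        if stripped == BEGIN:
            hidden = True
        elif stripped == END:
            hidden = False
        elif not hidden:
            yield line


sample = """Intro paragraph.
<!-- notranslate:begin -->
    ls ~/.cache
<!-- notranslate:end -->
Closing paragraph.
""".splitlines(keepends=True)

filtered = "".join(filter_hidden(sample))
print(filtered, end="")
```

The filtered text would then be written to a separate file and referenced with `pot_in:` in the po4a configuration, while the unfiltered file stays the master document.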

jnavila commented 3 years ago

I don't think a prefilter is a correct solution. If I understand correctly, po4a would not see the content tagged as not to be translated when generating the pot files, because it would simply be removed from the original content beforehand. But when po4a blends in the translations, the removed parts would need to be present and they would be counted as not translated, thus defeating the translation statistics of the file and the threshold logic.

mquinson commented 3 years ago

Well, that's the currently implemented solution :) What would you propose as a replacement?

mquinson commented 3 years ago

Just to be sure we are on the same page here, @jnavila: filtering has been implemented and integrated into po4a for several years already. If you want to update it to make it easier for the users, be my guest, but that's already working. There are even some tests.

One thing we could do is to improve Po.pm so that it does not count missing entries as untranslated. That should be rather easy to implement, but it could have bad side effects for people using the po4a-* subscripts in the wrong order. That's a drawback with which I could live, probably.
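
The difference between the two ways of counting can be shown with a toy model. This sketch is an assumption about how such statistics could be computed, not a description of what Po.pm does: the catalog is a plain dictionary and `percent_translated` is a made-up function.

```python
def percent_translated(msgids, po, count_missing=True):
    """Percentage of msgids that have a non-empty translation in `po`.

    With count_missing=False, msgids that have no entry at all in the
    catalog are left out of the total instead of counting as untranslated.
    """
    total = 0
    translated = 0
    for msgid in msgids:
        if msgid not in po:
            if count_missing:
                total += 1
            continue
        total += 1
        if po[msgid]:
            translated += 1
    return 100.0 * translated / total if total else 100.0


# Four strings in the master document; the last one was hidden through
# pot_in, so the PO file has no entry for it.
master = ["one", "two", "three", "~/.cache"]
po = {"one": "t1", "two": "t2", "three": "t3"}

print(percent_translated(master, po, count_missing=True))   # 75.0
print(percent_translated(master, po, count_missing=False))  # 100.0
```

In this model a fully translated document drops to 75% when the hidden string is counted, which would fall below a keep threshold of, say, 80; leaving missing entries out of the total brings it back to 100%.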

jnavila commented 3 years ago

OK. Thank you for clearing up what's done and what could be enhanced. I cannot commit on changes right now.

osamuaoki commented 3 years ago

As far as functional features are concerned, I think this is a done deal. Line matching rules for addenda can now be created more intuitively, too.

As for ease of use for end users, we may need documentation on XML filtering by attribute, with an example XSLT+Makefile, since these are nontrivial for most people.

So let's rename this issue 77.

mquinson commented 10 months ago

Hello,

reading again the logs of this issue, I come to the conclusion that although the feature is implemented and documented, it is still very cumbersome to use. I very much like the idea of @erciccione: a suppression POT file, that is, a POT file whose msgids get automatically marked as "not to translate". I think that it would be much easier to manage for the users, as you just have to check your (usual) POT to search for the entries that shouldn't be there, and copy/paste them unchanged to your suppression file to have them automatically removed. We could even probably warn about unused entries in the suppression file to ease the maintenance of this file (probably, because I'm not sure about split settings which could get in the way).

Internally, that shouldn't be too complex to implement, a bit like the po4a-gettextize internal behavior: after building the pot file from the master documents, just before writing it to disk, you load the suppression file in a new PO object, and then iterate over the entries of that PO object to remove those msgids from the master POT files.
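
The removal step and the warning about unused entries could look roughly like the sketch below. It is written under simplifying assumptions: both catalogs are reduced to plain Python containers keyed by msgid, no real PO parsing is done, and `apply_suppression` is a hypothetical name, not an existing po4a function.

```python
def apply_suppression(pot, suppression):
    """Drop suppressed msgids from a POT catalog.

    `pot` maps each msgid to whatever data is attached to the entry;
    `suppression` is an iterable of msgids copied from the POT.
    Returns (filtered_pot, unused), where `unused` lists the suppression
    entries that matched nothing and may deserve a warning.
    """
    suppressed = set(suppression)
    filtered = {m: data for m, data in pot.items() if m not in suppressed}
    unused = sorted(m for m in suppressed if m not in pot)
    return filtered, unused


# Hypothetical master POT content: msgid -> source reference.
pot = {
    "Appendixes are optional.": "book.xml:64",
    "~/.cache": "book.xml:12",
}
suppression = ["~/.cache", "~/.config"]

filtered, unused = apply_suppression(pot, suppression)
print(sorted(filtered))
for msgid in unused:
    print("warning: unused suppression entry:", msgid)
```

Here `~/.cache` is removed from the POT before it is written, and `~/.config` is reported because it no longer matches any string in the master documents.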

Unfortunately, I'm not sure I'll have time to implement this before releasing the long overdue v0.70, so I'm writing this to (1) confirm with you guys that this new feature would be the right answer to your need and (2) remember about it the next time that I find some time for po4a.

osamuaoki commented 8 months ago

Since my target is XML, filtering by XML tag is easy. I basically use po4a in 2 stages: once on the filtered XML to create the template for the PO file, and a second time with the original XML to produce the final result. But for markdown, this strategy doesn't work.

I agree that creating a blocking POT file is a reasonable idea to address this need in a data-source neutral way.

msguniq-like filtering is all you need to implement.

mquinson / po4a

Implement suppression POT files to remove strings from translations (like pot_in, but more user-friendly) #77