jason-fox / fox.jason.translate.xliff

DITA-OT plug-in to create, auto-translate and re-merge XLIFF files, generating translated documentation in a targeted foreign language.
https://jason-fox.github.io/dita-ot-plugins/translate.xliff
Apache License 2.0

Questions around @translate #5

Open jason-fox opened 2 years ago

jason-fox commented 2 years ago

Three questions received from @xephon2

Does it take the @translate attribute into account? Are there any limitations on complex file structures? What about profiling attributes? I could not find any info about this in the docs.

jason-fox commented 2 years ago

The Plug-in takes the following block level elements:

The id is calculated using https://en.wikipedia.org/wiki/Fletcher's_checksum based on the text within the block elements defined above.
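
As a rough illustration of how a checksum-based id could be derived from block text, here is a minimal Python sketch of Fletcher-16. Which Fletcher variant the plug-in uses, and whether it normalizes whitespace first, are assumptions made for this example; `block_id` is a hypothetical helper, not a function from the plug-in.

```python
def fletcher16(text: str) -> int:
    """Return the Fletcher-16 checksum of the UTF-8 bytes of text."""
    sum1 = 0
    sum2 = 0
    for byte in text.encode("utf-8"):
        sum1 = (sum1 + byte) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1


def block_id(block_text: str) -> int:
    # Assumption: whitespace is collapsed before hashing, so that
    # re-indenting the DITA source does not change the id.
    normalized = " ".join(block_text.split())
    return fletcher16(normalized)
```

Under this sketch an empty block yields 0, and two blocks with identical text share an id.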

The block element itself will probably contain inline elements like span or codeph; these are processed in turn into XLIFF-compliant <mrk> elements. Some inline elements are annotated as no-translate - see https://github.com/jason-fox/fox.jason.translate.xliff/blob/master/Customization/xsl/no-translate-elements.xsl

None of the autotranslate tools I've seen take XLIFF directly - they are usually JSON based, so when running the autotranslate dita command there is an effort to amend the text fragments to be compliant with the tool - I think Bing does best here.

Profiling is not used - this transform simply converts everything under this directory to XLIFF. If you need something more complex, I'd just run the dita -f dita command to get normalized DITA first.

Given the implicit restrictions, only the DITA elements listed above are currently candidates for translation. I'm not running preprocess, so I'm looking directly for elements with those names; if you've implemented specialization, the transform won't find your new elements. You could just add something to the list here: https://github.com/jason-fox/fox.jason.translate.xliff/blob/master/xsl/add-md5-src.xsl#L77

jason-fox commented 2 years ago

From @xephon2

So, if I understand you and the code correctly, translate="no" is not taken into account.

I'm not sure if DITA normalize would work for us, because we assume that the entire topic structure is mirrored in all languages. We usually translate to 15 languages in average and in up to 39 in peak.

jason-fox commented 2 years ago

There is some residual @translate support in the plugin, since @translate is copied over from the DITA to the XLIFF 2.1 <mrk> and XLIFF 1.2 <x> elements.

As an example:

DITA source

<p>
   The following lines are the origin of
   <term translate="no" xml:lang="la">Lorem Ipsum</term> :
</p>
<ul>
   <li>
      Loves or pursues or desires to obtain
      <codeph>pain of itself</codeph>, because it
      is pain, but occasionally circumstances occur in which toil and
      pain can procure him some great pleasure.
   </li>
</ul>

XLIFF 1.2 output

The <term> in the first <trans-unit> is annotated as translate="no", the <codeph> in the second <trans-unit> is annotated as translate="no" by default.

<trans-unit xmlns:dita="dita-ot.org" approved="no" id="10044" xml:space="preserve">
    <source xml:lang="en">The following lines are the origin of 
         <x id="d3e11" ctype="x-dita-term" translate="no" xml:lang="la">Lorem Ipsum </x>: 
    </source>
    <target xml:lang="de" />
</trans-unit>

<trans-unit xmlns:dita="dita-ot.org" approved="no" id="54947" xml:space="preserve">
   <source xml:lang="en">Loves or pursues or desires to obtain 
       <x id="d3e20" ctype="x-dita-codeph" translate="no">pain of itself </x>, because it is pain,
        but occasionally circumstances occur in which toil and pain can procure him some great pleasure. 
   </source>
   <target xml:lang="de" />
</trans-unit>

XLIFF 2.1 output

The <term> in the first <unit> is annotated as translate="no", the <codeph> in the second <unit> is annotated as translate="no" by default.

<unit  id="10044" fs:fs="p">
   <originalData>
      <data id="sd4e11">&lt;term translate="no" xml:lang="la"&gt;</data>
      <data id="ed4e11">&lt;/term&gt;</data>
   </originalData>
   <segment state="initial">
      <source xml:space="preserve" xml:lang="en">
          The following lines are the origin of 
          <pc id="d4e11" dataRefStart="sd4e11" dataRefEnd="ed4e11" fs:fs="code">
             <mrk translate="no" type="term" id="md4e11">Lorem Ipsum</mrk>
          </pc>: 
      </source>
      <target xml:lang="de" />
   </segment>
</unit>

<unit  id="54947" fs:fs="li">
   <originalData>
      <data id="sd4e20">&lt;codeph&gt;</data>
      <data id="ed4e20">&lt;/codeph&gt;</data>
   </originalData>
   <segment state="initial">
      <source xml:space="preserve" xml:lang="en">
            Loves or pursues or desires to obtain 
           <pc id="d4e20" dataRefStart="sd4e20" dataRefEnd="ed4e20" fs:fs="code">
                <mrk translate="no" type="term" id="md4e20">pain of itself</mrk>
          </pc>, because it is pain, but occasionally circumstances occur in which toil and pain can 
          procure him some great pleasure. 
      </source>
      <target xml:lang="de" />
   </segment>
</unit>

The following DITA elements gain a translate="no" by default - I don't think this can be overridden though.

jason-fox commented 2 years ago

In addition to that list, the block elements are currently always translated, so unless changes are made to the existing XSLT transforms it currently makes no sense to annotate any of the following DITA elements with translate="no"

This would be easy to fix, since the plugin already ignores elements with no text within them (where the Fletcher's checksum is zero) so the XSLT logic could be amended to look for translate="no" as well.
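
The proposed rule can be sketched in Python rather than XSLT. This is not the plug-in's code: `translatable_blocks` is a hypothetical name, the block list is limited to the two elements that appear in the examples in this thread, and an empty-text test stands in for the zero-checksum check.

```python
import xml.etree.ElementTree as ET

# Illustrative only: p and li are the block elements that appear in the
# examples in this thread; the plug-in's real list is longer.
BLOCK_ELEMENTS = {"p", "li"}


def translatable_blocks(dita_xml: str) -> list:
    """Return the text of each block that would be sent for translation."""
    root = ET.fromstring(dita_xml)
    result = []
    for element in root.iter():
        if element.tag not in BLOCK_ELEMENTS:
            continue
        # Collapse whitespace across the block and its inline children.
        text = " ".join("".join(element.itertext()).split())
        if not text:
            continue  # existing rule: nothing to translate
        if element.get("translate") == "no":
            continue  # proposed rule: the block has opted out
        result.append(text)
    return result
```

Only the attribute on the block itself is checked here; a translate="no" set on an ancestor would need a separate inheritance step.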

jason-fox commented 2 years ago

I've tweaked the code and added two new test cases. Please reinstall from master:

dita uninstall fox.jason.translate.xliff
dita install https://github.com/jason-fox/fox.jason.translate.xliff/archive/master.zip

No Translate

Support added for translate="no" on topic, sub-topic and block level elements. If one of these XML elements is annotated with translate="no", its text is no longer added to the XLIFF file. Also, if no translatable text is found within a DITA file, the raw DITA is added directly to the skeletons without a .skl suffix.

dita -f xliff-create \
   -i ../plugins/fox.jason.translate.xliff/test/create-xliff2-translate-no/document.ditamap \
   --xliff.version=2

Yes Translate

Support added for translate="yes" - span level elements are added to the XLIFF file annotated with <mrk translate="yes">

dita -f xliff-create \
   -i ../plugins/fox.jason.translate.xliff/test/create-xliff2-translate-yes/document.ditamap \
   --xliff.version=2

The equivalent tests have also been added for XLIFF 1.2.

Could you check to see if this is working as expected?

stefan-jung commented 2 years ago

I will do, but this will take a few weeks.

jason-fox commented 2 years ago

What about profiling attributes? I could not find any info about this in the docs.

To implement profiling, the mechanism I can see working is to run the normalization transform, dita -f dita (with filters and flags applied), and then run dita -f xliff-create on the normalized result. Is that viable? It would also be possible to add another transform that performs the normalization as a pre-process step, integrating both into one step.
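
A hedged sketch of what chaining the two runs might look like from a Python wrapper. The xliff-create transtype and --xliff.version=2 are taken from the commands earlier in this thread; the use of --filter for the DITAVAL file, the output layout of the normalized map, and the names `build_commands` and `run_pipeline` are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(ditamap: Path, ditaval: Path, normalized_dir: Path) -> list:
    """Return the two dita invocations as argument lists."""
    normalize = [
        "dita",
        "-f", "dita",
        "-i", str(ditamap),
        "-o", str(normalized_dir),
        f"--filter={ditaval}",
    ]
    # Assumption: the normalized map keeps its file name inside the
    # output directory.
    create_xliff = [
        "dita",
        "-f", "xliff-create",
        "-i", str(normalized_dir / ditamap.name),
        "--xliff.version=2",
    ]
    return [normalize, create_xliff]


def run_pipeline(ditamap: Path, ditaval: Path, normalized_dir: Path) -> None:
    """Run normalization, then XLIFF creation, stopping at the first failure."""
    for command in build_commands(ditamap, ditaval, normalized_dir):
        subprocess.run(command, check=True)
```

With this approach the XLIFF skeletons are based on the normalized files rather than the original sources.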

The question is whether your existing *.dita files or your normalized *.dita need to be the basis of your XLIFF skeletons.

I've always been skeptical that XLIFF can work successfully with DITA <keyword> mapping to a noun like a brand name - particularly where languages have cases and genders, so I'd be interested to understand how you manage this.

stefan-jung commented 2 years ago

Hi @jason-fox,

We would need the translate="no" support to slurp through content which should not be translated. So, in a Chinese file, the English content flagged with translate="no" should be kept. To give you an example: We are following the principle the oXygen team is suggesting in the shipped sample DITA project. For reusable content, we use tables. The reusable chunk is stored in the left column. The right column contains a description for explaining what the chunk is used for. This must be blocked for the translator but kept in English for the technical writer. Translating this meta-information would cost us an incredible amount of money.

The example below shows a warehouse topic containing a single phrase which is defined in the technical standard EN 60335-2-24. This means all content of the file should remain in English; only the text WARNING: fill with potable water only. should be translated.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="urn:dometic:names:tc:dita:rng:dometicWarehouse.rng" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="urn:dometic:names:tc:dita:rng:dometicWarehouse.rng" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<dometic-warehouse id="WHS_EN60335-2-24" xml:lang="en-US">
  <title translate="no">EN 60335-2-24</title>
  <conbody>
    <table frame="all" rowsep="1" colsep="1">
      <tgroup cols="2">
        <colspec colname="c1" colnum="1" colwidth="1.0*"/>
        <colspec colname="c2" colnum="2" colwidth="1.0*"/>
        <thead>
          <row>
            <entry translate="no">Component</entry>
            <entry translate="no">Description</entry>
          </row>
        </thead>
        <tbody>
          <row>
            <entry>
              <ph id="use-potable-water">WARNING: fill with potable water only.</ph>
            </entry>
            <entry translate="no">
              <p>7.12 The instructions for <b>ice-makers</b> not intended to be connected to the
                water supply shall state the substance of this warning.</p>
            </entry>
          </row>
        </tbody>
      </tgroup>
    </table>
  </conbody>
</dometic-warehouse>
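
To make the expected result concrete, here is a minimal Python sketch that walks a trimmed copy of the topic above (colspec elements and table attributes omitted) and collects only the text whose nearest translate setting is not "no". The function name and the inheritance rule as coded here are assumptions for illustration, not the plug-in's implementation.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<dometic-warehouse id="WHS_EN60335-2-24" xml:lang="en-US">
  <title translate="no">EN 60335-2-24</title>
  <conbody>
    <table>
      <tgroup cols="2">
        <thead>
          <row>
            <entry translate="no">Component</entry>
            <entry translate="no">Description</entry>
          </row>
        </thead>
        <tbody>
          <row>
            <entry>
              <ph id="use-potable-water">WARNING: fill with potable water only.</ph>
            </entry>
            <entry translate="no">
              <p>7.12 The instructions for <b>ice-makers</b> not intended to be connected to the water supply shall state the substance of this warning.</p>
            </entry>
          </row>
        </tbody>
      </tgroup>
    </table>
  </conbody>
</dometic-warehouse>
"""


def translatable_text(element, inherited=True):
    """Collect text fragments that are not covered by translate="no"."""
    translate = element.get("translate")
    # An explicit attribute overrides the inherited setting.
    active = inherited if translate is None else translate != "no"
    found = []
    if active and element.text and element.text.strip():
        found.append(" ".join(element.text.split()))
    for child in element:
        found.extend(translatable_text(child, active))
        # Tail text sits inside this element, so it follows this element's setting.
        if active and child.tail and child.tail.strip():
            found.append(" ".join(child.tail.split()))
    return found


fragments = translatable_text(ET.fromstring(SAMPLE))
```

For this sample, `fragments` contains just the warning phrase; the title, the header cells and the description paragraph are all skipped.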

I don't think that normalizing content before translation is a viable solution for us. Fluenta (https://github.com/rmraya/Fluenta) is able to mirror the entire topic and directory structure. IMHO I cannot re-create the entire directory, map and topic structure from normalized DITA. But maybe I'm wrong. For us it is a must that all directories are correctly mirrored. Whenever something changes in the en-US directory, the change should be mirrored in, for example, the de-DE directory, provided the English topic is used in a map which I slurp through the translation process. We are dealing with a huge number of maps, topics, languages and publications. Each leaf of my directory tree has (or will have) a structure like this:

root/
├── Product A
│   └── topics/
│       ├── ar-EG/
│       ├── bg-BG/
│       ├── cs-CZ/
│       ├── da-DK/
│       ├── de-DE/
│       ├── el-GR/
│       ├── es-ES/
│       ├── et-EE/
│       ├── fi-FI/
│       ├── fr-FR/
│       ├── he-IL/
│       ├── hr-HR/
│       ├── hu-HU/
│       ├── id-ID/
│       ├── is-IS/
│       ├── it-IT/
│       ├── ja-JP/
│       ├── ka-GE/
│       ├── ko-KR/
│       ├── lt-LT/
│       ├── lv-LV/
│       ├── mk-MK/
│       ├── ms-MY/
│       ├── nl-NL/
│       ├── no-NO/
│       ├── pl-PL/
│       ├── pt-PT/
│       ├── ro-RO/
│       ├── ru-RU/
│       ├── sk-SK/
│       ├── sl-SI/
│       ├── sr-RS/
│       ├── sv-SE/
│       ├── th-TH/
│       ├── tr-TR/
│       ├── uk-UA/
│       ├── vi-VN/
│       ├── zh-CN/
│       └── zh-TW/
└── Product B
    └── topics/
        ├── ar-EG/
        ├── bg-BG/
        ├── cs-CZ/
        ├── da-DK/
        ├── de-DE/
        ├── el-GR/
        ├── es-ES/
        ├── et-EE/
        ├── fi-FI/
        ├── fr-FR/
        ├── he-IL/
        ├── hr-HR/
        ├── hu-HU/
        ├── id-ID/
        ├── is-IS/
        ├── it-IT/
        ├── ja-JP/
        ├── ka-GE/
        ├── ko-KR/
        ├── lt-LT/
        ├── lv-LV/
        ├── mk-MK/
        ├── ms-MY/
        ├── nl-NL/
        ├── no-NO/
        ├── pl-PL/
        ├── pt-PT/
        ├── ro-RO/
        ├── ru-RU/
        ├── sk-SK/
        ├── sl-SI/
        ├── sr-RS/
        ├── sv-SE/
        ├── th-TH/
        ├── tr-TR/
        ├── uk-UA/
        ├── vi-VN/
        ├── zh-CN/
        └── zh-TW/
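
The mirroring requirement can be sketched as a path mapping. This assumes an en-US directory sits alongside the locale directories shown above and that a topic keeps the same relative path in every language; `mirrored_paths` is a hypothetical helper, not part of any tool named in this thread.

```python
from pathlib import PurePosixPath


def mirrored_paths(source: PurePosixPath, source_locale: str, target_locales: list) -> list:
    """Return the path of the same topic under each target locale directory."""
    parts = list(source.parts)
    # Raises ValueError if the source path has no such locale directory.
    index = parts.index(source_locale)
    result = []
    for locale in target_locales:
        mirrored = parts.copy()
        mirrored[index] = locale
        result.append(PurePosixPath(*mirrored))
    return result
```

For example, a topic under root/Product A/topics/en-US/ maps to the same relative location under de-DE/ and zh-CN/.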