rmraya / OpenXLIFF

An open source set of Java filters for creating, merging and validating XLIFF 1.2, 2.0 and 2.1 files.
https://www.maxprograms.com/products/openxliff.html
Eclipse Public License 1.0
66 stars 18 forks source link

Invalid result after merge operation (related to Element#mergeText) #28

Closed ajacob closed 11 months ago

ajacob commented 11 months ago

Hello, thank you for sharing this tool.

I found a problem related to merge operation, I'll try to describe it in this issue

How to reproduce

test.skl

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1>%%%1%%%
</NIV1>
</ROOT>

test.xlf

<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" srcLang="fr" version="2.1" trgLang="en-US">
  <file original="test.xml" id="1">
    <skeleton href="test.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.15.0 20230913_0710</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <mda:metadata id="1">
        <mda:metaGroup category="attributes" id="ph0">
          <mda:meta type="ctype">x-bold</mda:meta>
        </mda:metaGroup>
      </mda:metadata>
      <originalData>
        <data id="ph0">&lt;G&gt;</data>
        <data id="ph1">&lt;/G&gt;</data>
      </originalData>
      <ignorable>
        <source xml:space="preserve">
          <ph id="ph0"/>
        </source>
      </ignorable>
      <segment state="final" id="1-0">
        <source xml:space="preserve"> Bonjour. </source>
        <target> Hello. </target>
      </segment>
      <ignorable>
        <source xml:space="preserve">
          <ph id="ph1"/>
        </source>
      </ignorable>
      <segment state="final" id="1-1">
        <source xml:space="preserve"> Ce text devrait être traduit </source>
        <target> This text should be translated </target>
      </segment>
    </unit>
  </file>
</xliff>

command

./merge.sh -xliff test.xlf -target result.xml

actual xml result

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1>
          <G>
         Bonjour. 
           Hello. 
          </G>
         Ce text devrait être traduit  This text should be translated </NIV1>
</ROOT>

expected xml result

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1><G> Hello. </G> This text should be translated </NIV1>
</ROOT>

Investigation

First, I observed that this issue appears because I formatted the xliff file. If I rollback some changes, the result is correct. Rolled back changes are:

      <ignorable>
        <source xml:space="preserve">
          <ph id="ph0"/>
        </source>
      </ignorable>

      <ignorable>
        <source xml:space="preserve"><ph id="ph0"/></source>
      </ignorable>

(needed as well for ph1)

Looking into the code, I think I understand the problem.

In https://github.com/rmraya/OpenXLIFF/blob/v3.15.0/src/com/maxprograms/xliff2/FromXliff2.java#L294-L326

We have:

At this point joinedSource and joinedTarget point to the same content (same java objects per references)

Then l. 326 we have:

src.setContent(harvestContent(joinedSource, tags, attributes));

Which internally calls joinedSource.getContent(); which itself calls Element#mergeText.

And this is precisely were we have an issue. When harvesting for the source we call the mergeText function that mutates some XMLnodes that are also referenced by the target. After harvesting for the source, the content of target becomes invalid.

To confirm this hypothesis, I simply removed the call to #mergeText in Element#getContent and it actually fixes the problem.

Solution

I don't know what would be the best to fix this issue, I think there are different possibilities like:

If you want me to share a PR with a fix proposal, feel free to ask.

By the way, I think it could be very helpful to introduce a pom.xml in order to have maven dependencies (in which case it would be necessary to publish XMLJava in maven central)

rmraya commented 11 months ago

Please provide the original XML file you are trying to translate so I can reproduce the issue.

There are no external dependencies for this project, so there is no need for Maven or a pom.xml file.

rmraya commented 11 months ago

FWIW, the lines of code that you mention are correct. They do exactly what they are supposed to do.

Please provide the source file for testing or close this issue.

ajacob commented 11 months ago

Hello @rmraya,

Everything you need to reproduce this issue is provided above.

Here are step by step instructions:

This will produce a result.xml file with duplicated text (as shown above).

To my understanding the produced result is not valid but if you think everything behaves as expected, feel free to close this issue

rmraya commented 11 months ago

@ajacob I need the original XML used to create the XLIFF. I want to perform a full roundtrip: generate XLIFF, translate and merge.

rmraya commented 11 months ago

@ajacob You mention that you made changes to the XLIFF, I want to see the version generated by OpenXLIFF

rmraya commented 11 months ago

@ajacob:

1) Created a source file like this

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1><G> Bonjour. </G> Ce text devrait être traduit </NIV1>
</ROOT>

2) generated XLIFF with this command:

convert.sh -file original.xml -srcLang fr-FR -tgtLang en-US -2.0 -type XMLG

and obtained this:

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr-FR" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en-US" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/Users/rmraya/Desktop/original.xml" id="1">
    <skeleton href="/Users/rmraya/Desktop/original.xml.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <segment id="1-0">
        <source xml:space="preserve"> Bonjour.</source>
      </segment>
      <ignorable id="1-1">
        <source xml:space="preserve"> </source>
      </ignorable>
    </unit>
    <unit id="2">
      <segment id="2">
        <source xml:space="preserve"> Ce text devrait être traduit </source>
      </segment>
    </unit>
  </file>
</xliff>

3) Translated like this:

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr-FR" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en-US" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/Users/rmraya/Desktop/original.xml" id="1">
    <skeleton href="/Users/rmraya/Desktop/original.xml.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <segment state="final" id="1-0">
        <source xml:space="preserve"> Bonjour.</source>
        <target xml:space="preserve"> Hello.</target>
      </segment>
      <ignorable id="1-1">
        <source xml:space="preserve"> </source>
      </ignorable>
    </unit>
    <unit id="2">
      <segment state="final" id="2">
        <source xml:space="preserve"> Ce text devrait être traduit </source>
        <target xml:space="preserve"> This text should be translated </target>
      </segment>
    </unit>
  </file>
</xliff>

4) Merged the translated file with this command:

merge.sh -xliff original.xml.xlf

and obtained this output file:

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1><G> Hello. </G> This text should be translated </NIV1>
</ROOT>

I don't know what you did, but the code seems to work.

ajacob commented 11 months ago

I don't know what you did, but the code seems to work.

I'll try to give you more information / STR, and hopefully help you reproduce this issue.

I've created a very simple sample that involves another tool that add translations into the xlf file.

SETUP

input.xml

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
    <NIV1><G>Bonjour.</G> Ceci est un text multi-ligne
qui a pour objectif de mettre en évidence
un simple bug. J'espère que ça aidera.</NIV1>
</ROOT>

config_ROOT.xml

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ini-file PUBLIC "-//MAXPROGRAMS//Converters 2.0.0//EN" "configuration.dtd" >
<ini-file>
    <tag hard-break="segment">NIV1</tag>
    <tag ctype="x-bold" hard-break="inline">G</tag>
</ini-file>

CONVERT

./convert.sh -type XML -file input.xml -xliff test.xlf -srcLang fr -tgtLang en-US -skl test.skl -xmlfilter . -2.1

test.xlf

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.1" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en-US" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/input.xml" id="1">
    <skeleton href="test.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <mda:metadata id="1">
        <mda:metaGroup category="attributes" id="ph0">
          <mda:meta type="ctype">x-bold</mda:meta>
        </mda:metaGroup>
      </mda:metadata>
      <originalData>
        <data id="ph0">&lt;G&gt;</data>
        <data id="ph1">&lt;/G&gt;</data>
      </originalData>
      <ignorable>
        <source xml:space="preserve"><ph id="ph0"/></source>
      </ignorable>
      <segment id="1-0">
        <source xml:space="preserve">Bonjour.</source>
      </segment>
      <ignorable>
        <source xml:space="preserve"><ph id="ph1"/></source>
      </ignorable>
      <segment id="1-1">
        <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
      </segment>
      <segment id="1-2">
        <source xml:space="preserve"> J'espère que ça aidera.</source>
      </segment>
    </unit>
  </file>
</xliff>

test.skl

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
    <NIV1>%%%1%%%
</NIV1>
</ROOT>

TRANSLATION

This is an important step as the xlf file is modified by another tool that:

test.xlf

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.1" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en-US" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/home/alexandre/hello/input.xml" id="1">
    <skeleton href="test.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <mda:metadata id="1">
        <mda:metaGroup category="attributes" id="ph0">
          <mda:meta type="ctype">x-bold</mda:meta>
        </mda:metaGroup>
      </mda:metadata>
      <originalData>
        <data id="ph0">&lt;G&gt;</data>
        <data id="ph1">&lt;/G&gt;</data>
      </originalData>
      <ignorable>
        <source xml:space="preserve">
            <ph id="ph0"/>
        </source>
      </ignorable>
      <segment id="1-0">
        <source xml:space="preserve">Bonjour.</source>
        <target>Hello.</target>
      </segment>
      <ignorable>
        <source xml:space="preserve">
            <ph id="ph1"/>
        </source>
      </ignorable>
      <segment id="1-1">
        <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
        <target>This is a multi-line text that is supposed to demo a simple bug.</target>
      </segment>
      <segment id="1-2">
        <source xml:space="preserve"> J'espère que ça aidera.</source>
        <target> I hope this helps.</target>
      </segment>
    </unit>
  </file>
</xliff>

MERGE

./merge.sh -xliff test.xlf -target result.xml

result.xml

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
    <NIV1>
            <G>
        Bonjour.
            Hello.
            </G>
         Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug. J'espère que ça aidera.This is a multi-line text that is supposed to demo a simple bug. I hope this helps.</NIV1>
</ROOT>

And as you can see, we now have duplicated content.

To help you better understand the changes in the xlf file, here is the diff (generated by another tool)

--- test.xlf    2023-11-02 23:50:14.943475711 +0100
+++ test.xlf    2023-11-02 23:50:19.660208235 +0100
@@ -26,20 +26,27 @@
         <data id="ph1">&lt;/G&gt;</data>
       </originalData>
       <ignorable>
-        <source xml:space="preserve"><ph id="ph0"/></source>
+        <source xml:space="preserve">
+            <ph id="ph0"/>
+        </source>
       </ignorable>
       <segment id="1-0">
         <source xml:space="preserve">Bonjour.</source>
+        <target>Hello.</target>
       </segment>
       <ignorable>
-        <source xml:space="preserve"><ph id="ph1"/></source>
+        <source xml:space="preserve">
+            <ph id="ph1"/>
+        </source>
       </ignorable>
       <segment id="1-1">
         <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
+        <target>This is a multi-line text that is supposed to demo a simple bug.</target>
       </segment>
       <segment id="1-2">
         <source xml:space="preserve"> J'espère que ça aidera.</source>
+        <target> I hope this helps.</target>
       </segment>
     </unit>
   </file>
rmraya commented 11 months ago

@ajacob

Your sample does not need a custom filter configuration.

I used this command to convert input.xml to XLIFF:

convert.sh -file input.xml -srcLang fr -tgtLang en -type XMLG -2.0

and got this XLIFF:

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/Users/rmraya/Desktop/input.xml" id="1">
    <skeleton href="/Users/rmraya/Desktop/input.xml.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <segment id="1">
        <source xml:space="preserve">Bonjour.</source>
      </segment>
    </unit>
    <unit id="2">
      <segment id="2-0">
        <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
      </segment>
      <segment id="2-1">
        <source xml:space="preserve"> J'espère que ça aidera.</source>
      </segment>
    </unit>
  </file>
</xliff>

I translated the XLIFF with Swordfish and got this translated version:

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/Users/rmraya/Desktop/input.xml" id="1">
    <skeleton href="/Users/rmraya/Desktop/input.xml.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <segment state="final" id="1">
        <source xml:space="preserve">Bonjour.</source>
        <target xml:space="preserve">Hello.</target>
      </segment>
    </unit>
    <unit id="2">
      <segment state="final" id="2-0">
        <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
        <target xml:space="preserve"> This is a multi-line text that aims to highlight a simple bug.</target>
      </segment>
      <segment state="final" id="2-1">
        <source xml:space="preserve"> J'espère que ça aidera.</source>
        <target xml:space="preserve"> I hope it will help.</target>
      </segment>
    </unit>
  </file>
</xliff>

Then, I merged the file using this command:

merge.sh -xliff translated.xlf -target translated.xml

and obtained this translated file:

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
    <NIV1><G>Hello.</G> This is a multi-line text that aims to highlight a simple bug. I hope it will help.</NIV1>
</ROOT>

Note that the segments of the XLIFF that you merged are not confirmed.

ajacob commented 11 months ago

@rmraya

Your sample does not need a custom filter configuration.

The sample doesn't, the 200KB+ XML file I was working on and on which I noticed the bug does The goal of the sample is to demonstrate the bug with very concise and simple data

When I created this issue, my objective was not to find a workaround or find a way to make the sample work, the purpose was to show you that the merge operation can fail with valid data (not produced by Swordfish).

rmraya commented 11 months ago

@ajacob

As far as I can see, the merge operation works as designed. I don't see a bug. The lines of code that you mentioned are doing exactly what is intended, recover text and tags from <source> when <ignorable> does not have <target>.

If you want me to fix anything, provide a real test case that I can use to reproduce the problem.