Escape handling broken for some XML/HTML character entities

GoogleCodeExporter commented 9 years ago

Repro steps:
1. Extract Moses InlineText from languagetool.xlf:
   tikal -xm -to languagetool.xlf.en-us languagetool.xlf
2. Translate the text to Spanish with the small En-Es MT system provided:
   perl ~/Documents/work/oss/m4loc/xliff/m4loc.pm -o p -n -s en -t es \
     -m ~/.../binarized_model/moses.ini -c ~/.../data/truecase-model.en \
     < languagetool.xlf.en-us > languagetool.xlf.pb.tok.es

Observations: 
Line 9 Source:
&amp;Check Text
Line 9 Target:
& cheque texto
Expected: Target ampersand also escaped

Line 16 Source:
Word repetition (e.g. 'will will')
Line 16 Target:
Palabra la repetición &apos; ( por ejemplo , se va a &apos; )
Expected: Quotes not escaped in target

Line 32 Target:
Punto : &quot; <x id="1"/> &quot; ( <x id="2"/> ) significa <x id="3"/> ( <x id="4"/> ) .
This last issue does not happen with tag fixed tag handling 

Remark: Try running deescape-special-chars.perl from the Moses
scripts/tokenizer folder.

Original issue reported on code.google.com by Achi...@gmail.com on 23 Sep 2013 at 7:51

GoogleCodeExporter commented 9 years ago

Tag fixed tag handling unescapes some escapes like &lt; in line 77 of the 
languagetool.xlf.* files

Original comment by Achi...@gmail.com on 23 Sep 2013 at 7:54

GoogleCodeExporter commented 9 years ago

The latest Moses tokenizers (including the one in v1.0) escape certain 
"problematic" characters. Unless the user runs 
scripts/tokenizer/deescape-special-chars.perl, these stay escaped after decoding.
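
A minimal sketch of that escape/deescape round trip, written in Python rather
than the Perl of the Moses scripts. The entity table is an assumption based on
the tokenizer shipped around v1.0, and moses_escape / moses_deescape are
hypothetical names for illustration, not Moses APIs:

```python
# Characters the Moses tokenizer is assumed to escape (check your Moses version).
MOSES_ESCAPES = [
    ("&", "&amp;"),   # must be escaped first so later entities are not double-escaped
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def moses_escape(text: str) -> str:
    """Escape text roughly the way the tokenizer does before decoding."""
    for raw, entity in MOSES_ESCAPES:
        text = text.replace(raw, entity)
    return text


def moses_deescape(text: str) -> str:
    """Undo moses_escape; "&amp;" is decoded last to avoid decoding twice."""
    for raw, entity in reversed(MOSES_ESCAPES):
        text = text.replace(entity, raw)
    return text


source = "Word repetition (e.g. 'will will')"
escaped = moses_escape(source)       # quotes become &apos; and stay that way
restored = moses_deescape(escaped)   # only an explicit deescape step restores them
```

Without the deescape step, the &apos; entities from the tokenizer survive into
the target, which matches the line 16 observation in the report.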

For M4Loc this can be an issue when the original XLIFF already contains escaped 
characters, as in this example. E.g. segment 77 of languagetool.xlf is:
&lt;br&gt;&lt;b&gt; <ph id="1">{0}</ph>. Line <ph id="2">{1}</ph>, column <ph id="3">{2}</ph>&lt;/b&gt;&lt;br&gt;

- The &lt; escapes here are intentional: they distinguish markup that was 
present in the source document from markup used by XLIFF to represent 
placeholders and formatting. Formally this should not happen in XLIFF, since 
all markup should be represented with XLIFF inline markup, but it is sometimes 
inevitable in real-world scenarios.
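
One way to keep the two kinds of markup apart after decoding, sketched in
Python under stated assumptions: the wrapper hands Moses plain text, inline
XLIFF tags reach the output as literal tags, and everything else carries
Moses-style entities. INLINE_TAG, MOSES_ENTITIES and normalize_target are
hypothetical names, and this is not what m4loc.pm currently does:

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern for the XLIFF inline elements seen in this report
# (<x id="1"/>, <ph id="1">, </ph>); extend it for other inline tags.
INLINE_TAG = re.compile(r'(</?(?:x|ph)\b[^<>]*>)')

# Entities the Moses tokenizer is assumed to emit; "&amp;" is decoded last.
MOSES_ENTITIES = [
    ("&#124;", "|"), ("&lt;", "<"), ("&gt;", ">"), ("&apos;", "'"),
    ("&quot;", '"'), ("&#91;", "["), ("&#93;", "]"), ("&amp;", "&"),
]


def normalize_target(moses_output: str) -> str:
    """Deescape Moses entities in text, then re-escape for XML, keeping inline tags."""
    parts = INLINE_TAG.split(moses_output)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Odd indexes hold the captured inline tags: pass them through untouched.
            out.append(part)
            continue
        for entity, raw in MOSES_ENTITIES:
            part = part.replace(entity, raw)
        # escape() only rewrites &, < and >, so quotes stay unescaped.
        out.append(escape(part))
    return "".join(out)


segment_77 = '&lt;br&gt;&lt;b&gt; <ph id="1">{0}</ph>. Line'
kept = normalize_target(segment_77)            # source-document markup stays escaped
ampersand = normalize_target("& cheque texto")  # bare ampersand gets escaped
```

With this approach the escaped source markup and the XLIFF placeholders never
collapse into the same representation, at the cost of maintaining the list of
inline tag names by hand.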

The right fix for this is IMHO for Moses to support either plain text only 
(i.e. no escapes necessary) or full XML. Right now the situation is an 
in-between pseudo-XML format.

Checking in a small fix to make the handling of ampersand characters consistent 
in the tokenizer wrapper.

Original comment by Achi...@gmail.com on 24 Sep 2013 at 3:48

GoogleCodeExporter commented 9 years ago

Small ampersand issue fixed by revision 94c524daa01b

Original comment by Achi...@gmail.com on 24 Sep 2013 at 3:53

Changed state: Fixed

lvapeab / m4loc

Escape handling broken for some XML/HTML character entities #47