Closed GoogleCodeExporter closed 9 years ago
Tag fixed tag handling unescapes some escapes like &lt; in line 77 of the
languagetool.xlf.* files
Original comment by Achi...@gmail.com
on 23 Sep 2013 at 7:54
The latest Moses tokenizers (including the one in v1.0) escape certain
"problematic" characters. Unless the user runs
scripts/tokenizer/deescape-special-chars.perl, these stay escaped after decoding.
For M4Loc this can be an issue when the original XLIFF already contains escaped
characters, as in this example. Segment 77 of languagetool.xlf is:
&lt;br>&lt;b> <ph id="1">{0}</ph>. Line <ph id="2">{1}</ph>, column <ph
id="3">{2}</ph>&lt;/b>&lt;br>
The &lt; escapes here are intentional: they distinguish markup that was present
in the source document from the markup XLIFF uses to represent placeholders and
formatting. Formally this should not happen in XLIFF, since all markup should be
represented with XLIFF inline markup, but it is sometimes inevitable in
real-world scenarios.
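To make the problem concrete, here is a minimal Python sketch of the escaping behaviour described above. The replacement table is modeled on what the Moses tokenizer and deescape-special-chars.perl do, but it is an approximation for illustration, not a port of either script; the function names are hypothetical and are not part of Moses or M4Loc.

```python
# Assumed replacement table, modeled on the Moses tokenizer's escaping.
# '&' comes first so that the entities produced later are not escaped again.
MOSES_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special_chars(text):
    """Hypothetical stand-in for the escaping step of the tokenizer."""
    for raw, entity in MOSES_ESCAPES:
        text = text.replace(raw, entity)
    return text


def deescape_special_chars(text):
    """Hypothetical stand-in for deescape-special-chars.perl."""
    # Reverse order, so '&amp;' is restored last.
    for raw, entity in reversed(MOSES_ESCAPES):
        text = text.replace(entity, raw)
    return text


if __name__ == "__main__":
    # Escaping followed by de-escaping is lossless on its own.
    plain = "<b> & [x]"
    print(escape_special_chars(plain))
    print(deescape_special_chars(escape_special_chars(plain)) == plain)

    # The trouble starts when a segment mixes an intentional escape with
    # real inline markup: a blanket de-escape makes them indistinguishable.
    segment = '&lt;b> <ph id="1">{0}</ph>'
    print(deescape_special_chars(segment))
```

In the last call the literal source markup comes out as a bare `<b>` next to the `<ph>` element, so a later step can no longer tell which tags belong to XLIFF and which were text in the source document.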
The right fix for this is IMHO for Moses to support either plain text only
(i.e. no escapes necessary) or full XML. Right now it sits somewhere in between,
with a pseudo-XML format.
Checking in a small fix to make the handling of ampersand characters consistent
in the tokenizer wrapper.
Original comment by Achi...@gmail.com
on 24 Sep 2013 at 3:48
Small ampersand issue fixed by revision 94c524daa01b
Original comment by Achi...@gmail.com
on 24 Sep 2013 at 3:53
Original issue reported on code.google.com by
Achi...@gmail.com
on 23 Sep 2013 at 7:51