lvapeab / m4loc

Automatically exported from code.google.com/p/m4loc
GNU Lesser General Public License v3.0

mod_tokenizer is not unescaping &amp;amp; (not a problematic character for Moses) #8

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
1. Open a command window and change into the ./xliff directory
2. Run "xliff2moses.bat .\t\languagetool.xlf en-us" (or "./xliff2moses.bat 
.\t\languagetool.xlf en-us" on Unix) 
3. Open .\t\languagetool.xlf.tok.en-us in a text editor
4. View line 9

Actual:
"&amp;amp;Check Text"

Expected:
"& Check Text"

The ampersand is not a problematic character for the Moses decoder. This worked 
in an earlier version of the tokenizer.

Original issue reported on code.google.com by Achi...@gmail.com on 3 Mar 2011 at 1:45

GoogleCodeExporter commented 9 years ago
It's not a problem to implement it. But my vision is to stay as transparent as 
possible: all XML entities (&, <, ...) stay encoded as in the example above 
(&amp;amp;, &amp;lt;, ...).

Original comment by xhu...@gmail.com on 3 Mar 2011 at 9:17

GoogleCodeExporter commented 9 years ago

Original comment by Achi...@gmail.com on 3 Mar 2011 at 12:34

GoogleCodeExporter commented 9 years ago
If you are going with this logic you would also have to escape ' and " 
characters:
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Predefined_entities_in_XML

But this would directly affect the decoding (e.g. "Don 't").

So I propose to only escape characters that cause problems in the Moses decoder 
(<,>,[,],|). To clarify: the <,> around the inline elements should stay, as 
they get removed before the content is run through the decoder.
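
A minimal sketch of this proposal, not the actual mod_tokenizer code: only the five characters listed above are escaped and everything else, including &, ' and ", passes through unchanged. The function name and the replacement entities (numeric references for [, ] and |) are assumptions for illustration, and the input is assumed to be plain segment text with the inline elements already taken out.

```python
# Hypothetical mapping: only the characters named in this thread as
# problematic for the Moses decoder. The replacement strings are an
# assumption; the tokenizer may use different ones.
MOSES_ESCAPES = {
    "<": "&lt;",
    ">": "&gt;",
    "[": "&#91;",
    "]": "&#93;",
    "|": "&#124;",
}


def escape_for_moses(text: str) -> str:
    """Escape only the decoder-problematic characters, one character at a time."""
    return "".join(MOSES_ESCAPES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # The ampersand and the apostrophe are left untouched.
    print(escape_for_moses("& Check Text"))
    print(escape_for_moses("Don 't"))
    # Brackets, pipes and angle brackets are replaced.
    print(escape_for_moses("a|b [x] <y>"))
```

One consequence of leaving & alone: a literal "&amp;lt;" in the source text can no longer be told apart from an escaped <, so unescaping after decoding is not strictly reversible for such strings.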

Original comment by Achi...@gmail.com on 3 Mar 2011 at 3:51

GoogleCodeExporter commented 9 years ago
done in r.62
line 172

Original comment by xhu...@gmail.com on 11 Mar 2011 at 10:01

GoogleCodeExporter commented 9 years ago
This now emits a warning:
WARNING: incorrectly created original XLIFF. String: "& Check" should be wrapped in special tags.
The warning doesn't seem necessary for this case; it should only be raised for 
escaped XML/HTML-style tags.
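
A rough sketch of the narrower check suggested here, under the assumption that the warning is decided from the still-escaped segment text: warn only when the text contains something shaped like an escaped tag, not for every escaped entity. The regular expression and the function name are hypothetical and are not taken from the tokenizer.

```python
import re

# Hypothetical pattern: "&lt;", an optional "/", a letter, then anything
# up to the next "&gt;". A lone "&amp;" or a bare "&lt;" does not match.
ESCAPED_TAG = re.compile(r"&lt;/?[A-Za-z][^<>]*?&gt;")


def should_warn(segment: str) -> bool:
    """Return True only if the segment contains an escaped XML/HTML-style tag."""
    return ESCAPED_TAG.search(segment) is not None


if __name__ == "__main__":
    print(should_warn("&amp; Check Text"))               # plain ampersand
    print(should_warn("Click &lt;b&gt;here&lt;/b&gt;"))  # escaped inline markup
    print(should_warn("a &lt; b &gt; c"))                # comparison operators
```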

Original comment by Achi...@gmail.com on 14 Mar 2011 at 2:18