No <,>,[,] or | or non-printing characters should be output by the tokenizer

GoogleCodeExporter commented 9 years ago

1. tikal.sh -xm ./t/languagetool.xlf
2. perl mod_tokenizer.pl -l en-us < languagetool.xlf.en-us > languagetool.xlf.tok.en-us

Compare line 77 of
languagetool.xlf.en-us:
&lt;br>&lt;b> <x id="1"/>. Line <x id="2"/>, column <x id="3"/>&lt;/b>&lt;br>

languagetool.xlf.tok.en-us:
< br > < b > <x id="1"/> . Line <x id="2"/> , column <x id="3"/> < / b > < br >

Later the <x> tags will be removed, but the remaining < and > characters around 
the b tags will create problems with Moses. See
http://article.gmane.org/gmane.comp.nlp.moses.user/4123
(this is only from Feb-14, so different from what we discussed earlier)

Therefore expected:
&lt;br&gt;&lt;b&gt; <x id="1"/>. Line <x id="2"/>, column <x id="3"/>&lt;/b&gt;&lt;br&gt;

Original issue reported on code.google.com by Achi...@gmail.com on 24 Feb 2011 at 10:09

GoogleCodeExporter commented 9 years ago

Corrected expected:
<br><b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b><br>

Original comment by Achi...@gmail.com on 24 Feb 2011 at 10:16

GoogleCodeExporter commented 9 years ago

This is done by LibXML::Reader. Yes, you are right: <,>,... should not reach 
Moses. However, strings like "<br><b>" are not good for translation either. I 
think this is a problem of incorrectly created XLIFF. Such strings should be 
encapsulated in XLIFF's paired tags. If so, I'd leave it as it is for now ...

Original comment by xhu...@gmail.com on 25 Feb 2011 at 5:40

GoogleCodeExporter commented 9 years ago

I agree that these characters should already be correctly wrapped in XLIFF 
inline elements, but don't such cases occur frequently (e.g. with XML parsed by 
WorldServer)?

If these characters (<,>,[,],|,non-printing characters) stay in the input for 
Moses, the decoder will fail, so we need to do something to avoid this. Here 
are the options I see:
1.) escape the characters - yes, having tokens like "<br><b>" in the decoder 
input will not be great, but they will be handled as unknown tokens and 
transferred unchanged to the target
2.) delete the tokens - but then post-editors would probably have a lot of work 
putting these back into the target
3.) handle these like XLIFF inline elements in the markup remover/reinserter 
- this could be technically impossible because there could be paired tags that 
don't have a corresponding tag in the same segment. Even if it is possible, it 
would mean lower quality of XLIFF inline element handling

So I would say that if we don't see this happening often with XLIFF output by 
the major TMS systems, we do 2.) (make the -a option the default for the markup 
remover).
If they do happen frequently, I think we should do 1.) - the escaping can happen 
in the markup remover, but should preferably happen in the modified tokenizer.
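
A minimal sketch of what option 1.) could look like, written in Python for illustration only (the actual tools here are Perl scripts). It assumes the only inline placeholders in the text are of the form <x id="N"/>, and the function name and escape table are hypothetical, not taken from the markup remover or the tokenizer:

```python
import re

# Assumption: inline placeholders look exactly like <x id="1"/>.
# Other XLIFF inline elements are not handled in this sketch.
PLACEHOLDER = re.compile(r'(<x id="\d+"/>)')

# Hypothetical escape table for the characters named in this issue.
ESCAPE_TABLE = str.maketrans({
    "<": "&lt;",
    ">": "&gt;",
    "[": "&#91;",
    "]": "&#93;",
    "|": "&#124;",
})


def escape_problem_chars(line: str) -> str:
    """Escape <, >, [, ] and | everywhere except inside <x id="N"/> placeholders."""
    parts = PLACEHOLDER.split(line)
    # re.split with one capturing group puts the placeholders at odd indices.
    return "".join(
        part if i % 2 == 1 else part.translate(ESCAPE_TABLE)
        for i, part in enumerate(parts)
    )


print(escape_problem_chars('<br><b> <x id="1"/> . Line <x id="2"/> </b>'))
```

Because the placeholders are split out before escaping, they pass through untouched and can still be removed and reinserted later.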

Original comment by Achi...@gmail.com on 27 Feb 2011 at 5:33

GoogleCodeExporter commented 9 years ago

Hmm, unfortunately, there is plenty of badly created XLIFF content. (In 
commercial XLIFF files, I found some 23% of lines affected by these <,>,[,],| 
characters.)
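
One way such a figure could be measured, as a rough sketch only: the method behind the 23% above is not stated in this thread, so the function below (a hypothetical helper in Python) simply counts lines that contain at least one of the characters in question. It assumes the <x .../> placeholders have already been removed from the lines, otherwise every placeholder would count as a hit:

```python
import re

# The characters named in this issue: <, >, [, ] and |
PROBLEM_CHARS = re.compile(r"[<>\[\]|]")


def affected_fraction(lines):
    """Return the fraction of lines containing at least one problem character."""
    lines = list(lines)
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if PROBLEM_CHARS.search(line))
    return hits / len(lines)


sample = ["a < b", "plain text", "x | y", "nothing special"]
print(affected_fraction(sample))
```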

Non-printing characters - well, characters like 0x7,... do occur in XLIFF 
files, but they make the XML invalid. Even Tikal would fail on such input. 
I'd suggest leaving this as it is. Requiring valid XML seems useful to me.
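
For reference, a small sketch of how such characters could be spotted before the file reaches Tikal. It is written in Python for illustration and is not part of the existing tools; the function name is hypothetical. The character class follows the Char production of XML 1.0 (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF):

```python
import re

# Anything outside the XML 1.0 "Char" production is not allowed in a document.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text: str):
    """Return (offset, code point in hex) for each character XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group()))) for m in INVALID_XML_CHARS.finditer(text)]


print(find_invalid_xml_chars("ab\x07c"))
```

A check like this only reports the offending positions; whether to reject the file or strip the characters is a separate decision.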

The special characters need to be handled in a different way.

Original comment by xhu...@gmail.com on 28 Feb 2011 at 10:50

GoogleCodeExporter commented 9 years ago

Resolving the problematic character issue for Moses with option 1.) mentioned 
above, by re-escaping the problematic characters in the markup remover. I will 
still enter a separate issue so that the modified tokenizer does not introduce 
extra spaces in tags.

Original comment by Achi...@gmail.com on 28 Feb 2011 at 9:36

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Remark: option 2.) from above (removing non-inlineElement tags) would not work, 
because it would alter the token count between source and target and 
therefore confuse the markup reinserter.
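
A toy illustration of the token count argument, in Python with made-up tokens. It only assumes that the reinserter relies on token positions lining up; it does not model the real markup reinserter, and both helper functions are hypothetical:

```python
import re

# Assumption: a "tag token" is a single token such as <br> or </b>.
TAG_TOKEN = re.compile(r"^</?[A-Za-z][^<>]*>$")


def delete_tag_tokens(tokens):
    """Option 2.): drop tag tokens entirely, which shortens the token list."""
    return [t for t in tokens if not TAG_TOKEN.match(t)]


def escape_tag_tokens(tokens):
    """Option 1.): escape < and > in place, which keeps every token position."""
    return [t.replace("<", "&lt;").replace(">", "&gt;") for t in tokens]


source = ["<br>", "<b>", "Line", "7", "</b>", "<br>"]
print(len(source), len(delete_tag_tokens(source)), len(escape_tag_tokens(source)))
```

With deletion the list shrinks from six tokens to two, while escaping keeps all six positions intact.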

Original comment by Achi...@gmail.com on 28 Feb 2011 at 9:55

GoogleCodeExporter commented 9 years ago

Original comment by Achi...@gmail.com on 9 Mar 2011 at 3:36

GoogleCodeExporter commented 9 years ago

Original comment by xhu...@gmail.com on 9 Mar 2011 at 3:48

Changed state: Verified
