Closed GoogleCodeExporter closed 9 years ago
Corrected expected:
<br><b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b><br>
Original comment by Achi...@gmail.com
on 24 Feb 2011 at 10:16
It is done by LibXML::Reader. Yes, you are right: <, >, ... should not reach
Moses. However, strings like "<br><b>" are also not good for translation. I
think this is a problem of incorrectly created XLIFF. Such strings should be
encapsulated in XLIFF's paired tags. If so, I'd leave it as it is for now ...
Original comment by xhu...@gmail.com
on 25 Feb 2011 at 5:40
I agree that these characters should already be correctly wrapped in XLIFF
inline elements, but doesn't this happen frequently (e.g. with XML parsed by
WorldServer)?
If these characters (<, >, [, ], |, non-printing characters) stay in the input for
Moses, the decoder will fail, so to avoid this we need to do something. Here
are the options I see:
1.) escape the characters - yes, having tokens like "<br><b>" in the decoder
input will not be great, but they will be handled as an unknown token and
transferred unchanged to the target
2.) delete the tokens - but then post-editors would probably have a lot of work
putting these back into the target
3.) handling these like XLIFF inline elements in the markup remover/reinserter
- this could be technically impossible because there could be paired tags that
don't have a corresponding tag in the same segment; even if it is possible, it
would mean lower quality of XLIFF inline element handling
So I would say if we don't see this happening often with XLIFF output by the
major TMS systems, we do 2.) (make the -a option the default for the markup
remover)
If they do happen frequently, I think we should do 1.) - the escaping can happen
in the markup remover, but should preferably happen in the modified tokenizer.
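For illustration, option 1.) could be sketched roughly as below. This is not the markup remover's actual code: the function names and the entity mapping are assumptions, chosen only so that none of <, >, [, ], | reaches the decoder and the original characters can be restored in the target afterwards.

```python
# Hypothetical mapping (an assumption, not taken from the tool itself).
# "&" comes first so that the entities introduced below are not escaped twice.
ESCAPES = [
    ("&", "&amp;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
    ("|", "&#124;"),
]


def escape_for_decoder(text):
    """Replace characters the decoder cannot handle with entities."""
    for raw, entity in ESCAPES:
        text = text.replace(raw, entity)
    return text


def unescape_from_decoder(text):
    """Undo escape_for_decoder; reverse order so "&amp;" is restored last."""
    for raw, entity in reversed(ESCAPES):
        text = text.replace(entity, raw)
    return text


segment = "<br><b> Line [1] | column 2 </b><br>"
escaped = escape_for_decoder(segment)
restored = unescape_from_decoder(escaped)
```

Because whitespace is untouched, the number of whitespace-separated tokens stays the same before and after escaping.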
Original comment by Achi...@gmail.com
on 27 Feb 2011 at 5:33
Hmm, unfortunately, there is plenty of badly created XLIFF content. (In
commercial XLIFF files, I found some 23% of lines affected by these <, >, [, ], |
characters.)
Non-printing characters - well, characters like 0x7, ... also occur in XLIFF
files, but that makes the XML invalid. Even Tikal would fail because of the
invalid XML. I'd suggest leaving this as it is. Requiring valid XML seems
useful to me.
The special characters need to be solved in a different manner.
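As a rough illustration of the non-printing character case: XML 1.0 only allows tab, line feed, carriage return and the ranges 0x20-0xD7FF, 0xE000-0xFFFD and 0x10000-0x10FFFF, so a character like 0x7 cannot appear in a file that an XML parser accepts. A minimal Python sketch for spotting such characters before parsing (the function name is hypothetical, not part of any tool discussed here):

```python
import re

# Matches any character outside the XML 1.0 "Char" production.
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group()))) for m in INVALID_XML_CHARS.finditer(text)]


hits = find_invalid_xml_chars("Line\x07 2, column 3")
```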
Original comment by xhu...@gmail.com
on 28 Feb 2011 at 10:50
Resolving the problematic character issue for Moses with option 1.) mentioned
above by re-escaping the problematic characters in the markup remover. I will
still enter a separate issue about not introducing extra spaces in tags in the
modified tokenizer.
Original comment by Achi...@gmail.com
on 28 Feb 2011 at 9:36
Remark: option 2.) from above (removing non-inlineElement tags) would not work
because that would alter the token count between source and target and
therefore confuse the markup reinserter.
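A small sketch of the token-count effect, assuming plain whitespace tokenization (the real tokenizer may split differently) and a made-up example segment:

```python
def token_count(segment):
    # Assumption: tokens are separated by whitespace only.
    return len(segment.split())


source = "<br> <b> Line 2 </b>"

# Option 2.): drop every token that looks like a tag.
deleted = " ".join(tok for tok in source.split() if not tok.startswith("<"))

# Option 1.): keep the tokens, only escape the angle brackets.
escaped = source.replace("<", "&lt;").replace(">", "&gt;")

counts = (token_count(source), token_count(deleted), token_count(escaped))
```

Here deletion shrinks the segment from 5 tokens to 2, while escaping keeps all 5, so token positions still line up for the reinserter.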
Original comment by Achi...@gmail.com
on 28 Feb 2011 at 9:55
Original comment by Achi...@gmail.com
on 9 Mar 2011 at 3:36
Original comment by xhu...@gmail.com
on 9 Mar 2011 at 3:48
Original issue reported on code.google.com by
Achi...@gmail.com
on 24 Feb 2011 at 10:09