lvapeab / m4loc

Automatically exported from code.google.com/p/m4loc
GNU Lesser General Public License v3.0

Tokenizer introduces extra spaces with unescaped XML character entities #5

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
1. tikal.sh -xm ./t/languagetool.xlf
2. perl mod_tokenizer.pl -l en-us < languagetool.xlf.en-us > languagetool.xlf.tok.en-us

Compare line 77 of
languagetool.xlf.en-us:
&lt;br>&lt;b> <x id="1"/>. Line <x id="2"/>, column <x id="3"/>&lt;/b>&lt;br>

languagetool.xlf.tok.en-us:
< br > < b > <x id="1"/> . Line <x id="2"/> , column <x id="3"/> < / b > < br >

Expected output:
<br><b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b><br>
or
<br> <b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b> <br>

The extra spaces make it hard to distinguish tags from < and > characters in 
sentences like "The temperature is < 12 degrees, but > than 6."
As separate tokens, "b" and "br" might be translated into other strings, which 
would break the markup.

Original issue reported on code.google.com by Achi...@gmail.com on 28 Feb 2011 at 9:44

GoogleCodeExporter commented 9 years ago
Done.
The result is:
<br><b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b><br>

Special characters that aren't valid for Moses are encoded like:
s/\[/[/g;
s/\]/]/g;
s/\|/|/g;

Is it OK?

Original comment by xhu...@gmail.com on 2 Mar 2011 at 5:48

GoogleCodeExporter commented 9 years ago

Original comment by Achi...@gmail.com on 9 Mar 2011 at 3:42

GoogleCodeExporter commented 9 years ago

Original comment by xhu...@gmail.com on 9 Mar 2011 at 3:49

GoogleCodeExporter commented 9 years ago
Let me still verify this by running through the code.

Original comment by Achi...@gmail.com on 9 Mar 2011 at 3:55

GoogleCodeExporter commented 9 years ago
Line 79 in languagetool.xlf.en-us:
<br>Time: <x id="1"/>ms (including <x id="2"/>ms for rule matching)<br>
is tokenized to:
<br>Time: <x id="1"/> ms ( including <x id="2"/> ms for rule matching ) <br>
I would expect "<br> Time : [...]"

Also, "&About..." and "&Open..." in the same file come out of the tokenizer 
unchanged, as "&About..." and "&Open...". I would expect a space between the & 
and the word, and another between the word and the "...". At the very least, 
this behavior is inconsistent with how words that do not end in three periods 
are handled.

Original comment by Achi...@gmail.com on 14 Mar 2011 at 2:34

GoogleCodeExporter commented 9 years ago
Remark: The unmodified Moses tokenizer (-l en option) tokenizes "&About..." as 
"& About ..."
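
For comparison, the splitting described in this remark can be sketched in a few lines. This is only an illustration of the expected "& About ..." result, written in Python so it can be run on its own; it is not the Moses tokenizer's code, and the function name split_symbols is made up for this example.

```python
import re

def split_symbols(text):
    """Pad symbols with spaces, keeping runs of periods together (sketch)."""
    # Keep a run of two or more periods (e.g. "...") as a single token.
    text = re.sub(r'\.{2,}', lambda m: ' ' + m.group(0) + ' ', text)
    # Pad every other character that is not a word character, whitespace,
    # or a period. Single periods are deliberately left alone here.
    text = re.sub(r'([^\w\s.])', r' \1 ', text)
    # Collapse the extra whitespace introduced by the padding.
    return ' '.join(text.split())

print(split_symbols('&About...'))  # & About ...
```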

Original comment by Achi...@gmail.com on 14 Mar 2011 at 2:45

GoogleCodeExporter commented 9 years ago
Yes, I know about this issue. It is a bit more complex.
The bad tokenization happens only if the last word in the XML entity string 
ends with a non-word character. In this case it is ":"; in the case of 
&h;Open... it is "...".

The problem is with line #274:
my @btag = split( /(&\w+;\S*)/i, $arr[$i] );

which tries to separate XML entities from normal words. 
The main problem is that if no space were inserted between the last XML entity 
and the first normal word, the whole string would be treated as one XML entity, 
and this happens quite often. Therefore I chose to insert a space between the 
last XML entity and the first normal word.

This approach works until a non-word character follows the normal word.

I'd propose leaving it open for now and trying to fix it at a later stage. 
Judging by my files, such cases are not very common. And if I tried to fix it 
now, other, more common problems with these XML entities could appear.
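
The effect of the pattern on line #274 can be reproduced outside the tokenizer. The sketch below uses the same regular expression with Python's re.split, which also returns the captured delimiter, roughly as Perl's split does. The helper name and the sample tokens are assumptions for illustration (the escaped form "&lt;br>Time:" is a guess at what the file contains); this is not the code from mod_tokenizer.pl.

```python
import re

# Same pattern as on line #274, with the /i flag expressed as IGNORECASE.
ENTITY_CHUNK = re.compile(r'(&\w+;\S*)', re.IGNORECASE)

def split_entity_chunks(token):
    """Split one token on entity chunks and drop the empty pieces."""
    return [piece for piece in ENTITY_CHUNK.split(token) if piece]

# Without a space, \S* swallows the word that follows the entity, so the
# whole token is returned as a single "entity" chunk.
print(split_entity_chunks('&lt;br>Time:'))   # ['&lt;br>Time:']

# With a space after the entity markup, the normal word becomes its own piece.
print(split_entity_chunks('&lt;br> Time:'))  # ['&lt;br>', ' Time:']
```

The greedy \S* is what makes the inserted space necessary: it keeps matching until the next whitespace character, whatever follows the entity.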

Original comment by xhu...@gmail.com on 14 Mar 2011 at 11:40

GoogleCodeExporter commented 9 years ago

Original comment by Achi...@gmail.com on 6 Sep 2013 at 5:01

GoogleCodeExporter commented 9 years ago
For now escaped HTML/XML character entities should not be unescaped and handled 
as separate tokens.
See also related issue 47.
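
One reading of the note above is that escaped entities stay escaped and each one is kept as a token of its own. A minimal sketch of that idea follows, in Python rather than Perl so it can be run standalone. The pattern, the function name, and the choice to treat an escaped tag such as &lt;br> as one unit are assumptions made for this example; none of this is taken from m4loc.

```python
import re

# Pieces kept verbatim: an escaped tag without attributes (&lt;br>, &lt;/b&gt;),
# a real inline tag (<x id="1"/>), or a lone character entity (&lt;, &#91;).
# Assumes the input is well-formed, so a literal "<" only ever starts a tag.
PROTECTED = re.compile(r'(&lt;/?\w+(?:&gt;|>)|<[^<>]+>|&#?\w+;)')

def tokenize_keep_entities(line):
    tokens = []
    # With one capture group, odd indices of the split hold the matches.
    for index, piece in enumerate(PROTECTED.split(line)):
        if index % 2 == 1:
            tokens.append(piece)  # markup or entity, left untouched
        else:
            # Plain text: pad punctuation, then split on whitespace.
            tokens.extend(re.sub(r'([^\w\s])', r' \1 ', piece).split())
    return ' '.join(tokens)

print(tokenize_keep_entities('The temperature is &lt; 12 degrees, but &gt; than 6.'))
# The temperature is &lt; 12 degrees , but &gt; than 6 .
```

Because nothing is unescaped, a comparison sign stays "&lt;" while inline tags keep their literal angle brackets, so the two cannot be confused after tokenization.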

Original comment by Achi...@gmail.com on 24 Sep 2013 at 4:03