Closed GoogleCodeExporter closed 9 years ago
Done.
The result is:
<br><b> <x id="1"/> . Line <x id="2"/> , column <x id="3"/> </b><br>
Special characters that aren't valid for Moses are encoded like this:
s/\[/[/g;
s/\]/]/g;
s/\|/|/g;
Is it OK?
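For illustration, the escaping step described above can be sketched in Python. This is only a sketch: the helper names are hypothetical, and the replacement strings assume the numeric character references that the stock Moses tokenizer uses for these three characters (&#91;, &#93;, &#124;), which this thread does not confirm for the modified tokenizer.

```python
# Hypothetical mapping; assumes the numeric character references used by the
# stock Moses tokenizer. The modified tokenizer may use different targets.
MOSES_ESCAPES = {"[": "&#91;", "]": "&#93;", "|": "&#124;"}


def escape_for_moses(text):
    """Replace characters that Moses treats specially with character references."""
    return "".join(MOSES_ESCAPES.get(ch, ch) for ch in text)


def unescape_from_moses(text):
    """Reverse escape_for_moses for the three characters handled here."""
    for ch, entity in MOSES_ESCAPES.items():
        text = text.replace(entity, ch)
    return text


print(escape_for_moses("a[1]|b"))  # a&#91;1&#93;&#124;b
```

Escaping character by character avoids any interaction between the replacements, which matters more once "&" itself is added to the mapping.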
Original comment by xhu...@gmail.com
on 2 Mar 2011 at 5:48
Original comment by Achi...@gmail.com
on 9 Mar 2011 at 3:42
Original comment by xhu...@gmail.com
on 9 Mar 2011 at 3:49
Let me still verify this by running through the code.
Original comment by Achi...@gmail.com
on 9 Mar 2011 at 3:55
Line 79 in languagetool.xlf.en-us:
<br>Time: <x id="1"/>ms (including <x id="2"/>ms for rule matching)<br>
is tokenized to:
<br>Time: <x id="1"/> ms ( including <x id="2"/> ms for rule matching ) <br>
I would expect "<br> Time : [...]"
Also "&About..." and "&Open..." in the same file are tokenized to "&About..."
and "&Open...". Expected would be spaces between the & and the word and the
word and the "...". At least this behavior isn't consistent with words that do
not end in three periods.
Original comment by Achi...@gmail.com
on 14 Mar 2011 at 2:34
Remark: The unmodified Moses tokenizer (-l en option) tokenizes "&About..." as
"& About ..."
Original comment by Achi...@gmail.com
on 14 Mar 2011 at 2:45
Yes, I know about this issue. It is a bit more complex.
The bad tokenization happens only if the last word in the XML entity string
ends with a non-word character: in this case it is ":". In the case of
&h;Open... it is "..."
The problem is with line 274:
my @btag = split( /(&\w+;\S*)/i, $arr[$i] );
which tries to separate XML entities from normal words.
The main problem is that if no space were put between the last XML entity and
the first normal word, the whole string would be taken as an XML entity, and
this happens quite often. Therefore I chose to put a space between the last
XML entity and the first normal word.
This approach works unless a special non-word character follows the normal
word.
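The behavior of that pattern can be reproduced outside the tokenizer. The sketch below uses Python's re.split with the same regular expression as the Perl line quoted above. The sample strings are an assumption: they are the escaped form that "<br>Time:" would presumably have in the XLF file, and split_entities is a hypothetical helper name.

```python
import re

# Same pattern as the Perl split quoted above: an entity followed by any run
# of non-space characters.
ENTITY_RUN = re.compile(r"(&\w+;\S*)", re.IGNORECASE)


def split_entities(chunk):
    """Split a chunk on entity runs, keeping the runs (capturing group)."""
    return ENTITY_RUN.split(chunk)


# Without a space, \S* swallows the word after the last entity.
print(split_entities("&lt;br&gt;Time:"))   # ['', '&lt;br&gt;Time:', '']
# With a space inserted, the entity run ends at the space.
print(split_entities("&lt;br&gt; Time:"))  # ['', '&lt;br&gt;', ' Time:']
```

Perl's split drops trailing empty fields where Python keeps them, so the first result would have one element fewer in Perl; the matched run is the same.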
I'd propose leaving it open for now and trying to fix it at a later stage.
According to my files, such cases are not very common. And if I tried to fix
it now, other, more common problems with those XML entities could
arise.
Original comment by xhu...@gmail.com
on 14 Mar 2011 at 11:40
Original comment by Achi...@gmail.com
on 6 Sep 2013 at 5:01
For now escaped HTML/XML character entities should not be unescaped and handled
as separate tokens.
See also related issue 47.
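One way to read the comment above is "leave entities escaped and emit each one as its own token". Under that reading, a minimal Python sketch could look like the following. The pattern and the tokenize helper are hypothetical and are not the project's tokenizer; they only illustrate the intended token boundaries for the "&About..." case discussed earlier.

```python
import re

# Hypothetical token pattern, tried left to right at each position:
#   &#?\w+;   an escaped entity such as &lt; or &#91;, kept as one token
#   \.{3}     an ellipsis written as three periods, kept as one token
#   \w+       a normal word
#   [^\w\s]   any other single non-space character
TOKEN = re.compile(r"&#?\w+;|\.{3}|\w+|[^\w\s]")


def tokenize(text):
    """Return tokens with escaped entities left intact as single tokens."""
    return TOKEN.findall(text)


print(" ".join(tokenize("&lt;br&gt;Time:")))  # &lt; br &gt; Time :
print(" ".join(tokenize("&About...")))        # & About ...
```

A bare "&" that does not start an entity falls through to the last alternative, so "&About..." splits the same way the unmodified Moses tokenizer is reported to split it.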
Original comment by Achi...@gmail.com
on 24 Sep 2013 at 4:03
Original issue reported on code.google.com by
Achi...@gmail.com
on 28 Feb 2011 at 9:44