Open GoogleCodeExporter opened 8 years ago
Probably the simplest way is to replace ord(), chr(), and strlen() in the code
with Unicode-aware versions, which can be added to util.php:
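// Length of $s in characters rather than bytes.
function unistrlen($s) {
    return mb_strlen($s, 'utf-8');
}

// Code point of the first UTF-8 character in $c, or false if the lead
// byte is invalid. Continuation bytes are not validated.
function uniord($c) {
    // $c[0] instead of $c{0}: the brace offset syntax was removed in PHP 8.
    $h = ord($c[0]);
    if ($h <= 0x7F) {
        return $h;                      // 1 byte (ASCII)
    } else if ($h < 0xC2) {
        return false;                   // continuation byte or overlong lead
    } else if ($h <= 0xDF) {
        return ($h & 0x1F) << 6 | (ord($c[1]) & 0x3F);      // 2 bytes
    } else if ($h <= 0xEF) {
        return ($h & 0x0F) << 12 | (ord($c[1]) & 0x3F) << 6
                                 | (ord($c[2]) & 0x3F);     // 3 bytes
    } else if ($h <= 0xF4) {
        return ($h & 0x0F) << 18 | (ord($c[1]) & 0x3F) << 12
                                 | (ord($c[2]) & 0x3F) << 6
                                 | (ord($c[3]) & 0x3F);     // 4 bytes
    } else {
        return false;                   // beyond U+10FFFF
    }
}

// UTF-8 string for code point $u. The 'HTML-ENTITIES' encoding is
// deprecated as of PHP 8.2; mb_chr($u, 'UTF-8') is available from PHP 7.2.
function unichr($u) {
    return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}

// Array of code points, one per character of $string.
function strToIntArray($string) {
    $arr = array();
    for ($i = 0, $n = mb_strlen($string, 'utf-8'); $i < $n; $i++) {
        $arr[] = uniord(mb_substr($string, $i, 1, 'utf-8'));
    }
    return $arr;
}

// Code point of the character at character index $i.
function charAt($str, $i) {
    return uniord(mb_substr($str, $i, 1, 'utf-8'));
}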
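For anyone on a newer PHP, the hand-rolled decoding above may not be needed. A minimal sketch, assuming PHP 7.2 or later with the mbstring extension, where mb_ord() and mb_chr() exist. The wrapper names uniordCompat and unichrCompat are hypothetical and are not part of the ANTLR PHP runtime:

```php
<?php
// Hypothetical wrappers around the mbstring built-ins (PHP >= 7.2).

function uniordCompat($c) {
    // Code point of the first character of a UTF-8 string.
    return mb_ord($c, 'UTF-8');
}

function unichrCompat($u) {
    // UTF-8 string for a single code point.
    return mb_chr((int) $u, 'UTF-8');
}

$eAcute = "\xC3\xA9";                               // UTF-8 bytes of U+00E9
var_dump(uniordCompat($eAcute));                    // int(233)
var_dump(unichrCompat(0x20AC) === "\xE2\x82\xAC");  // bool(true), U+20AC EURO SIGN
```

String literals are written as byte escapes so the result does not depend on the encoding of the source file.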
Original comment by fleshcol...@gmail.com
on 11 Apr 2010 at 9:39
My workaround was to transcode the string to Latin-1 before calling the parser
and lexer. I know that I'm losing any UTF-8 characters that have no Latin-1
equivalent, but accents are OK.
function parser($expr) {
    // Lossy: characters outside ISO-8859-1 cannot be represented.
    $expr = mb_convert_encoding($expr, 'ISO-8859-1', 'UTF-8');
    $ass = new ANTLRStringStream($expr);
    $lex = new DAMAaaSLexer($ass);
    $cts = new CommonTokenStream($lex);
    $par = new DAMAaaSParser($cts);
    return $par;
}
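To see in advance whether a given input would lose characters with this workaround, a round-trip check is one option. This is only a sketch: survivesLatin1 is a hypothetical helper name, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: true if $utf8 can go to ISO-8859-1 and back unchanged.
function survivesLatin1($utf8) {
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    $back   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return $back === $utf8;
}

var_dump(survivesLatin1("caf\xC3\xA9"));     // bool(true): e-acute exists in Latin-1
var_dump(survivesLatin1("5 \xE2\x82\xAC"));  // bool(false): the euro sign does not
```

An input that fails the check would reach the lexer with some characters replaced, so it may be worth rejecting or logging it before parsing.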
Original comment by nicolas....@gmail.com
on 7 Nov 2011 at 10:46
@fleshcoloured: The modifications you advise are not enough.
In the parsing process, there are calls to strlen(), which counts bytes and so
gives the wrong length for a string containing multi-byte characters.
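The mismatch is easy to reproduce outside the parser. A small sketch, assuming the mbstring extension; the variable name is illustrative only:

```php
<?php
// "ete" with two e-acute letters, as UTF-8: 3 characters, 5 bytes.
$word = "\xC3\xA9t\xC3\xA9";

var_dump(strlen($word));                                   // int(5), bytes
var_dump(mb_strlen($word, 'UTF-8'));                       // int(3), characters

// Byte-based slicing cuts the first character in half.
var_dump(substr($word, 0, 1) === "\xC3");                  // bool(true)
var_dump(mb_substr($word, 0, 1, 'UTF-8') === "\xC3\xA9");  // bool(true)
```

Any loop bound or offset computed with strlen() while characters are read with the Unicode helpers will therefore run past the end of the string or point at the wrong character.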
Original comment by nicolas....@gmail.com
on 7 Nov 2011 at 10:48
Original issue reported on code.google.com by
fleshcol...@gmail.com
on 11 Apr 2010 at 8:15