ocean2706 / antlrphpruntime

Automatically exported from code.google.com/p/antlrphpruntime

multibyte characters support #5

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Multibyte characters in the input stream are not recognized (for example,
accented characters such as é and à).

(Used version: 0.0.3)

Possible solution:

- Do not use "$string[$index]" to obtain a character; it returns a single byte.
- Use "mb_substr($string, $index, 1, 'utf-8')" instead. (In that case the input
should first be converted to UTF-8.)

An additional check for the presence of the mbstring extension can be added:
<pre>
if (function_exists('mb_substr'))
  $char = mb_substr($string, $index, 1, 'utf-8');
else
  $char = $string[$index];
</pre>
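
To see why byte indexing breaks, the sketch below compares "$string[$index]" with a character-aware lookup. `utf8_char_at` is a hypothetical helper name, not part of the runtime. It assumes the input is already valid UTF-8, and when mbstring is missing it falls back to PCRE's `/u` modifier rather than to byte indexing, since byte indexing would bring the original problem back.

```php
<?php
// Hypothetical helper: returns the character (not the byte) at $index.
// Assumes $string is valid UTF-8.
function utf8_char_at($string, $index)
{
    if (function_exists('mb_substr')) {
        return mb_substr($string, $index, 1, 'UTF-8');
    }
    // No mbstring: split on character boundaries with PCRE's /u modifier.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    return isset($chars[$index]) ? $chars[$index] : '';
}

$input = "\xC3\xA9t\xC3\xA9";    // "été": 3 characters, 5 bytes

$byte = $input[1];               // second byte of the first "é" ("\xA9"), not a character
$char = utf8_char_at($input, 1); // "t"
```

The PCRE fallback rebuilds the character array on every call, so it is only reasonable for short inputs or one-off lookups.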

Original issue reported on code.google.com by fleshcol...@gmail.com on 11 Apr 2010 at 8:15

GoogleCodeExporter commented 9 years ago
Probably the simplest way is to replace ord(), chr(), and strlen() in the code
with Unicode-aware versions, which can be added to util.php:

    function unistrlen ($s) {
        return mb_strlen($s, 'utf-8');
    }

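    function uniord($c) {
        // Decode the first UTF-8 character of $c into its code point.
        // $c[n] replaces the $c{n} offset syntax, which was removed in PHP 8.
        $h = ord($c[0]);
        if ($h <= 0x7F) {
            return $h;                      // 1 byte (ASCII)
        } else if ($h < 0xC2) {
            return false;                   // continuation byte or overlong lead byte
        } else if ($h <= 0xDF) {            // 2-byte sequence
            return ($h & 0x1F) << 6 | (ord($c[1]) & 0x3F);
        } else if ($h <= 0xEF) {            // 3-byte sequence
            return ($h & 0x0F) << 12 | (ord($c[1]) & 0x3F) << 6
                                     | (ord($c[2]) & 0x3F);
        } else if ($h <= 0xF4) {            // 4-byte sequence
            return ($h & 0x07) << 18 | (ord($c[1]) & 0x3F) << 12
                                     | (ord($c[2]) & 0x3F) << 6
                                     | (ord($c[3]) & 0x3F);
        } else {
            return false;                   // not a valid UTF-8 lead byte
        }
    }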

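    function unichr($u) {
        // Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2;
        // mb_chr($u, 'UTF-8') is available from PHP 7.2.
        return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
    }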

    function strToIntArray($string){
        $arr = array();

        for($i=0,$n=mb_strlen($string, 'utf-8'); $i<$n; $i++)
            $arr[] = uniord(mb_substr($string, $i, 1, 'utf-8'));

        return $arr;
    }

    function charAt($str, $i){
        return uniord(mb_substr($str, $i, 1, 'utf-8'));
    }
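
The unichr() above depends on the 'HTML-ENTITIES' pseudo-encoding, which newer PHP versions deprecate. As a rough alternative, the sketch below encodes a code point as UTF-8 using mb_chr() where it exists (PHP 7.2 or later with mbstring) and plain bit arithmetic otherwise. `codepoint_to_utf8` is a hypothetical name, and rejecting surrogates and out-of-range values with `false` is an assumption chosen to mirror how uniord() signals errors.

```php
<?php
// Hypothetical drop-in for unichr(): encodes a Unicode code point as UTF-8.
// Returns false for values that are not valid Unicode scalar values.
function codepoint_to_utf8($u)
{
    $u = (int) $u;
    if ($u < 0 || $u > 0x10FFFF || ($u >= 0xD800 && $u <= 0xDFFF)) {
        return false;
    }
    if (function_exists('mb_chr')) {
        return mb_chr($u, 'UTF-8');
    }
    // No mb_chr(): build the byte sequence by hand.
    if ($u <= 0x7F) {
        return chr($u);                                   // 1 byte
    }
    if ($u <= 0x7FF) {
        return chr(0xC0 | ($u >> 6))                      // 2 bytes
             . chr(0x80 | ($u & 0x3F));
    }
    if ($u <= 0xFFFF) {
        return chr(0xE0 | ($u >> 12))                     // 3 bytes
             . chr(0x80 | (($u >> 6) & 0x3F))
             . chr(0x80 | ($u & 0x3F));
    }
    return chr(0xF0 | ($u >> 18))                         // 4 bytes
         . chr(0x80 | (($u >> 12) & 0x3F))
         . chr(0x80 | (($u >> 6) & 0x3F))
         . chr(0x80 | ($u & 0x3F));
}

$eAcute = codepoint_to_utf8(0xE9);   // "é" as the two bytes 0xC3 0xA9
```

In the same spirit, mb_ord($char, 'UTF-8') could stand in for uniord() on PHP 7.2 or later.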

Original comment by fleshcol...@gmail.com on 11 Apr 2010 at 9:39

GoogleCodeExporter commented 9 years ago
My workaround was to transcode the string to Latin-1 before calling the parser
and lexer. I know that I'm losing some UTF-8 characters, but accents are OK.

    function parser($expr){
        $expr = mb_convert_encoding($expr, 'ISO-8859-1', 'UTF-8');
        $ass = new ANTLRStringStream($expr);
        $lex = new DAMAaaSLexer($ass);
        $cts = new CommonTokenStream($lex);
        $par = new DAMAaaSParser($cts);
        return $par;
    }
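
Because this workaround drops anything outside Latin-1, it may help to detect such input before transcoding. The following is a minimal sketch of that check; `fits_in_latin1` is a hypothetical helper, and it assumes that "representable in ISO-8859-1" means every code point lies in U+0000..U+00FF.

```php
<?php
// Hypothetical guard: true when every character of a UTF-8 string has a
// code point in U+0000..U+00FF, i.e. can be represented in ISO-8859-1.
// Invalid UTF-8 makes preg_match() return false, so it is reported as
// "does not fit".
function fits_in_latin1($utf8)
{
    return preg_match('/\A[\x{00}-\x{FF}]*\z/u', $utf8) === 1;
}

$ok  = fits_in_latin1("caf\xC3\xA9");     // "café": true
$bad = fits_in_latin1("5 \xE2\x82\xAC");  // "5 €": false, U+20AC is outside Latin-1
```

A caller could run this check first and reject or report the input instead of silently losing characters.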

Original comment by nicolas....@gmail.com on 7 Nov 2011 at 10:46

GoogleCodeExporter commented 9 years ago
@fleshcoloured: The modifications you suggest are not enough.

During parsing, there are calls to strlen, which returns the wrong length
for a string containing 2-byte characters (it counts bytes, not characters).
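
One way to sidestep the strlen problem is to decode the input once into an array of characters, so that both length and indexing are per character. The class below is only a sketch of that idea: `Utf8CharBuffer` is a hypothetical name, it is not the runtime's ANTLRStringStream, and it uses PCRE's `/u` modifier instead of the mb_* functions discussed above so that it does not depend on mbstring.

```php
<?php
// Hypothetical sketch, not the runtime's ANTLRStringStream: decodes the
// input once so that length and indexing are per character, not per byte.
class Utf8CharBuffer
{
    private $chars;

    public function __construct($utf8)
    {
        // PCRE's /u modifier splits on UTF-8 character boundaries.
        $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
        if ($chars === false) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->chars = $chars;
    }

    // Number of characters, the value the parser would need instead of strlen().
    public function size()
    {
        return count($this->chars);
    }

    // Character at index $i, or null when $i is out of range.
    public function charAt($i)
    {
        return isset($this->chars[$i]) ? $this->chars[$i] : null;
    }
}

$input  = "na\xC3\xAFve";           // "naïve"
$buffer = new Utf8CharBuffer($input);
$bytes  = strlen($input);           // 6 bytes
$chars  = $buffer->size();          // 5 characters
```

The cost is extra memory for the array, in exchange for constant-time lookups and a length that matches what the lexer sees.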

Original comment by nicolas....@gmail.com on 7 Nov 2011 at 10:48