lgyers / firephp

Automatically exported from code.google.com/p/firephp

Dump is truncated when a string has non-UTF-8 characters #45

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Make an array or object containing the string "mon numéro est le 0".

The dump truncates the string just before the "é".

This happens because the native json_encode accepts only UTF-8 strings, and
source files are not always encoded in UTF-8.
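
A minimal sketch of the failure, assuming PHP 7 or later, where json_encode() returns false on invalid UTF-8 instead of silently truncating as reported here for PHP 5.2. The byte 0xE9 is "é" in ISO-8859-1; the variable names are illustrative only.

```php
<?php
// "mon numéro est le 0" as ISO-8859-1 bytes: "é" is the single byte 0xE9.
$latin1 = "mon num\xE9ro est le 0";

// json_encode() expects UTF-8; 0xE9 followed by "r" is not a valid sequence.
$encoded = json_encode(array('index' => $latin1));
$failed = ($encoded === false && json_last_error() === JSON_ERROR_UTF8);

// The same text as UTF-8 ("é" is 0xC3 0xA9) encodes without trouble.
$utf8 = "mon num\xC3\xA9ro est le 0";
$ok = json_encode(array('index' => $utf8));
```

Checking json_last_error() after a failed call is what distinguishes an encoding problem from other JSON errors.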

Original issue reported on code.google.com by tit...@gmail.com on 10 Oct 2008 at 4:50

GoogleCodeExporter commented 9 years ago
Workaround:
At the end of encodeObject, replace
    } else {
      return $Object;
    }
    return $return;
  }
With
    } else {
      return utf8_encode($Object);
    }
    return $return;
  }

Original comment by tit...@gmail.com on 10 Oct 2008 at 4:58

GoogleCodeExporter commented 9 years ago
I have tested this with 0.2.b.2 (extension) 0.2.b.4 (php lib) and everything is
working as expected.

Also the fix you are proposing will not work as utf8_encode will only work with
strings and not arrays.

Please re-test and send me a test case including a spec for your environment.

Original comment by christ...@christophdorn.com on 14 Oct 2008 at 1:36

GoogleCodeExporter commented 9 years ago
I have the same versions as you.
The fix I gave works for me, but only because the problem was on a leaf in my
case, so $Object was probably a string. You're right, it will fail in other cases.

Attached is a test file. Just run it, and take care not to save or edit it, as
your editor could change the file encoding (you are probably using UTF-8 by
default, but we are using ISO-8859-1).

My test environments were Windows or Solaris, PHP 5.2.4 on Apache 2.2.6.

Here is the output I got:
{"index":"mon num","mon num":"index"}

Original comment by tit...@gmail.com on 14 Oct 2008 at 6:12

Attachments:

GoogleCodeExporter commented 9 years ago
This seems to be an issue with the JSON PHP extension, not FirePHP. You can file
a bug report here:
http://pecl.php.net/bugs/search.php?cmd=display&status=Open&package_name[]=json

It looks like there is already a bug for this:
http://pecl.php.net/bugs/bug.php?id=10083

Original comment by christ...@christophdorn.com on 15 Oct 2008 at 12:35

GoogleCodeExporter commented 9 years ago
If you can send me a working patch for FirePHPCore that respects objects, arrays
and strings, I can add an option to the library to force use of the included JSON
encoder (instead of json_encode) to get this working.
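
One possible shape for such a patch, as a sketch only: walk arrays and public object properties recursively and convert only the string leaves that are not already valid UTF-8. The function names `to_utf8_deep` and `latin1_to_utf8` are hypothetical and not part of FirePHPCore, and the sketch assumes that any non-UTF-8 string is ISO-8859-1.

```php
<?php
// Hypothetical helper: ISO-8859-1 to UTF-8 without utf8_encode() or extensions.
function latin1_to_utf8(string $s): string {
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($s[$i]);
        // Bytes 0x80-0xFF become two-byte sequences; ASCII passes through.
        $out .= ($c < 0x80) ? $s[$i] : (chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F)));
    }
    return $out;
}

// Hypothetical helper: returns a copy of $value with every string leaf in UTF-8.
function to_utf8_deep($value) {
    if (is_string($value)) {
        // preg_match() with /u returns 1 only when the subject is valid UTF-8.
        return preg_match('//u', $value) === 1 ? $value : latin1_to_utf8($value);
    }
    if (is_array($value)) {
        $result = array();
        foreach ($value as $key => $item) {
            $result[to_utf8_deep($key)] = to_utf8_deep($item);
        }
        return $result;
    }
    if (is_object($value)) {
        // Only public properties are visible here; objects come back as arrays.
        return to_utf8_deep(get_object_vars($value));
    }
    return $value; // int, float, bool, null
}

$data = array('index' => "mon num\xE9ro", "mon num\xE9ro" => 'index');
$json = json_encode(to_utf8_deep($data));
```

The sketch does not guard against cyclic object references, so a real patch would also need a depth limit or a visited list.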

Original comment by christ...@christophdorn.com on 15 Oct 2008 at 10:42

GoogleCodeExporter commented 9 years ago
I have implemented a fix for this. The patch you submitted actually works
properly.

In the next beta you will be able to set the "useNativeJsonEncode" option to
"false" to bypass json_encode() and use the included encoder with the fix.

Original comment by christ...@christophdorn.com on 16 Oct 2008 at 9:35

GoogleCodeExporter commented 9 years ago
I have upgraded to the latest FirePHP (b4) and FirePHPCore (b7).
It's OK, the bug is fixed.

I know the regex approach is the one suggested by
http://w3.org/International/questions/qa-forms-utf-8.html, but it is slow,
especially when the string is long and the invalid byte is near the end of the
string.
A test like (utf8_encode(utf8_decode($string)) == $string) is much faster.
The fastest way I found is:
return strpos(utf8_encode($string),chr(131),0) !== false

I ran my tests on SPARC Solaris and on a Core 2 Duo running XP.

Original comment by tit...@gmail.com on 17 Oct 2008 at 9:33

GoogleCodeExporter commented 9 years ago
Is this reliable in 100% of cases?

strpos(utf8_encode($string),chr(131),0) !== false
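
A sketch of why this check cannot be relied on. utf8_encode() treats every input byte as an ISO-8859-1 character, so the byte 0xC3 (the UTF-8 lead byte for U+00C0 to U+00FF, which covers é, è and à) becomes 0xC3 0x83, and chr(131) is 0x83. The check therefore only asks whether the input contains a 0xC3 byte. The helper `latin1_to_utf8_bytes` below is a hypothetical stand-in for utf8_encode() (deprecated since PHP 8.2), and `looks_like_utf8` is an illustrative name.

```php
<?php
// Hypothetical stand-in for utf8_encode(): every byte is read as ISO-8859-1.
function latin1_to_utf8_bytes(string $s): string {
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($s[$i]);
        $out .= ($c < 0x80) ? $s[$i] : (chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F)));
    }
    return $out;
}

// The check from the comment above, with the stand-in swapped in.
function looks_like_utf8(string $s): bool {
    return strpos(latin1_to_utf8_bytes($s), chr(131), 0) !== false;
}

$accent = looks_like_utf8("ol\xC3\xA9");    // UTF-8 "olé": contains 0xC3, detected
$ascii  = looks_like_utf8("hello");         // valid UTF-8, yet reported as not UTF-8
$euro   = looks_like_utf8("\xE2\x82\xAC");  // UTF-8 "€": no 0xC3 byte, missed
$atilde = looks_like_utf8("\xC3 la carte"); // ISO-8859-1 "Ã la carte": false positive
```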

Original comment by christ...@christophdorn.com on 17 Oct 2008 at 5:22

GoogleCodeExporter commented 9 years ago
The above test failed my tests. I am now using:

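  protected static function is_utf8($str) {
    $c=0; $b=0;
    $bits=0;
    $len=strlen($str);
    for($i=0; $i<$len; $i++){
        $c=ord($str[$i]);
        // Bytes below 128 are plain ASCII; anything else must start a sequence.
        // (>= rather than >, so that a stray 0x80 byte is rejected too.)
        if($c >= 128){
            // The lead byte gives the sequence length; $bits is a byte count.
            if(($c >= 254)) return false;
            elseif($c >= 252) $bits=6;
            elseif($c >= 248) $bits=5;
            elseif($c >= 240) $bits=4;
            elseif($c >= 224) $bits=3;
            elseif($c >= 192) $bits=2;
            else return false; // continuation byte without a lead byte
            // Reject a sequence that would run past the end of the string.
            if(($i+$bits) > $len) return false;
            // Every remaining byte must be a continuation byte (128-191).
            while($bits > 1){
                $i++;
                $b=ord($str[$i]);
                if($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
  } 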

Original comment by christ...@christophdorn.com on 17 Oct 2008 at 7:21

GoogleCodeExporter commented 9 years ago
Please test 0.2.b.8 to ensure all is working now.

Original comment by christ...@christophdorn.com on 17 Oct 2008 at 7:47

GoogleCodeExporter commented 9 years ago
Works OK, but this function seems even slower than the regex one.

This is my test:
$sentence = "ceci est une tres longue phrase avec un accent uniquement a la fin: olé";
$sentence_utf8 = utf8_encode("ceci est une tres longue phrase avec un accent uniquement a la fin: olé");

I call each UTF-8 check function 100 times on both sentences and record the time
for the false and true cases.

utf8_encode(utf8_decode($s)) == $s:
false: 1.24ms
true:  1.27ms

regexp:
false: 12.31ms
true:  4.34ms

if/else ord (the current implementation in b8):
false: 21.33ms
true:  21.45ms
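
A comparison like this can be reproduced with a small timing harness. The following is only a sketch: `time_check` and `$candidate` are hypothetical names, the two inputs spell out the final "é" as explicit bytes so the file encoding does not matter, and the numbers it prints depend entirely on the machine and PHP version.

```php
<?php
// Hypothetical harness: total wall-clock time in milliseconds for $runs calls.
function time_check(callable $check, string $input, int $runs = 100): float {
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $check($input);
    }
    return (microtime(true) - $start) * 1000.0;
}

// Same sentence twice: ISO-8859-1 (0xE9) and UTF-8 (0xC3 0xA9) for the final "é".
$latin1 = "ceci est une tres longue phrase avec un accent uniquement a la fin: ol\xE9";
$utf8   = "ceci est une tres longue phrase avec un accent uniquement a la fin: ol\xC3\xA9";

// Candidate under test; swap in any other is_utf8 implementation here.
$candidate = function (string $s): bool {
    return preg_match('//u', $s) === 1;
};

printf("false: %.2fms\n", time_check($candidate, $latin1));
printf("true:  %.2fms\n", time_check($candidate, $utf8));
```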

Original comment by tit...@gmail.com on 20 Oct 2008 at 9:10

GoogleCodeExporter commented 9 years ago
Thanks for the benchmarks. If you can find a faster solution that is reliable,
let me know. Until then I am going to use the current implementation.

This is a debug tool so I am not too concerned about performance at this time.

Original comment by christ...@christophdorn.com on 20 Oct 2008 at 5:49

GoogleCodeExporter commented 9 years ago
Isn't utf8_encode(utf8_decode($s)) reliable?

I agree that this is a debug tool, but I was logging a lot of long strings during
a session, and this made me hit the max CPU time. That is why I tried to find out
what was eating the CPU, and it is the only reason I am looking for a fast
algorithm.

Original comment by tit...@gmail.com on 21 Oct 2008 at 7:12

GoogleCodeExporter commented 9 years ago
Not in my tests. No.

Original comment by christ...@christophdorn.com on 21 Oct 2008 at 6:39

GoogleCodeExporter commented 9 years ago

Original comment by christ...@christophdorn.com on 22 Oct 2008 at 5:13

GoogleCodeExporter commented 9 years ago
Could you check your ISO-8859-1 encoded files to see if FirePHP works properly
with the "useNativeJsonEncode" option set to both TRUE and FALSE?

Original comment by christ...@christophdorn.com on 3 Nov 2008 at 8:53

GoogleCodeExporter commented 9 years ago
I think there is a misunderstanding.
I just read http://www.firephp.org/HQ/Use.htm about "Options". You said
useNativeJsonEncode should be set to FALSE for ISO-8859-1. This is not true. I
have never set this option to false, as I want to use PHP's native json_encode
for performance.
It works fine, since you are encoding non-UTF-8 strings to UTF-8 in encodeObject().
This is not specific to ISO-8859: the same problem will occur with any encoding
that has accented characters (German, Spanish, ...). All characters that are not
shared with the UTF-8 table should be UTF-8 encoded.

Anyway, I did some tests this morning with useNativeJsonEncode set to false.
Everything works fine, just as when it is set to true. It is just 4 times slower
for the same set of tests: 80ms with native set to true vs 240ms with native set
to false.

Could you give me the tests that made utf8_encode(utf8_decode($s)) or
strpos(utf8_encode($string),chr(131),0) fail?
I found the last one in a forum, and I did not really dig into why chr(131) is used.
But I don't understand how the encode/decode check could fail.

Original comment by tit...@gmail.com on 4 Nov 2008 at 7:30

GoogleCodeExporter commented 9 years ago
Great. I did not think the useNativeJsonEncode option affected the encoding any
more. It did in the beginning, but I wanted to make sure. I will update the
documentation.

As for utf8_encode(utf8_decode($s)): they are complementary. They will decode and
encode even UTF-8 strings. At least that is what happened in my tests. These are
my test files:

http://code.google.com/p/firephp/source/browse/trunk/DevApp/application/tests/ServerLibraries/FirePHPCore/UTF8.php

http://code.google.com/p/firephp/source/browse/trunk/DevApp/application/bootstrap/plain/Check-ISO-8859-1.php

You need to load the second file into your editor with ISO-8859-1 encoding.

Original comment by christ...@christophdorn.com on 5 Nov 2008 at 7:31

GoogleCodeExporter commented 9 years ago
I confirm that all the other methods, including the first regex one, failed.
Only the current one passes all my tests.

I found a new one that passes all my tests.
Could you tell me whether this function works for you?

function is_utf8($s) {
   return preg_match('/./u',$s) > 0;
}

Because of /u, preg_match bails out if the test string contains non-UTF-8
characters and then reports no match.
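
One caveat with the '/./u' pattern, shown as a small sketch: '.' needs at least one non-newline character to match, so an empty string or a string made only of newlines is reported as not UTF-8 even though it is valid. An empty pattern avoids that while still relying on the validation done by /u. The names `is_utf8_dot` and `is_utf8_empty_pattern` are illustrative.

```php
<?php
// The check proposed above, unchanged apart from the name.
function is_utf8_dot($s) {
    return preg_match('/./u', $s) > 0;
}

// Variant with an empty pattern: it matches at offset 0 of any valid subject,
// and preg_match() returns false when the subject is not valid UTF-8.
function is_utf8_empty_pattern($s) {
    return preg_match('//u', $s) === 1;
}

$dot_on_empty   = is_utf8_dot("");                      // false, although "" is valid
$dot_on_newline = is_utf8_dot("\n");                    // false, although "\n" is valid
$empty_ok       = is_utf8_empty_pattern("");            // true
$newline_ok     = is_utf8_empty_pattern("\n");          // true
$latin1_bad     = is_utf8_empty_pattern("ol\xE9");      // false: ISO-8859-1 "olé"
$utf8_ok        = is_utf8_empty_pattern("ol\xC3\xA9");  // true: UTF-8 "olé"
```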

Original comment by tit...@gmail.com on 6 Nov 2008 at 12:41

GoogleCodeExporter commented 9 years ago
I'll take a look at this solution. Thanks!

Original comment by christ...@christophdorn.com on 6 Nov 2008 at 8:31

GoogleCodeExporter commented 9 years ago
Have you looked into using the multibyte string functions?

http://us.php.net/manual/en/ref.mbstring.php

Original comment by christ...@christophdorn.com on 7 Nov 2008 at 6:26

GoogleCodeExporter commented 9 years ago
I thought about this library, but I gave up on it because it is not part of the
PHP core. It would add a dependency on the mbstring extension, which would not be
a good thing for FirePHP.
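
The dependency could be kept optional. As a sketch only, with the hypothetical name `is_utf8_portable`: use mb_check_encoding() when the mbstring extension happens to be loaded, and fall back to the PCRE check otherwise.

```php
<?php
// Hypothetical helper: prefers mbstring when present, never requires it.
function is_utf8_portable(string $s): bool {
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($s, 'UTF-8');
    }
    // Fallback: preg_match() with /u returns 1 only for a valid UTF-8 subject.
    return preg_match('//u', $s) === 1;
}

$valid   = is_utf8_portable("mon num\xC3\xA9ro"); // UTF-8 bytes for "é"
$invalid = is_utf8_portable("mon num\xE9ro");     // ISO-8859-1 byte for "é"
```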

Original comment by tit...@gmail.com on 7 Nov 2008 at 7:19

GoogleCodeExporter commented 9 years ago

Original comment by christ...@christophdorn.com on 23 Mar 2009 at 12:12