semsol / arc2

ARC RDF Classes for PHP

Encoding issue in the NTriplesSerializer #67

Open coreation opened 9 years ago

coreation commented 9 years ago

Hi,

I found a problem with the escape function.

Problem: in "André", the é is correctly escaped as \u00E9 (iirc). In "Andréé", however, the éé is replaced with \uAAA9 (a square character).

What I found through debugging is that when the characters are passed through preg_replace_callback, the "éé" sequence is seen as one character, even by mb_strlen. However, if I comment out the utf8_decode call on the second line of the escape function, the "éé" sequence is escaped properly as two \u00E9 sequences.

My guess is that utf8_decode unintentionally decodes an already valid UTF-8 string (why is this necessary in the first place?), and this breaks mb_strlen: UTF-8 is passed as the character encoding, yet after utf8_decode the string is actually ISO-8859-1.
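
The suspected mechanism can be sketched at the byte level. This is a minimal illustration in Python, not ARC2's code: `utf8_decode_like_php` is a hypothetical helper that approximates what PHP's `utf8_decode` does (UTF-8 bytes to ISO-8859-1 bytes), and `escape_non_ascii` is an assumed stand-in for the serializer's escaping step, not its actual implementation.

```python
def utf8_decode_like_php(data: bytes) -> bytes:
    """Approximate PHP's utf8_decode: convert UTF-8 bytes to ISO-8859-1 bytes.

    Characters that do not fit in ISO-8859-1 become '?'.
    """
    return data.decode("utf-8").encode("latin-1", errors="replace")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def escape_non_ascii(text: str) -> str:
    """Escape each non-ASCII code point as \\uXXXX (or \\UXXXXXXXX above U+FFFF)."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)
        elif cp <= 0xFFFF:
            parts.append("\\u%04X" % cp)
        else:
            parts.append("\\U%08X" % cp)
    return "".join(parts)


name = "Andr\u00e9\u00e9"                 # "Andréé"
original = name.encode("utf-8")           # hex: 416e6472c3a9c3a9
converted = utf8_decode_like_php(original)  # hex: 416e6472e9e9

# The converted bytes are ISO-8859-1. Read as UTF-8, 0xE9 is the lead byte of
# a three-byte sequence, so the two trailing bytes no longer form two characters.
original_is_utf8 = is_valid_utf8(original)    # True
converted_is_utf8 = is_valid_utf8(converted)  # False

# Escaping the untouched UTF-8 text yields one escape per "é".
escaped = escape_non_ascii(name)
```

After the conversion, any function that is told the string is UTF-8 is working on bytes that are not valid UTF-8, which is consistent with "éé" being treated as one malformed character. Escaping the string without the conversion gives two separate `\u00E9` escapes.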