tchwork / utf8

Portable and performant UTF-8, Unicode and Grapheme Clusters for PHP
Apache License 2.0

::filter doesn't work well #7

Closed jonnybarnes closed 11 years ago

jonnybarnes commented 11 years ago

This is where someone tells me I'm doing this completely wrong, but given the following code

<?php
include("vendor/autoload.php");

use \Patchwork\Utf8 as u;

\Patchwork\Utf8\Bootup::initAll();
\Patchwork\Utf8\Bootup::filterRequestUri();
\Patchwork\Utf8\Bootup::filterRequestInputs();

header("Content-Type: text/html; charset=utf-8");

$txt = "Iñtërnâtiônàlizætiøn \xFC\xA1\xA1\xA1\xA1\xA1";

if(u::isUtf8($txt) != true) {
    $txt = u::filter($txt);
}

echo $txt;

I'd like the i18n word to be preserved. Instead the output is IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n ü¡¡¡¡¡. I'd like the output to be more like Iñtërnâtiônàlizætiøn.

nicolas-grekas commented 11 years ago

This is the expected behavior, but documentation lacks a bit...

I slightly updated the readme on this point, see the penultimate paragraph in the Usage section.

The reasoning is the following:

jonnybarnes commented 11 years ago

So just to clarify, and I don’t mean to sound like a prick, but the expected behaviour is that my perfectly encoded utf-8 word gets mangled when there is some trailing invalid utf-8 by the ::filter() method?

Having read S3.6.1 I can see why you wouldn't want to remove the invalid bytes. But why does Iñtërnâtiônàlizætiøn get turned into IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n?

Again, sorry if I'm coming across as a prick asking these questions.

nicolas-grekas commented 11 years ago

This is a tricky point, you are right to ask, no problem at all.

Your word is perfectly valid UTF-8, but the whole string is not, and u::filter() works on the string as a whole. In your case, it checks whether the full string is valid UTF-8, which it is not. It then assumes CP-1252 (this is also the choice of HTML5) and converts the string to UTF-8. This conversion does not see the "ñ" as a single character but as two CP-1252 bytes, 0xC3 and 0xB1, which are converted to two UTF-8 characters: Ã then ±.
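The behaviour described above can be approximated in a few lines of plain PHP. This is only a sketch of the idea, not the library's implementation: `sketchFilter` is a hypothetical name, the validity check via `preg_match('//u', ...)` is an assumption about one way to test a whole string, and the conversion assumes the mbstring extension is available.

```php
<?php
// Hypothetical stand-in for the behaviour described in this thread;
// it is NOT the code of \Patchwork\Utf8::filter().
function sketchFilter(string $s): string
{
    // With the /u modifier, preg_match() returns false when the
    // subject is not valid UTF-8, and 1 when the empty pattern matches.
    if (preg_match('//u', $s) === 1) {
        return $s; // the whole string is valid UTF-8: leave it alone
    }

    // Otherwise treat EVERY byte as a Windows-1252 character,
    // including bytes that happened to form valid UTF-8 sequences.
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

$valid   = "\xC3\xB1";      // "ñ" as valid UTF-8
$invalid = "\xC3\xB1 \xFC"; // "ñ" followed by a stray 0xFC byte

$keptAsIs  = sketchFilter($valid);
$converted = sketchFilter($invalid);
```

In this sketch `$keptAsIs` is still "ñ", while `$converted` becomes "Ã± ü": the single trailing 0xFC makes the whole string invalid, so the two bytes of "ñ" are reinterpreted as separate CP-1252 characters.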

Do you have a real case where this string can come up in your data flow? No browser has behaved like that for years, so it doesn't happen in real life. But prove me wrong :)

jonnybarnes commented 11 years ago

I’m just playing around trying to understand how UTF-8 works and am writing a little script to hex-dump the byte values of a UTF-8 string: https://gist.github.com/jonnybarnes/6951138

So I suppose it’s not a real case of invalid utf-8 coming up in my data flow. And to be honest, other than manually creating some invalid utf-8 a la $invutf8 = "\xC0\xC1" I have no idea how one would paste invalid UTF-8 into the textarea. But I was thinking hypothetically if someone did.

If I set the default value of $txt to include some invalid bytes as well as the fancy i18n word then as I said above the whole word gets garbled.

But as you said, the only sensible way of dealing with an invalid UTF-8 string is to convert the characters into UTF-8, which is causing the valid portion of the string to get converted as well.
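A byte-level dump of the kind mentioned above can be sketched in a couple of lines. This is an illustration only and not the contents of the linked gist; `hexDumpBytes` is a hypothetical name.

```php
<?php
// Hypothetical helper: show each byte of a string as two hex digits.
function hexDumpBytes(string $s): string
{
    // bin2hex() works on raw bytes, so it is safe for invalid UTF-8 too.
    $hex = strtoupper(bin2hex($s));

    // Split into byte-sized pairs and join them with spaces.
    return implode(' ', str_split($hex, 2));
}

$validDump   = hexDumpBytes("\xC3\xB1"); // "ñ" encoded as UTF-8
$invalidDump = hexDumpBytes("\xC0\xC1"); // the invalid bytes from the comment above
```

Here `$validDump` is "C3 B1" and `$invalidDump` is "C0 C1", which makes it easy to see which bytes a filtering step changed.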

nicolas-grekas commented 11 years ago

I hope I answered your question. BTW, you should understand now that you shouldn't call isUtf8 before calling filter.

jonnybarnes commented 11 years ago

So would a decent workflow be to filter inputs and then if (!isUtf8($input)) { throw an error }?

nicolas-grekas commented 11 years ago

Filtering your input with u::filter() guarantees that you will get UTF-8, so the exception will never ever be thrown.

nicolas-grekas commented 11 years ago

In fact, this is what \Patchwork\Utf8\Bootup::filterRequestInputs(); does for all autoglobals ($_GET, $_POST, etc.)!
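As a rough illustration of filtering a whole input array, the sketch below walks a nested array and filters every string value. It is an assumption-laden stand-in, not the code of `filterRequestInputs()`: both function names are hypothetical, only values (not keys) are handled, and the conversion assumes the mbstring extension.

```php
<?php
// Hypothetical stand-in for u::filter(): keep valid UTF-8,
// otherwise reinterpret the whole string as Windows-1252.
function sketchFilterString(string $s): string
{
    return preg_match('//u', $s) === 1
        ? $s
        : mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

// Hypothetical helper: filter every string value in a nested array.
function sketchFilterInputs(array $input): array
{
    foreach ($input as $key => $value) {
        if (is_array($value)) {
            $input[$key] = sketchFilterInputs($value); // recurse into sub-arrays
        } elseif (is_string($value)) {
            $input[$key] = sketchFilterString($value);
        }
    }
    return $input;
}

// A manually constructed array standing in for $_GET.
$get = [
    'name' => "caf\xE9",               // "café" in CP-1252, invalid as UTF-8
    'tags' => ["\xC3\xA9t\xC3\xA9"],   // "été", already valid UTF-8
];
$clean = sketchFilterInputs($get);
```

After this runs, every string in `$clean` is valid UTF-8, so a later validity check would have nothing left to reject.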

jonnybarnes commented 11 years ago

I was just about to say I'm using ::filterRequestInputs(). I love that in the test file you can manually construct the $_GET variable and the ::filterRequestInputs() method will still filter it.

Thanks for the help :)