jonnybarnes closed this 11 years ago
This is the expected behavior, but the documentation is a bit lacking...
I slightly updated the readme on this point; see the penultimate paragraph in the Usage section.
The reasoning is the following:
So just to clarify, and I don’t mean to sound like a prick, but the expected behaviour is that my perfectly encoded UTF-8 word gets mangled by the ::filter() method when there is some trailing invalid UTF-8?
Having read S3.6.1 I can see why you wouldn't want to remove the invalid bytes. But why does Iñtërnâtiônàlizætiøn get turned into Iñtërnâtiônà lizætiøn?
Again, sorry if I'm coming across as a prick asking these questions.
This is a tricky point, you are right to ask, no problem at all.
Your word is perfectly valid UTF-8, but the whole string is not, and u::filter() works on whole strings. In your case, it checks whether the full string is valid UTF-8, which it is not. It then assumes CP-1252 (this is also the choice of HTML5) and converts the string to UTF-8. This conversion does not see the "ñ" as a single character, but as two CP-1252 bytes (0xC3 and 0xB1), which are converted to two UTF-8 characters: Ã then ±.
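To make the steps above concrete (check the whole string, fall back to CP-1252, convert), here is a minimal sketch. It assumes the mbstring extension is available; filter_sketch() is a hypothetical stand-in written for illustration, not the library's code, and the real u::filter() may well do more than this.

```php
<?php
// Hypothetical stand-in for the behaviour described above; NOT the library's implementation.
function filter_sketch(string $s): string
{
    // Step 1: a string that is valid UTF-8 as a whole is returned unchanged.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }

    // Step 2: otherwise EVERY byte is read as CP-1252 and re-encoded as UTF-8,
    // including the bytes that happened to form valid UTF-8 sequences.
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

$valid = "I\xC3\xB1t";          // "Iñt", valid UTF-8
$mixed = "I\xC3\xB1t\xC0\xC1";  // the same word followed by two invalid bytes

echo bin2hex(filter_sketch($valid)), "\n"; // 49c3b174 (unchanged)
echo bin2hex(filter_sketch($mixed)), "\n"; // 49c383c2b174c380c381
```

In the second result, the two bytes of "ñ" (c3 b1) have become c3 83 (Ã) and c2 b1 (±), which is the garbling seen in the output.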
Do you have a real case where this string can come up in your data flow? No browser has behaved like that for years, so it doesn't happen in real life. But prove me wrong :)
I’m just playing around trying to understand how UTF-8 works and am writing a little script to hex-dump the byte values of a UTF-8 string: https://gist.github.com/jonnybarnes/6951138
So I suppose it’s not a real case of invalid UTF-8 coming up in my data flow. And to be honest, other than manually creating some invalid UTF-8, à la $invutf8 = "\xC0\xC1", I have no idea how one would paste invalid UTF-8 into the textarea. But I was thinking hypothetically, if someone did.
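For anyone following along, a byte-level hex dump of the kind mentioned can be sketched roughly like this. It is an illustration only, not the contents of the gist; hex_dump_bytes() is a made-up name, and the validity check assumes the mbstring extension.

```php
<?php
// Hypothetical helper: shows each byte of a string as a two-digit hex pair.
function hex_dump_bytes(string $s): string
{
    // bin2hex() works on raw bytes, so a multi-byte character yields several pairs.
    return strtoupper(implode(' ', str_split(bin2hex($s), 2)));
}

$invutf8 = "\xC0\xC1"; // hand-made invalid UTF-8, as in the comment above

echo hex_dump_bytes("\xC3\xB1"), "\n";           // C3 B1 (the two bytes of "ñ")
echo hex_dump_bytes($invutf8), "\n";             // C0 C1
var_dump(mb_check_encoding($invutf8, 'UTF-8'));  // bool(false)
```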
If I set the default value of $txt to include some invalid bytes as well as the fancy i18n word then, as I said above, the whole word gets garbled.
But as you said, the only sensible way of dealing with an invalid UTF-8 string is to convert the characters into UTF-8, which causes the valid portion of the string to get converted as well.
I hope I answered your question. BTW, you should understand now that you shouldn't call isUtf8 before calling filter.
So would a decent workflow be to filter inputs, then if (!isUtf8($input)) { throw an error }?
Filtering your input with u::filter() guarantees that you will get UTF-8, so the exception will never be thrown.
In fact, this is what \Patchwork\Utf8\Bootup::filterRequestInputs(); does for all autoglobals ($_GET, $_POST, etc.)!
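As a rough picture of what filtering a whole input array means, here is a sketch that applies a "valid UTF-8, or else treat as CP-1252" rule to every string value in a nested array. filter_inputs_sketch() is a hypothetical helper, not the library's filterRequestInputs(); it assumes the mbstring extension and, unlike a real request filter might, it only touches values and leaves array keys alone.

```php
<?php
// Hypothetical helper, NOT part of the library: filters every string value
// in a (possibly nested) array with a whole-string UTF-8 check.
function filter_inputs_sketch(array $inputs): array
{
    array_walk_recursive($inputs, function (&$value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            // Invalid as a whole: reinterpret the bytes as CP-1252.
            $value = mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
        }
    });

    return $inputs;
}

// A hand-built array standing in for $_GET.
$fakeGet = ['q' => "caf\xE9", 'tags' => ['ok', "\xC0\xC1"]];
$clean = filter_inputs_sketch($fakeGet);

echo $clean['q'], "\n";                  // café
echo bin2hex($clean['tags'][1]), "\n";   // c380c381
```

After this pass every string value is valid UTF-8, which is why a later isUtf8-style check on filtered input has nothing left to reject.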
I was just about to say I'm using ::filterRequestInputs(). I love that in the test file you can manually construct the $_GET variable and the ::filterRequestInputs() method will still filter it.
Thanks for the help :)
This is where someone tells me I'm doing this completely wrong, but given the following code
I'd like the i18n word to be preserved. Instead the output is Iñtërnâtiônà lizætiøn ü¡¡¡¡¡. I'd like the output to be more like Iñtërnâtiônàlizætiøn.