Closed cavac closed 6 years ago
Can you give an example?
Yes i could, but it's convoluted and not easy to set up, because it's integrated into my framework.
I just quick-hacked an untested patch to Frame.pm. It should default to the existing behaviour for text, unless you set "disable_utf8 => 1" on new(). On binary packets, it skips UTF-8 encoding at all.
I think the problem i had a few months back (just refactoring my old code) is that Encode::is_utf8() sometimes returns true even when Encode::encode() fails. Also, Encode::decode('UTF-8', $bytes) WILL fail if $bytes aren't UTF-8.
As said, i'm using this (among other things) to exchange binary packets with https://kanaka.github.io/noVNC/. I'm only using Websocket::Frame, but have my own implementation of a webserver.
Github is probably butchering my copy/paste of the diff, so you can also download it from https://www.cavac.at/admin/pimenu/download/websocket_frame_utf8.patch
--- Frame.pm_orig 2015-02-20 19:10:14.846844164 +0100 +++ Frame.pm 2015-02-20 19:27:06.815862247 +0100 @@ -37,16 +37,26 @@
$buffer = '' unless defined $buffer;
}
$self->{version} ||= 'draft-ietf-hybi-17';
@@ -81,7 +91,11 @@ my $bytes = $self->next_bytes; return unless defined $bytes;
return $bytes; }
sub fin {
If you have binary data, is_utf8
should return false. To skip decoding use next_bytes
.
Ah, ok. But the POD says that next_bytes does "Return the next message as a UTF-8 encoded string."
I can see in the source that it obviously doesn't do that (next() does the encoding).
Hm, that's an error! I will fix that :)
What i meant is, the POD should say next_bytes() returns whatever data is in the queue in it's original form (sort of like binmode() on sockets), whereas next() does UTF-8 decoding.
On a side note: Other than misunderstanding the UTF-8 issue, i'm very happy with your module and use it quite a lot in my projects. And yes, it's one of those "mission critical" web applications at work.
Thanks for your effort!
I have also plans to split ::Frame
into two classes: one for frame reading with next_frame
method and another one just for building. Now it's a mess really.
You can contact me via email (see my profile). I'm a willing beta tester and early adopter.
And believe me, nothing get's bugs fixed faster than a hundred complaining coworkers ;-)
You can't rely on is_utf8
to be false on binary data. There are real happens-in-actual-life examples where will be true. Relying on is_utf8
as anything other than "how is Perl representing this internally?" is a real bug.
Personally, I suggest you document that text frames expect character strings, and only and always encode those. The current behavior is going to cause problems at some point.
I concur. The behaviour really ought to be that the data for any text
frame is encoded and any other frame is not. This can easily be achieved by
$self->{buffer} = $self->is_text ? Encode::encode_utf8($buffer) : $buffer;
This will work provided it's done after the opcode
field is set.
Shall I submit a PR to implement this?
That could work.
With #52 merged, I believe this issue is probably safe to close now?
Released v0.22.
So this broke AnyEvent, and some other things, so it was rolled back. But it looks like another small patch needed to go with this (from @leonerd, on https://github.com/plicease/AnyEvent-WebSocket-Client/issues/33)
Can we get those two pulled in together and have a trial release to fix this encoding issue once and for all?
@vti ?
Thanks
There are two problems with automatic encoding/decoding UTF-8
First of all, this doesn't make any sense on BINARY packets, so these should never handled as possible UTF-8 encoded strings (i failed with exactly this kind of error when trying to implement a proxy for noVNC).
Secondly, on TEXT type messages, UTF-8 could use either automatic mode (default) or off. This is because i might want to pass the messages through and handle encoding later.