Add setting to disable UTF-8 encoding in Websocket::Frame

cavac commented 9 years ago

There are two problems with automatic encoding/decoding UTF-8

First of all, this doesn't make any sense on BINARY packets, so these should never handled as possible UTF-8 encoded strings (i failed with exactly this kind of error when trying to implement a proxy for noVNC).

Secondly, on TEXT type messages, UTF-8 could use either automatic mode (default) or off. This is because i might want to pass the messages through and handle encoding later.

vti commented 9 years ago

Can you give an example?

cavac commented 9 years ago

Yes i could, but it's convoluted and not easy to set up, because it's integrated into my framework.

I just quick-hacked an untested patch to Frame.pm. It should default to the existing behaviour for text, unless you set "disable_utf8 => 1" on new(). On binary packets, it skips UTF-8 encoding at all.

I think the problem i had a few months back (just refactoring my old code) is that Encode::is_utf8() sometimes returns true even when Encode::encode() fails. Also, Encode::decode('UTF-8', $bytes) WILL fail if $bytes aren't UTF-8.

As said, i'm using this (among other things) to exchange binary packets with https://kanaka.github.io/noVNC/. I'm only using Websocket::Frame, but have my own implementation of a webserver.

Github is probably butchering my copy/paste of the diff, so you can also download it from https://www.cavac.at/admin/pimenu/download/websocket_frame_utf8.patch

--- Frame.pm_orig 2015-02-20 19:10:14.846844164 +0100 +++ Frame.pm 2015-02-20 19:27:06.815862247 +0100 @@ -37,16 +37,26 @@

 $buffer = '' unless defined $buffer;

if (Encode::is_utf8($buffer)) {
if (defined($self->{type}) && defined($TYPES{$self->{type}})) {
$self->opcode($TYPES{$self->{type}});
} else {
$self->opcode($TYPES{'text'});
} +
my $disable_utf8 = 0; # Default: UTF-8 encoding is enabled
if(defined($self->{disable_utf8})) {
$disable_utf8 = $self->{disable_utf8};
} else {
$self->{disable_utf8} = $disable_utf8;
} +
if (!$disable_utf8 && $self->{type} eq 'text' && Encode::is_utf8($buffer)) { $self->{buffer} = Encode::encode('UTF-8', $buffer); } else { $self->{buffer} = $buffer; }
if (defined($self->{type}) && defined($TYPES{$self->{type}})) {
$self->opcode($TYPES{$self->{type}});
}

$self->{version} ||= 'draft-ietf-hybi-17';

@@ -81,7 +91,11 @@ my $bytes = $self->next_bytes; return unless defined $bytes;

return Encode::decode('UTF-8', $bytes);
if(!$self->{disable_utf8} && $self->opcode eq $TYPES{'text'}) {
return Encode::decode('UTF-8', $bytes);
} +
return $bytes; }

sub fin {

vti commented 9 years ago

If you have binary data, is_utf8 should return false. To skip decoding use next_bytes.

cavac commented 9 years ago

Ah, ok. But the POD says that next_bytes does "Return the next message as a UTF-8 encoded string."

I can see in the source that it obviously doesn't do that (next() does the encoding).

vti commented 9 years ago

Hm, that's an error! I will fix that :)

cavac commented 9 years ago

What i meant is, the POD should say next_bytes() returns whatever data is in the queue in it's original form (sort of like binmode() on sockets), whereas next() does UTF-8 decoding.

cavac commented 9 years ago

On a side note: Other than misunderstanding the UTF-8 issue, i'm very happy with your module and use it quite a lot in my projects. And yes, it's one of those "mission critical" web applications at work.

Thanks for your effort!

vti commented 9 years ago

I have also plans to split ::Frame into two classes: one for frame reading with next_frame method and another one just for building. Now it's a mess really.

cavac commented 9 years ago

You can contact me via email (see my profile). I'm a willing beta tester and early adopter.

And believe me, nothing get's bugs fixed faster than a hundred complaining coworkers ;-)

rjbs commented 7 years ago

You can't rely on is_utf8 to be false on binary data. There are real happens-in-actual-life examples where will be true. Relying on is_utf8 as anything other than "how is Perl representing this internally?" is a real bug.

Personally, I suggest you document that text frames expect character strings, and only and always encode those. The current behavior is going to cause problems at some point.

leonerd commented 6 years ago

I concur. The behaviour really ought to be that the data for any text frame is encoded and any other frame is not. This can easily be achieved by

$self->{buffer} = $self->is_text ? Encode::encode_utf8($buffer) : $buffer;

This will work provided it's done after the opcode field is set.

Shall I submit a PR to implement this?

vti commented 6 years ago

That could work.

leonerd commented 6 years ago

https://github.com/vti/protocol-websocket/pull/52

leonerd commented 6 years ago

With #52 merged, I believe this issue is probably safe to close now?

vti commented 6 years ago

Released v0.22.

wolfsage commented 6 years ago

So this broke AnyEvent, and some other things, so it was rolled back. But it looks like another small patch needed to go with this (from @leonerd, on https://github.com/plicease/AnyEvent-WebSocket-Client/issues/33)

Can we get those two pulled in together and have a trial release to fix this encoding issue once and for all?

@vti ?

Thanks

vti / protocol-websocket

Add setting to disable UTF-8 encoding in Websocket::Frame #35