miyagawa / AnyEvent-Redis

Asynchronous Redis client
http://search.cpan.org/dist/AnyEvent-Redis
41 stars 16 forks source link

Update to last pull request #9

Closed otterley closed 13 years ago

otterley commented 13 years ago

Added some fixes to ensure UTF-8 data is sent properly. Also added a test case.

dgl commented 13 years ago

I'm not quite sure I like doing UTF-8 like that. It's sending without decoding (so could also be sending latin1) then calling utf8::decode, which could go wrong. (Plus p5p treat 'use bytes' as deprecated and it should be avoided really).

It feels like maybe a connection attribute that specifies if data in Redis is UTF-8 or binary may be an idea here?

otterley commented 13 years ago

The main idea is to ensure that the binary data is stored correctly regardless of the specific encoding of the variable in which it is stored. Currently the code fails when 'use utf8;' is used, so at least it's an improvement there.

I'm unaware of any way other than 'use bytes;' to get length() to return the number of bytes of a scalar whose encoding you don't know in advance. It may well be that it's officially discouraged, but it still works as of 5.12. In my view, "works every time" always beats "works only if you tell us the encoding."

I'm not 100% sold on the connection attribute idea. It might work for some, but it might also be a showstopper for those who want to store raw binary, UTF-8 and latin1 in the same database. The caller, on the other hand, should usually know the encoding of a specific value, and so I think it makes the most sense to leave it with the responsibility.

dgl commented 13 years ago

I don't really like the approach of relying on Perl's internal encoding, you can't always easily know what it is ("\x{a3}" -- is your string UTF-8 or Latin1?). If the user wants to store binary data they should encode it as bytes themselves (i.e. without the UTF-8 flag on).

I'd also argue that the majority of users don't want to be storing a mixture of character sets (and also don't want to be thinking about encodings on each call to Redis), so a flag saying "I'm using UTF-8" would be good for the general case.

(Sorry to be slightly obtuse, I know the code didn't handle this well already but it's been something I've thinking should be fixed somehow. Your read handler changes in general look fine.)

otterley commented 13 years ago

The very reason I used 'use bytes' is that it makes the encoding a non-issue - I'm using it here for the sole purpose of obtaining an accurat count of the number of bytes associated with the scalar. There's no question that the interpreter knows what that count is, irrespective of the encoding.

I agree that if users going to store a mixture of UTF-8 and binary (and whatever-encoded) data that it would probably be best for users to encode it as bytes first. And I don't see any harm in making a per-connection encoding preference available, as long we also provide a way to send and receive raw binary data. I just don't want to make it mandatory because users often have legacy issues.

dgl commented 13 years ago

IMHO using use bytes actually makes encoding an issue when it shouldn't be. You're then caring about how Perl internally handles strings.

i.e. if you do something like:

$a="\x{a3}\x{acd}";
$a =~ s/.$//;
$redis->set($a, 42)->recv;

Then do:

$redis->get("\x{a3}")->recv;

One of the strings you use as the key will have the UTF-8 flag, one won't. So relying on Perl's internal encoding is a bad thing.

What I'm suggesting is something like: encoding => "UTF-8" or encoding => "binary"

UTF-8 means AnyEvent::Redis calls decode_utf8 / encode_utf8 on the strings it sees.

binary means AnyEvent::Redis doesn't, but the behaviour is undefined if the user passes a scalar with the UTF-8 flag set.

(Default probably being UTF-8)

otterley commented 13 years ago

I just wrote a test case using your example and it worked just fine. Note that I did not 'use utf8;' here.

use strict;
use Test::More;
use t::Redis;

test_redis {
    my $r = shift;

    $r->all_cv->begin(sub { $_[0]->send });

    my $info = $r->info->recv;
    is ref $info, 'HASH';
    ok $info->{redis_version};

    my $a="\x{a3}\x{acd}";
    $a =~ s/.$//;
    $r->set($a, 42, sub { pass "SET $a" });

    $r->get($a, sub { 
            is $_[0], "42" 
    });

    $r->all_cv->end;
    $r->all_cv->recv;
};

done_testing;

So, what's the problem? If you can find a bug with the implementation, by all means give me a test case.

dgl commented 13 years ago

If I change that test to not reuse $a:

$r->get("\x{a3", ...

then it fails. That's what I'm talking about.

(Also I should point out 'use utf8' has no impact on \x escapes in your source code, that's only for writing the code in actual UTF-8.)

miyagawa commented 13 years ago

t/utf8.t seems wrong to me - it sets decoded strings (because use utf8; is in effect) on SET, but GET seems to return utf8 bytes which is decoded by utf8::decode. At least it's asymmetric.

However, it is true that in Perl there's absolutely no reliable way to tell if a given scalar variable is a bytes or a string, and calling stuff like utf8::encode or utf8::downgrade causes issues when you want to just store binary. Using "use bytes" seems wrong but utf8::length here seems like a necessary evil because there's no other way to get the byte length - assuming if you store wide Unicode characters you will definitely get "Wide character in print" warning when it's actually transmit over the socket.

otterley commented 13 years ago

Other than 'use bytes' - which we know works for all encodings - what solution do you recommend that will reliably ensure that the data will be stored as byte sequences in Redis correctly, regardless of encoding?

Is the only alternative to force the user to declare the encoding when instantiating a connection object?

dgl commented 13 years ago

I think the specifying the encoding is the only thing that will work.

(For something with a similar issue see the DBD::SQLite docs on the sqlite_unicode connection flag; Redis's "transparent" handling of data is very similar to SQLite's (lack of) type system in a way.)

miyagawa commented 13 years ago

I think there's no reliable way to just magically work. Our options are:

a) Requires users to encode all strings into byte strings first. AE::Redis treats everything as binary and does no magic. Should be the most "reliable" solution but requires user code to do encoding/decoding.

b) Handle everything as Unicode strings on perl and utf8 bytes on wire. This makes the library and user code simplest, BUT x) requires latin-1 literals be upgraded with utf8::upgrade($_) in places and y) breaks storing arbitrary binary data that is non-UTF8.

c) Like dql suggest, accept the parameter (like most DBI handler actually does) to set encoding, with the 'binary' exception, which behaves exactly the same as a).

The current code (before your patch) does a), and your patch makes somewhat in the middle of a) and b) - only to get the "right" byte length in case the given strings are decoded. I don't think that is a good solution, at least because it doesn't look consistent to me.

otterley commented 13 years ago

Alright, I'll add an encoding attribute, and encode requests and decode responses according to the value. New commit forthcoming.

otterley commented 13 years ago

If I change that test to not reuse $a:

$r->get("\x{a3", ...

then it fails. That's what I'm talking about.

Worked for me when I didn't forget the closing brace ;-)

miyagawa commented 13 years ago

I didn't look at the whole code but your last patch seems to only deal with writies with encode(). To make the behavior symmetric it should also perform decode() on reads, shouldn't it?

otterley commented 13 years ago

The decoding is performed in AnyEvent::Redis::Protocol.

miyagawa commented 13 years ago

Worked for me when I didn't forget the closing brace ;-)

It fails for me with 8e593a. http://gist.github.com/742326

(I know it's irrelevant now with your new commit but wanted to point out)

dgl commented 13 years ago

This looks good other than a few minor things, but I couldn't resist rewriting some of it ;).

I haven't yet pushed to miyagawa's repo but it's in mine.

otterley commented 13 years ago

@dgl Now that you've forked mine, how do we merge this back together?

miyagawa commented 13 years ago

Not something you have to worry about. Git doesn't care where it's forked off from.

otterley commented 13 years ago

I'm sure that's true, but I don't know how.

dgl commented 13 years ago

One way to do it is: git remote add dgl git://github.com/dgl/AnyEvent-Redis git fetch dgl git merge dgl/master

Which will bring my changes into yours.

otterley commented 13 years ago

Thanks! (new to Git, obviously)

otterley commented 13 years ago

@miyagawa, will you accept the request?