Closed: Lanzaa closed this issue 10 years ago
We had a similar issue with thrift::xs which is why I have that patch in my forked version.
On Jul 25, 2013, at 5:04 PM, "Colin B." wrote:
Hey Kjellman, I will fix this and write some test cases to prevent regression.
Essentially, perl treats things as utf8 when it should not. This breaks various decoding routines, which assume byte strings rather than character strings.
One example: a DataType with the value "00000140114cdd9e" ends up looking like "00000140114c\x{dd9e}" or "@Lݞ". Any type whose value might look like a utf8 encoded string could be affected.
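The following is a minimal sketch of that failure mode, not perlcassa's actual decoding code. It assumes the hex value above stands for eight raw bytes, and the variable names are made up for illustration. The last two bytes (0xdd 0x9e) happen to form a valid two-byte UTF-8 sequence, so decoding the buffer as utf8 silently changes its length:

```perl
use strict;
use warnings;

# The eight raw bytes behind the hex value quoted above.
my $bytes = pack("H*", "00000140114cdd9e");

# Hypothetical mistake: treating those bytes as UTF-8 encoded text.
my $chars = $bytes;
my $decoded_ok = utf8::decode($chars);

printf "byte string length: %d\n", length($bytes);
printf "decoded as utf8?    %s\n", $decoded_ok ? "yes" : "no";
printf "character length:   %d\n", length($chars);
printf "last code point:    U+%04X\n", ord(substr($chars, -1));
```

The byte string has length 8, while the decoded copy has length 7 and ends in U+075E. Any routine that expects a fixed-width 8-byte value would misread the decoded copy.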
This also affects the usage of UTF8 column types.
Hello, after updating to the latest version (0.060) I have new errors in my tests. I'm French, so I use some special characters. In my Cassandra database I have the text value "Gérard Miller, né le 3 juillet 1948 à Neuilly-sur-Seine", but perlcassa returns this: G�rard Miller, n� le 3 juillet 1948 � Neuilly-sur-Seine
Hello @magnome ,
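I'm glad you made some test cases! The UTF8 fix was difficult and likely to break things. The way perlcassa handled utf8 in the past was buggy, and buggy in such a way that it hid other bugs.
In order to debug your issue I would need some information.
- What version of Cassandra are you testing against?
- What is the schema for the data you are retrieving?
- What is the output of the following function on your data?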
sub perlcassa_diagnostic {
    my $val = shift;
    my $pcversion = perlcassa->VERSION;
    my $tversion = Thrift::XS->VERSION;
    print "[DIAG] perlcassa version: '$pcversion'\n";
    print "[DIAG] Thrift::XS version: '$tversion'\n";
    # Pass 1: the value as received (expected to be a character string)
    print "[DIAG] 1 utf8? ".(utf8::is_utf8($val)?"yes":"no")."\tvalid? ".(utf8::valid($val)?"yes":"no")."\n";
    print "[DIAG] 1 Value: '$val'\n";
    print "[DIAG] 1 Ords : ".join(":", map { ord } split //, $val)."\n";
    print "[DIAG] 1 pack : ".unpack("H* ", pack("a*", $val))."\n";
    # Pass 2: the same value after encoding it in place to UTF-8 bytes
    utf8::encode($val);
    print "[DIAG] 2 utf8? ".(utf8::is_utf8($val)?"yes":"no")."\tvalid? ".(utf8::valid($val)?"yes":"no")."\n";
    print "[DIAG] 2 Value: '$val'\n";
    print "[DIAG] 2 Ords : ".join(":", map { ord } split //, $val)."\n";
    print "[DIAG] 2 pack : ".unpack("H* ", pack("a*", $val))."\n";
}
For example:
use utf8; perlcassa_diagnostic("Gérard");
[DIAG] perlcassa version: '0.060'
[DIAG] Thrift::XS version: '1.04'
[DIAG] 1 utf8? yes valid? yes
[DIAG] 1 Value: 'G�rard'
[DIAG] 1 Ords : 71:233:114:97:114:100
[DIAG] 1 pack : 47e972617264
[DIAG] 2 utf8? no valid? yes
[DIAG] 2 Value: 'Gérard'
[DIAG] 2 Ords : 71:195:169:114:97:114:100
[DIAG] 2 pack : 47c3a972617264
With that information I will be better able to assist you with your issue. The most likely explanation is that your code does not properly support utf8. Strings in perl are either character based or byte based. Perlcassa now returns character based strings. You likely just need to make sure that you are outputting your strings in the proper encoding.
binmode(STDOUT, ":utf8");
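As a rough sketch of what "outputting in the proper encoding" can look like, the example below shows two common options. It assumes the script itself is saved as UTF-8. It uses the stricter ":encoding(UTF-8)" layer in place of the ":utf8" layer shown above, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                       # the literal below is UTF-8 in this source file
use Encode qw(encode);

my $name = "Gérard";            # character string: 6 characters

# Option 1: let the filehandle encode characters to bytes on output.
binmode(STDOUT, ":encoding(UTF-8)");
print "$name\n";

# Option 2: encode explicitly where raw bytes are needed
# (for example a socket or a file opened without an encoding layer).
my $octets = encode("UTF-8", $name);    # byte string: 7 bytes
printf "%d characters, %d bytes\n", length($name), length($octets);
```

Either approach keeps character strings inside the program and converts to bytes only at the boundary.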
Hello,
I'm using Cassandra 1.2.10.1.
I've created a simple schema for our purpose:
CREATE TABLE test(key int PRIMARY KEY,value text);
and a simple program:
use Try::Tiny;
use Test::More;
use Magus;

sub perlcassa_diagnostic {
    my $val = shift;
    my $pcversion = perlcassa->VERSION;
    my $tversion = Thrift::XS->VERSION;
    print "[DIAG] perlcassa version: '$pcversion'\n";
    print "[DIAG] Thrift::XS version: '$tversion'\n";
    print "[DIAG] 1 utf8? ".(utf8::is_utf8($val)?"yes":"no")."\tvalid? ".(utf8::valid($val)?"yes":"no")."\n";
    print "[DIAG] 1 Value: '$val'\n";
    print "[DIAG] 1 Ords : ".join(":", map { ord } split //, $val)."\n";
    print "[DIAG] 1 pack : ".unpack("H* ", pack("a*", $val))."\n";
    utf8::encode($val);
    print "[DIAG] 2 utf8? ".(utf8::is_utf8($val)?"yes":"no")."\tvalid? ".(utf8::valid($val)?"yes":"no")."\n";
    print "[DIAG] 2 Value: '$val'\n";
    print "[DIAG] 2 Ords : ".join(":", map { ord } split //, $val)."\n";
    print "[DIAG] 2 pack : ".unpack("H* ", pack("a*", $val))."\n";
}

my $instance = Magus::Instance->initialize();
my $client = $instance->warehouse_factory->adaptor;

my $key = 1;
my $result;
try {
    $client->exec("truncate test");
    $client->exec("insert into test (key, value) values (:key,:value);", {
        key => $key,
        value => 'Gérard à Nice',
    });
    $result = $client->exec("SELECT * FROM test WHERE key = :key", {key => $key});
} catch {
    Magus::Tools::Other::CassandraStackTracePrinter->print($_);
};
my $row = $result->fetchone;
my $value = $row->{value};
perlcassa_diagnostic $value;
binmode(STDOUT, ":utf8");
ok($value eq "Gérard à Nice");
is($value,"Gérard à Nice");
OUTPUT:
[DIAG] perlcassa version: '0.060'
[DIAG] Thrift::XS version: '1.04'
[DIAG] 1 utf8? yes valid? yes
[DIAG] 1 Value: 'G�rard � Nice'
[DIAG] 1 Ords : 71:233:114:97:114:100:32:224:32:78:105:99:101
[DIAG] 1 pack : 47e97261726420e0204e696365
[DIAG] 2 utf8? no valid? yes
[DIAG] 2 Value: 'Gérard à Nice'
SeqFeatureQualifierValue2.t line 47.
SeqFeatureQualifierValue2.t line 48.
[DIAG] 2 Ords : 71:195:169:114:97:114:100:32:195:160:32:78:105:99:101
[DIAG] 2 pack : 47c3a97261726420c3a0204e696365
not ok 1
not ok 2
If I use
binmode(STDOUT, ":utf8");
I get correct output for a standard print, but my tests using Test::More::is and Test::More::ok still fail.
Hey @magnome,
I like the schema, simple and to the point.
Is your source code saved as utf8? If so, add 'use utf8;' to your source. This tells perl to interpret your source code as containing utf8.
In perl, strings are basically arrays of characters (see Perl Encode, http://perldoc.perl.org/Encode.html#DESCRIPTION). If you run the following program you can see the difference. The 'é' is interpreted differently without and with 'use utf8;'. I believe adding 'use utf8;' is the correct way.
my $value;
$value = "Gérard";
print "characters: ".join(":", map { ord } split //, $value)."\n";
use utf8;
$value = "Gérard";
print "characters: ".join(":", map { ord } split //, $value)."\n";
Output:
characters: 71:195:169:114:97:114:100
characters: 71:233:114:97:114:100
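The same difference is a likely reason the Test::More checks earlier in the thread fail even with binmode set: the comparison itself is between a character string and a byte string. The sketch below is illustrative only. The variable names are made up, and the string escapes stand in for a value fetched from a text column and for a literal typed without 'use utf8;'. It also puts a UTF-8 layer on the Test::More output handles, obtained through Test::More->builder.

```perl
use strict;
use warnings;
use Test::More tests => 2;

# Send test output and failure diagnostics through a UTF-8 layer as well.
my $builder = Test::More->builder;
binmode($builder->output,         ":encoding(UTF-8)");
binmode($builder->failure_output, ":encoding(UTF-8)");

# Stand-in for a value fetched from a text column: a character string.
my $from_db = "G\x{e9}rard \x{e0} Nice";

# Without 'use utf8;', a literal typed in a UTF-8 source file is a byte string.
# Explicit escapes are used so the example does not depend on file encoding.
my $literal_bytes = "G\xc3\xa9rard \xc3\xa0 Nice";

isnt($from_db, $literal_bytes, "byte literal differs from character string");

# Decoding the literal, which is roughly what 'use utf8;' does for source text,
# turns it into the same character string.
my $literal_chars = $literal_bytes;
utf8::decode($literal_chars);
is($from_db, $literal_chars, "decoded literal matches");
```

Both tests should pass, which suggests the fix belongs in how the expected literal is written rather than in the output layer.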
Perlcassa 0.060 returns strings of characters when you get data from a text column, because the text column in CQL is specified to be a UTF8 encoded string. If you would rather work with byte strings, there is an alternative: run 'utf8::encode' on the values you receive from perlcassa.
$value = "Gérard";
utf8::decode($value); # Now value looks like it does when it is retrieved from perlcassa
print "characters: ".join(":", map { ord } split //, $value)."\n";
utf8::encode($value);
print "characters: ".join(":", map { ord } split //, $value)."\n";
Output:
characters: 71:233:114:97:114:100
characters: 71:195:169:114:97:114:100
I personally think 'use utf8;' and perlcassa returning strings of characters is the right way to do things. If you think the documentation should be improved or an option added to return the bytes from text columns, please let us know. Hopefully this helps solve your issue. Good luck!
Thanks, it's much, much clearer now. I'm quite new to Perl programming, so maybe Perl encoding is trivial stuff... But in my opinion the documentation should be improved.
Thanks again.