Upgraded strings yield wrong results.

FGasper commented 3 years ago

use strict;
use warnings;

use LMDB_File qw(:envflags :cursor_op);

use File::Temp;
my $dir = File::Temp::tempdir( CLEANUP => 1 );

my $d = "é";
my $u = "ü";
utf8::upgrade $u;

my $path = "$dir/file";

{
    my %hash;
    my $db = tie %hash, 'LMDB_File', "$dir/file", MDB_NOSUBDIR;

    $hash{$d} = $d;
    $hash{$u} = $u;
}

use Data::Dumper;
{
    my %hash;
    my $db = tie %hash, 'LMDB_File', "$dir/file", MDB_NOSUBDIR;

    print Dumper \%hash;
}

… prints:

$VAR1 = {
          'Ã¼' => 'Ã¼',
          'é' => 'é'
        };

The UTF8 option toggles between SvPV and SvPVutf8; it should actually toggle between SvPVbyte and SvPVutf8 to avoid a spurious dependency on Perl’s internal string storage logic.

hoytech commented 3 years ago

Thanks! Any chance you could send a pull request? I'm sure @salortiz would accept it.

salortiz commented 3 years ago

@FGasper,

As your code shows, with MLDB_File what you put in is what you get out, so if you need any particular encoding your code should use the standard Encodemodule. For the already complex Perl's string handling the module don't attempts to be cleaver than the user.

Only If your code use a proper UTF-8 environment (via use utf8;, use feature 'unicode_strings' and use open ':std', IO => ..., etc.) you can enable the automatic handling setting $DB->UTF8(1).

And in LMDB_File's UTF8 mode a lexical use bytes pragma can be used as a escape hatch if you known what are you doing, YMMV.

FGasper commented 3 years ago

@salortiz

My code was admittedly contrived—production code doesn’t normally call utf8::upgrade—but given Perl’s internal encoding abstraction, a Perl application doesn’t actually know “what it puts in” in terms of the PV itself.

Here are a couple more feasible ways of currently arriving at upgraded strings:

my $foo = "é"; $foo .= "\x{100}"; chop $foo
round-tripping through JSON (e.g., to duplicate a deep structure)

Of course, those won’t necessarily be true for any given Perl version, and that’s the point of the abstraction: things that aren’t the Perl interpreter shouldn’t care how Perl internally stores its strings. If two SVs eq each other in Perl, then the application should reasonably expect that those strings will present outside Perl the same way—as is the case with, e.g., writing to a filehandle as well as modules like MIME::Base64 and JSON::XS.

Encode doesn’t always (currently) behave as you’d think regarding the SV internals; for example, `decode('Latin-1') outputs upgraded strings, despite the fact that “decoding Latin-1” is basically a no-op.

Perl’s string handling isn’t all that complex; it’s just a bit buggy and weirdly documented. But the idea is that two strings are equal if they store the same code points, regardless of whether Perl stores them with the same internal encoding. SvPV violates that and is thus nearly always the wrong choice unless whatever uses it also checks SvUTF8.

Calling SvPVbyte would thus actually be less clever than calling SvPV since SvPVbyte always yields a consistent result regardless of Perl’s internals.

FGasper commented 3 years ago

@salortiz I guess you’re in the same boat as CDB_File … this problem can’t really be fixed without potentially breaking existing applications.

We can work around this problem by utf8::downgrade()ing everything before sending it in to LMDB_File. Alternatively, would you be amenable to adding a ->bytes($bool) that does the SvPVbyte behaviour? I could implement that if you want.

salortiz commented 3 years ago

@FGasper,

I think that Perl (and SvPVs in particular) violates your expectations. But that don't means that Perl nor LMDB_File are buggy.

The module assumes that the user known what its data contains.

Your affirmation that

… the fact that “decoding Latin-1” is basically a no-op.

is plain false, the result don't even have the same length:

use v5.12; # Need 'say'
use Encode;
my $decoded = decode("latin1", my $l1acute = "\x{e1}");
say bytes::length $l1acute; # say "1"
say bytes::length $decoded; # say "2"

In your original example I don't known what do you expect to be in $u after utf8::upgrade($u). But perl is clear that you has four bytes (certainly not an "ü" in any encoding), please run:

perl -E 'my $u = "ü"; utf8::upgrade($u); use bytes; say bytes::length($u)'

(You can fix your expectations adding a use utf8 telling perl that "ü" is already in UTF-8 so utf8::upgrade becomes a no-op)

As both utf8::upgrade and utf8::downgrade can potentially modify the content of the user's SV (depending of the current state of the UTF8 flag) and that is bad practice your idea of an unconditional utf8::downgrade cannot be considered, sorry.

If you think that $DB->bytes($bool) is needed, please make a PR.

FGasper commented 3 years ago

@salortiz You’re using bytes.pm; that module’s own documentation emphatically says to avoid it except for debugging purposes because it breaks Perl’s string encapsulation.

utf8::upgrade does not change how $u generally behaves in Perl. It’s the same set of code points as before the upgrade; Perl just stores those code points differently. You can verify this by eq-comparing a string and its upgraded or downgraded equivalent. You can also print or say such strings, and the upgrade/downgrade won’t make a difference:

my $foo = "é";
printf "$foo (len: %d) $/", length $foo;
utf8::upgrade($foo);
printf "$foo (len: %d) $/", length $foo;

Thus, in theory it really shouldn’t matter if a module upgrades or downgrades a passed-in scalar. That said, I did follow your precedent of avoiding changing the passed-in scalar in my PR, for what that’s worth.

salortiz / LMDB_File

Upgraded strings yield wrong results. #36