Open FGasper opened 3 years ago
Thanks! Any chance you could send a pull request? I'm sure @salortiz would accept it.
@FGasper,
As your code shows, with MLDB_File what you put in is what you get out, so if you need any particular encoding your code should use the standard Encode
module. For the already complex Perl's string handling the module don't attempts to be cleaver than the user.
Only If your code use a proper UTF-8 environment (via use utf8;
, use feature 'unicode_strings'
and use open ':std', IO => ...
, etc.) you can enable the automatic handling setting $DB->UTF8(1)
.
And in LMDB_File's UTF8 mode a lexical use bytes
pragma can be used as a escape hatch if you known what are you doing, YMMV.
@salortiz
My code was admittedly contrived—production code doesn’t normally call utf8::upgrade
—but given Perl’s internal encoding abstraction, a Perl application doesn’t actually know “what it puts in” in terms of the PV itself.
Here are a couple more feasible ways of currently arriving at upgraded strings:
my $foo = "é"; $foo .= "\x{100}"; chop $foo
Of course, those won’t necessarily be true for any given Perl version, and that’s the point of the abstraction: things that aren’t the Perl interpreter shouldn’t care how Perl internally stores its strings. If two SVs eq
each other in Perl, then the application should reasonably expect that those strings will present outside Perl the same way—as is the case with, e.g., writing to a filehandle as well as modules like MIME::Base64 and JSON::XS.
Encode
doesn’t always (currently) behave as you’d think regarding the SV internals; for example, `decode('Latin-1') outputs upgraded strings, despite the fact that “decoding Latin-1” is basically a no-op.
Perl’s string handling isn’t all that complex; it’s just a bit buggy and weirdly documented. But the idea is that two strings are equal if they store the same code points, regardless of whether Perl stores them with the same internal encoding. SvPV
violates that and is thus nearly always the wrong choice unless whatever uses it also checks SvUTF8.
Calling SvPVbyte would thus actually be less clever than calling SvPV since SvPVbyte always yields a consistent result regardless of Perl’s internals.
@salortiz I guess you’re in the same boat as CDB_File … this problem can’t really be fixed without potentially breaking existing applications.
We can work around this problem by utf8::downgrade()
ing everything before sending it in to LMDB_File. Alternatively, would you be amenable to adding a ->bytes($bool)
that does the SvPVbyte
behaviour? I could implement that if you want.
@FGasper,
I think that Perl (and SvPVs in particular) violates your expectations. But that don't means that Perl nor LMDB_File
are buggy.
The module assumes that the user known what its data contains.
Your affirmation that
… the fact that “decoding Latin-1” is basically a no-op.
is plain false, the result don't even have the same length:
use v5.12; # Need 'say'
use Encode;
my $decoded = decode("latin1", my $l1acute = "\x{e1}");
say bytes::length $l1acute; # say "1"
say bytes::length $decoded; # say "2"
In your original example I don't known what do you expect to be in $u
after utf8::upgrade($u)
.
But perl is clear that you has four bytes (certainly not an "ü" in any encoding), please run:
perl -E 'my $u = "ü"; utf8::upgrade($u); use bytes; say bytes::length($u)'
(You can fix your expectations adding a use utf8
telling perl that "ü"
is already in UTF-8 so utf8::upgrade
becomes a no-op)
As both utf8::upgrade
and utf8::downgrade
can potentially modify the content of the user's SV (depending of the current state of the UTF8 flag) and that is bad practice your idea of an unconditional utf8::downgrade
cannot be considered, sorry.
If you think that $DB->bytes($bool)
is needed, please make a PR.
@salortiz You’re using bytes.pm; that module’s own documentation emphatically says to avoid it except for debugging purposes because it breaks Perl’s string encapsulation.
utf8::upgrade
does not change how $u
generally behaves in Perl. It’s the same set of code points as before the upgrade; Perl just stores those code points differently. You can verify this by eq
-comparing a string and its upgraded or downgraded equivalent. You can also print
or say
such strings, and the upgrade/downgrade won’t make a difference:
my $foo = "é";
printf "$foo (len: %d) $/", length $foo;
utf8::upgrade($foo);
printf "$foo (len: %d) $/", length $foo;
Thus, in theory it really shouldn’t matter if a module upgrades or downgrades a passed-in scalar. That said, I did follow your precedent of avoiding changing the passed-in scalar in my PR, for what that’s worth.
… prints:
The UTF8 option toggles between SvPV and SvPVutf8; it should actually toggle between SvPVbyte and SvPVutf8 to avoid a spurious dependency on Perl’s internal string storage logic.