Closed doriantaylor closed 5 years ago
This change sounds fine to me. @karenetheridge? @genio?
Thanks, let me know if there's consensus and I'll do up a patch.
I would be worried some people are using ->canonical
to update their object in place, which this would break.
While I agree that ->canonical
should always return a clone, I have to second @karenetheridge in that we might break a lot of things by changing this behavior.
If you can do some clever searching on https://grep.cpan.me or https://grep.metacpan.org to find instances where people are using ->canonical
to ensure nothing would break, or better yet, smoke test all of the reverse dependencies, that would go a long way towards quelling our fear of breaking user-land.
If you look at the code it either does nothing (returning $self
) or returns a clone:
sub canonical
{
# Make sure scheme is lowercased, that we don't escape unreserved chars,
# and that we use upcase escape sequences.
my $self = shift;
my $scheme = $self->_scheme || "";
my $uc_scheme = $scheme =~ /[A-Z]/;
my $esc = $$self =~ /%[a-fA-F0-9]{2}/;
return $self unless $uc_scheme || $esc;
my $other = $self->clone;
if ($uc_scheme) {
$other->_scheme(lc $scheme);
}
if ($esc) {
$$other =~ s{%([0-9a-fA-F]{2})}
{ my $a = chr(hex($1));
$a =~ /^[$unreserved]\z/o ? $a : "%\U$1"
}ge;
}
return $other;
}
In other words, there is no in-place updating. If you called it in void context it would do nothing to the object, no matter its contents. The scenario in which a person is depending on the return value not to be cloned doesn't make very much sense:
my $uri = URI->new($MAYBE_ALREADY_CANONICAL);
my $canon = $uri->canonical;
# why is it important that $canon == $uri (cf perldoc perlref),
# but only if the URI was already in canonical form?
It is also worth noting that the subclasses that implement their own canonical
follow this pattern (in this case from URI::_server):
sub canonical
{
my $self = shift;
my $other = $self->SUPER::canonical;
my $host = $other->host || "";
my $port = $other->_port;
my $uc_host = $host =~ /[A-Z]/;
my $def_port = defined($port) && ($port eq "" ||
$port == $self->default_port);
if ($uc_host || $def_port) {
$other = $other->clone if $other == $self;
$other->host(lc $host) if $uc_host;
$other->port(undef) if $def_port;
}
$other;
}
If you round up all the subclassed canonical
methods, they all immediately open with my $other = $self->SUPER::canonical
and feature the same idiom like $other = $other->clone if $other == $self
in the part that actually does any work. It is hard to look at this construct in the large and not consider it to be more than an optimization.
Anyway, my proposal (in URI.pm) goes something like this:
sub canonical
{
# Make sure scheme is lowercased, that we don't escape unreserved chars,
# and that we use upcase escape sequences.
my $other = $_[0]->clone;
my $scheme = $other->_scheme || "";
my $uc_scheme = $scheme =~ /[A-Z]/;
my $esc = $$other =~ /%[a-fA-F0-9]{2}/;
return $other unless $uc_scheme || $esc;
if ($uc_scheme) {
$other->_scheme(lc $scheme);
}
if ($esc) {
$$other =~ s{%([0-9a-fA-F]{2})}
{ my $a = chr(hex($1));
$a =~ /^[$unreserved]\z/o ? $a : "%\U$1"
}ge;
}
return $other;
}
Maybe @gisle had a more nuanced reason for doing it this way?
From perldoc URI
:
For efficiency reasons, if the
$uri
is already in normalized form, then a reference to it is returned instead of a copy.
If you look at the code it either does nothing (returning $self) or returns a clone:
Well in that case, my concerns were for naught :)
I'd be interested in hearing from others about what efficiencies we might be losing (as gisle referred to). One I can think of is it would encourage more memory pages to be copied (via CoW) in a forked environment, which might be a big deal.
Personally, I think this efficiency is hardly worth the inconsistency given all the other inefficiencies of URI, and that clone is ridiculously simple. Benchmarks might be useful though.
Indeed:
sub clone
{
my $self = shift;
my $other = $$self;
bless \$other, ref $self;
}
My guess is the internal Perl subroutine overhead is probably the most expensive part of this method.
The optimization argument also gets a little iffy when what you want is a cloned object: you have to call $uri->clone->canonical
(or, I suppose, $uri->canonical->clone
), which will invoke clone
(up to) twice. I suppose you could test the output but frankly that would just be more code and more overhead than just running clone
once unconditionally.
Again I'm not sure what kind of context hinges on the return value not being a clone, unless you're doing some kind of post-hoc test to determine whether the URI was canonicalized; i.e. it would be a teensy bit faster to check the numeric equality of the before and after reference addresses than it would be to test the respective contents for string equality. Indeed, this is how the subclassed canonical
methods determine whether they should clone the reference (moreover, they wouldn't have to care if the reference was cloned unconditionally by SUPER::canonical
and those subclassed methods could be simplified accordingly).
So the use case for not unconditionally cloning the return value of canonical
is tenuous, and the efficiency gain is suspect.
It sounds like we can go ahead with this change. Could we get an accompanying test and some basic benchmarks?
So it turns out that the no-op scenario is slower, by about 7%. This should not surprise anybody:
# stock URI 1.74
$ perl ~/experimental/uri-canon-bench.pl
timethis 100000: 3 wallclock secs ( 2.54 usr + 0.00 sys = 2.54 CPU) @ 39370.08/s (n=100000)
# with changes
$ perl -Ilib ~/experimental/uri-canon-bench.pl
timethis 100000: 3 wallclock secs ( 2.73 usr + 0.00 sys = 2.73 CPU) @ 36630.04/s (n=100000)
However this goes away the moment there is something to do:
# again stock
$ perl ~/experimental/uri-canon-bench.pl
timethis 100000: 3 wallclock secs ( 3.33 usr + 0.00 sys = 3.33 CPU) @ 30030.03/s (n=100000)
# again changes
$ perl -Ilib ~/experimental/uri-canon-bench.pl
timethis 100000: 3 wallclock secs ( 3.22 usr + 0.00 sys = 3.22 CPU) @ 31055.90/s (n=100000)
And of course if you add a prophylactic clone
into the mix, any efficiency gains go away completely:
# stock; $uri->clone->canonical
$ perl ~/experimental/uri-canon-bench.pl
timethis 100000: 4 wallclock secs ( 3.56 usr + 0.00 sys = 3.56 CPU) @ 28089.89/s (n=100000)
# with changes, $uri->canonical (no ->clone because we don't need it anymore)
$ perl -Ilib ~/experimental/uri-canon-bench.pl
timethis 100000: 3 wallclock secs ( 3.27 usr + 0.00 sys = 3.27 CPU) @ 30581.04/s (n=100000)
In case you're wondering, the benchmark looks like this, with me manually fiddling around the various details between runs:
#!perl
use strict;
use warnings FATAL => 'all';
use Benchmark;
use URI;
# made-up non-canonical /path to trigger code in URI::canonical but not subclasses
my $uri = URI->new('https://domain.com/%70ath?key=value#fragment');
# if you really wanna slow it down, put the port number in
# canonical no-op URI
my $canon = $uri->canonical;
Benchmark::timethis(100000, sub { $uri->canonical });
I came across this recently: I had assumed that
canonical
always returned a clone, but it returns$self
if it determines in the preamble that there is nothing to do. Arguments for makingcanonical
unconditionally return a clone:$uri->canonical
is shorter than typing$uri->clone->canonical
The only counterargument I can think of is slightly more overhead in the no-op scenario, and the change of course is one line.
Thoughts?