libwww-perl / URI

The Perl URI module
https://metacpan.org/pod/URI
Other
57 stars 48 forks source link

make `canonical` always return a clone #57

Closed doriantaylor closed 5 years ago

doriantaylor commented 5 years ago

I came across this recently: I had assumed that canonical always returned a clone, but it returns $self if it determines in the preamble that there is nothing to do. Arguments for making canonical unconditionally return a clone:

The only counterargument I can think of is slightly more overhead in the no-op scenario, and the change of course is one line.

Thoughts?

oalders commented 5 years ago

This change sounds fine to me. @karenetheridge? @genio?

doriantaylor commented 5 years ago

Thanks, let me know if there's consensus and I'll do up a patch.

karenetheridge commented 5 years ago

I would be worried some people are using ->canonical to update their object in place, which this would break.

genio commented 5 years ago

While I agree that ->canonical should always return a clone, I have to second @karenetheridge in that we might break a lot of things by changing this behavior.

If you can do some clever searching on https://grep.cpan.me or https://grep.metacpan.org to find instances where people are using ->canonical to ensure nothing would break, or better yet, smoke test all of the reverse dependencies, that would go a long way towards quelling our fear of breaking user-land.

doriantaylor commented 5 years ago

If you look at the code it either does nothing (returning $self) or returns a clone:

sub canonical
{
    # Make sure scheme is lowercased, that we don't escape unreserved chars,
    # and that we use upcase escape sequences.

    my $self = shift;
    my $scheme = $self->_scheme || "";
    my $uc_scheme = $scheme =~ /[A-Z]/;
    my $esc = $$self =~ /%[a-fA-F0-9]{2}/;
    return $self unless $uc_scheme || $esc;

    my $other = $self->clone;
    if ($uc_scheme) {
    $other->_scheme(lc $scheme);
    }
    if ($esc) {
    $$other =~ s{%([0-9a-fA-F]{2})}
                { my $a = chr(hex($1));
                      $a =~ /^[$unreserved]\z/o ? $a : "%\U$1"
                    }ge;
    }
    return $other;
}

In other words, there is no in-place updating. If you called it in void context it would do nothing to the object, no matter its contents. The scenario in which a person is depending on the return value not to be cloned doesn't make very much sense:

my $uri = URI->new($MAYBE_ALREADY_CANONICAL);
my $canon = $uri->canonical;

# why is it important that $canon == $uri (cf perldoc perlref),
# but only if the URI was already in canonical form?

It is also worth noting that the subclasses that implement their own canonical follow this pattern (in this case from URI::_server):

sub canonical
{
    my $self = shift;
    my $other = $self->SUPER::canonical;
    my $host = $other->host || "";
    my $port = $other->_port;
    my $uc_host = $host =~ /[A-Z]/;
    my $def_port = defined($port) && ($port eq "" ||
                                      $port == $self->default_port);
    if ($uc_host || $def_port) {
    $other = $other->clone if $other == $self;
    $other->host(lc $host) if $uc_host;
    $other->port(undef)    if $def_port;
    }
    $other;
}

If you round up all the subclassed canonical methods, they all immediately open with my $other = $self->SUPER::canonical and feature the same idiom like $other = $other->clone if $other == $self in the part that actually does any work. It is hard to look at this construct in the large and not consider it to be more than an optimization.

Anyway, my proposal (in URI.pm) goes something like this:

sub canonical
{
    # Make sure scheme is lowercased, that we don't escape unreserved chars,
    # and that we use upcase escape sequences.

    my $other = $_[0]->clone;
    my $scheme = $other->_scheme || "";
    my $uc_scheme = $scheme =~ /[A-Z]/;
    my $esc = $$other =~ /%[a-fA-F0-9]{2}/;
    return $other unless $uc_scheme || $esc;

    if ($uc_scheme) {
    $other->_scheme(lc $scheme);
    }
    if ($esc) {
    $$other =~ s{%([0-9a-fA-F]{2})}
                { my $a = chr(hex($1));
                      $a =~ /^[$unreserved]\z/o ? $a : "%\U$1"
                    }ge;
    }
    return $other;
}

Maybe @gisle had a more nuanced reason for doing it this way?

Edit

From perldoc URI:

For efficiency reasons, if the $uri is already in normalized form, then a reference to it is returned instead of a copy.

karenetheridge commented 5 years ago

If you look at the code it either does nothing (returning $self) or returns a clone:

Well in that case, my concerns were for naught :)

I'd be interested in hearing from others about what efficiencies we might be losing (as gisle referred to). One I can think of is it would encourage more memory pages to be copied (via CoW) in a forked environment, which might be a big deal.

Grinnz commented 5 years ago

Personally, I think this efficiency is hardly worth the inconsistency given all the other inefficiencies of URI, and that clone is ridiculously simple. Benchmarks might be useful though.

doriantaylor commented 5 years ago

Indeed:

sub clone
{
    my $self = shift;
    my $other = $$self;
    bless \$other, ref $self;
}

My guess is the internal Perl subroutine overhead is probably the most expensive part of this method.

The optimization argument also gets a little iffy when what you want is a cloned object: you have to call $uri->clone->canonical (or, I suppose, $uri->canonical->clone), which will invoke clone (up to) twice. I suppose you could test the output but frankly that would just be more code and more overhead than just running clone once unconditionally.

Again I'm not sure what kind of context hinges on the return value not being a clone, unless you're doing some kind of post-hoc test to determine whether the URI was canonicalized; i.e. it would be a teensy bit faster to check the numeric equality of the before and after reference addresses than it would be to test the respective contents for string equality. Indeed, this is how the subclassed canonical methods determine whether they should clone the reference (moreover, they wouldn't have to care if the reference was cloned unconditionally by SUPER::canonical and those subclassed methods could be simplified accordingly).

So the use case for not unconditionally cloning the return value of canonical is tenuous, and the efficiency gain is suspect.

oalders commented 5 years ago

It sounds like we can go ahead with this change. Could we get an accompanying test and some basic benchmarks?

doriantaylor commented 5 years ago

So it turns out that the no-op scenario is slower, by about 7%. This should not surprise anybody:

# stock URI 1.74
$ perl ~/experimental/uri-canon-bench.pl 
timethis 100000:  3 wallclock secs ( 2.54 usr +  0.00 sys =  2.54 CPU) @ 39370.08/s (n=100000)

# with changes
$ perl -Ilib ~/experimental/uri-canon-bench.pl 
timethis 100000:  3 wallclock secs ( 2.73 usr +  0.00 sys =  2.73 CPU) @ 36630.04/s (n=100000)

However this goes away the moment there is something to do:

# again stock
$ perl ~/experimental/uri-canon-bench.pl 
timethis 100000:  3 wallclock secs ( 3.33 usr +  0.00 sys =  3.33 CPU) @ 30030.03/s (n=100000)

# again changes
$ perl -Ilib ~/experimental/uri-canon-bench.pl 
timethis 100000:  3 wallclock secs ( 3.22 usr +  0.00 sys =  3.22 CPU) @ 31055.90/s (n=100000)

And of course if you add a prophylactic clone into the mix, any efficiency gains go away completely:

# stock; $uri->clone->canonical
$ perl ~/experimental/uri-canon-bench.pl 
timethis 100000:  4 wallclock secs ( 3.56 usr +  0.00 sys =  3.56 CPU) @ 28089.89/s (n=100000)

# with changes, $uri->canonical (no ->clone because we don't need it anymore)
$ perl -Ilib ~/experimental/uri-canon-bench.pl 
timethis 100000:  3 wallclock secs ( 3.27 usr +  0.00 sys =  3.27 CPU) @ 30581.04/s (n=100000)

In case you're wondering, the benchmark looks like this, with me manually fiddling around the various details between runs:

#!perl

use strict;
use warnings FATAL => 'all';

use Benchmark;
use URI;

# made-up non-canonical /path to trigger code in URI::canonical but not subclasses
my $uri = URI->new('https://domain.com/%70ath?key=value#fragment');

# if you really wanna slow it down, put the port number in

# canonical no-op URI
my $canon = $uri->canonical;

Benchmark::timethis(100000, sub { $uri->canonical });