libwww-perl / HTML-Parser

The HTML-Parser distribution is a collection of modules that parse and extract information from HTML documents.

Potential speedups for HTML::Entities::encode_entities #30

Open petdance opened 2 years ago

petdance commented 2 years ago

I think we could get some speedups in encode_entities by caching some common operations. The examples below would most benefit users encoding lots of data that is heavy on non-named, numeric entities. We may have similar benefits elsewhere.

The first is to cache the results of the sprintf in num_entity. This could be done with no effect on behavior, in exchange for some hash entries. Here are my experiments.

use Benchmark ':all';
use HTML::Entities;

timethese( 300_000,
    {
        original => sub {
            map { HTML::Entities::num_entity($_) } 1..100
        },
        cached => sub {
            map { cached_num_entity($_) } 1..100
        },
    }
);

my %cache;
sub cached_num_entity {
    return $cache{$_[0]} ||= sprintf("&#x%X;", ord($_[0]));
}

gives these results:

    cached: 10 wallclock secs (10.52 usr +  0.00 sys = 10.52 CPU) @ 28517.11/s (n=300000)
  original: 15 wallclock secs (14.28 usr +  0.00 sys = 14.28 CPU) @ 21008.40/s (n=300000)

Bottom line: Hash lookup is faster than the sprintf, so let's cache it.


The other tweak would be to cache the call to num_entity inside the main regex in encode_entities. Swap this:

$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;

for this:

$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} ||= num_entity($1)/ge;

This would have the side effect of modifying the %char2entity hash, which is visible to the outside world. If that wasn't OK, we could have a private copy of the hash specifically so it would be modifiable. The potential downside (or upside?) of that would be that if someone outside the module modified %char2entity, it would have no effect on encode_entities.
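The private-copy approach could be sketched roughly as below. This is a self-contained illustration of the pattern, not the actual HTML::Entities code: the `%char2entity` map here is a tiny stand-in, `num_entity` is reimplemented inline, and `encode_sketch` is a hypothetical name.

```perl
use strict;
use warnings;

# Stand-in for the real %HTML::Entities::char2entity map.
my %char2entity = ( '"' => '&quot;', '&' => '&amp;' );

# Stand-in for HTML::Entities::num_entity.
sub num_entity { sprintf '&#x%X;', ord $_[0] }

# Snapshot the public map into a lexical cache. The substitution fills
# in numeric entities as each character is first seen, mutating only
# this private copy -- %char2entity itself is untouched.
my %private = %char2entity;

sub encode_sketch {
    my $text = shift;
    $text =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$private{$1} ||= num_entity($1)/ge;
    return $text;
}

print encode_sketch(q{"a" & b}), "\n";   # &quot;a&quot; &amp; b
```

The trade-off is exactly the one noted above: once the snapshot is taken, later changes to the public `%char2entity` would no longer be visible to the encoder.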

For benchmarking encode_entities, I used this:

my $ent = chr(8855);
my $num = chr(8854);
my $unencoded = "$ent$num" x 10;

my $text = <<"HTML";
text in "$unencoded"
HTML

timethese( 1_000_000,
    {
        encode => sub { my $x = encode_entities($text) },
    }
);

Results:

42,281/s for the original unmodified encode_entities.

52,746/s if the encode_entities used the caching num_entity first mentioned, but the main regex is unchanged.

64,769/s if the main conversion regex caches the results of calls to num_entity in %char2entity. Changing this to call the caching num_entity gave no noticeable improvement.


I hope these give some ideas. encode_entities is an absolute workhorse at my job (we generate everything with Template Toolkit), and I'm sure it is for many, many others. Any speedup would have wide-ranging benefits.

oalders commented 2 years ago

How do you envision the cache? An in-memory LRU or is this a hash that can grow quite large? Have you tried seeing what kind of speed gains can be had via Memoize?

petdance commented 2 years ago

Just a hash that could theoretically, though not likely, grow quite large, as I showed in the code above.

I didn't bother with Memoize because I figured we wouldn't want to add another dependency, but now I see that Memoize has been core since 5.7.3, so that's not an issue.
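For reference, the Memoize route would look something like this. A local stand-in for num_entity is used so the snippet is self-contained; wrapping the real HTML::Entities::num_entity would be the same one-line memoize call.

```perl
use strict;
use warnings;
use Memoize;   # in core since 5.7.3, so no new dependency

# Stand-in for HTML::Entities::num_entity.
sub num_entity { sprintf '&#x%X;', ord $_[0] }

# Replace the sub with a caching wrapper; repeat calls with the same
# character become a hash lookup.
memoize('num_entity');

print num_entity('A'), "\n";   # &#x41;
print num_entity('A'), "\n";   # second call served from the cache
```

The wrapper call itself has some overhead, though, so it would be worth benchmarking against the bare `||=` hash cache shown earlier before picking one.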