jeremykendall / php-domain-parser

Public Suffix List based domain parsing implemented in PHP
MIT License
1.16k stars 128 forks source link

Rules::parse() is slow, here is a possible solution #315

Closed langemeijer closed 3 years ago

langemeijer commented 3 years ago

Rules::parse() is a slow beast. On my fairly recent development machine this method takes little less than 0.1 second.

It's sad that we have to parse the rules every time again, while on many systems the public_suffix_list.dat file will probably not change too much.

I run a modified Rules class that has these two methods:


    public function writeCacheFile($fileName): bool
    {
        $result = file_put_contents($fileName, '<?php return ' . var_export($this->rules, true) . ';') !== false;

        if ($result && function_exists('opcache_invalidate')) {
            opcache_invalidate($fileName);
        }

        return $result;
    }

    public static function fromCacheFile($fileName): self
    {
        return new self(require($fileName));
    }

Measured on my dev laptop Rules::fromPath() took 0.0968s, subsequent ->writeCacheFile() file took 0.0026s

The first Rules::fromCacheFile() on the cli (no opcache) took 0.0060s subsequent calls took 0.0034s

The speed gain here is mostly the avoided regexp parsing of the the entire .dat file.

When OPcache is enabled in the apache2 or fpm sapi the gain is even bigger. Not only CPU cycles are spared, but also memory. Because the strings are written into a PHP file OPcache can keep all string values in the cache file in the Interned Strings Buffer, which is in shared memory allocated by the process manager. They are now string constants and no memory has to be allocated, just pointers made, thus avoiding the need to strdup() the data all over the place.

nyamsprod commented 3 years ago

@langemeijer thanks for using this package. Indeed Rules::parse is the bottleneck and your solution seems to work for your use case. However, as stated in the documentation, you should not call this method on every usage of the package. You should instead cache the Rules instance all together for a period of time. The recommend period depends on your use case. In the documentation I use PSR interfaces but they are not mandatory, you can use the mechanism you are more comfortable with.

langemeijer commented 3 years ago

Someone pointed out to me that you've implemented __set_state(), probably to get the other caching thing working through serialization.

This means I can do this from the outside, instead of in the class itself:

    function writeCacheFile(Rules $rules): bool
    {
        $result = file_put_contents($fileName, '<?php return ' . var_export($rules, true) . ';') !== false;

        if ($result && function_exists('opcache_invalidate')) {
            opcache_invalidate($fileName);
        }

        return $result;
    }

    function fromCacheFile($fileName): Rules
    {
        return require($fileName);
    }

It still puzzles me why you would 'seal' a constructor and declare the property private, from something that essentially is a container of rules to work with. And then open it up completely through __set_state().

You have created a monolith class that parses the database, is a container for the rules and does the solving based on it's internal data, keeping everything secret, except: It doesn't because it needs to be cached to work.

In my opinion ideally you would have a Parser class that parses, a Rules class that holds the parsed data and a Resolver class that can resolve a domain name based on the information in the Rules class.

nyamsprod commented 3 years ago

This means I can do this from the outside, instead of in the class itself:

Indeed the whole goal is to have the caching done on your application based and not on the package itself.

Your solution could very well implement the following interface present in the package ​Pdp\Storage\PublicSuffixListCache.

Because different applications require different caching mechanisms. The Rules class exposes methods to allow just that. Again your solution is not better or worse than the PSR one provided by the package the end goal is and remain you should cache your Rules instance.

In my opinion ideally you would have a Parser class that parses, a Rules class that holds the parsed data and a Resolver class that can resolve a domain name based on the information in the Rules class.

Correct in previous versions there was a Parser, and a Rules/Resolver class. The issue was that the parser in my view is an implementation details (ie: it is bound/coupled to the Rules/Resolver) if you change your parser, you will also change your Rules/Resolver so having them separated was not that useful even from a testing POV.

nyamsprod commented 3 years ago

Here's a quick and dirty implementation.

https://gist.github.com/nyamsprod/e5498e3547e61c4bb698b787c400ddb0

As you can see I have use:

So everything should be fine at least from my POV. You do not have to modify anything to any class and you avoid having to use constantly the Rules::parse method and calling external files which is not even recommanded by Mozilla.

Hopefully this explain better why no storage solution is given out of the box because you can virtually implement dozen of solution depending on your context.