john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License
1.81k stars 211 forks source link

Performance Issues after Upgrade from 3.4.1 to 3.4.4 #297

Closed koromkorom closed 1 year ago

koromkorom commented 1 year ago

Hi,

I am very sure that this is 100% our fault, but I just wanted to also make sure, that this is not also effecting people who are actually using this package in the right way.

We had some performance issues and after a long investigation involving pip freeze we realized that it came from an TLDExtract upgrade from 3.4.1 to 3.4.4.

We were using tldextract like this in a function:

def get_domain(url: str):
    extractor = tldextract.TLDExtract(
        cache_dir=None, suffix_list_urls=[], fallback_to_snapshot=True
    )
    ext = extractor(url)
    return "{}.{}".format(ext.domain, ext.suffix)

We have quite a few calls to this function per second and for some reason this slowed down significantly after a recent upgrade from 3.4.1 to 3.4.4.

I tried to do some profiling:

I left this running for roughly the same amount of time, but it is unlikely that the method above was called the same amount of time, because this was slowing down the whole program.

3.4.1 (no issues): image

3.4.4 (issue): image

As a fix I just took the callable constructor out of that function and now the performance is way better then before even:

extractor = tldextract.TLDExtract(
        cache_dir=None, suffix_list_urls=[], fallback_to_snapshot=True
    )
def get_domain(url: str):
    ext = extractor(url)
    return "{}.{}".format(ext.domain, ext.suffix)

I realize this was probably just bad practice from our side, but as said before I just wanted to make sure there is really nothing of interest here. Feel free to close this.

elliotwutingfeng commented 1 year ago

Hi,

The slowdown is likely due to the constructor initializing suffix Tries. This was implemented in v3.4.3 to speed up the extraction.

        self.tlds_incl_private = frozenset(public_tlds + private_tlds + extra_tlds)
        self.tlds_excl_private = frozenset(public_tlds + extra_tlds)
        self.tlds_incl_private_trie = Trie.create(self.tlds_incl_private)
        self.tlds_excl_private_trie = Trie.create(self.tlds_excl_private)

Your solution is correct since you are always using the same fallback snapshot, hence calling the constructor once and reusing the extractor is recommended practice.


A sidenote on the following line

return "{}.{}".format(ext.domain, ext.suffix)

get_domain() would return nonexistenttld. for example.nonexistenttld (example parsed as subdomain, nonexistenttld parsed as domain) and .co.uk for co.uk (blank string passed as domain)

Would suggest using ext.registered_domain instead.

koromkorom commented 1 year ago

OK thank you! That is all very good to know. And thank you also for your feedback!