john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Cache is fundamentally slower than HTTP #320

Closed pmitcheli1 closed 4 months ago

pmitcheli1 commented 4 months ago

Need some advice here on whether this is a bug, or whether I'm missing something in the documentation.

I was doing some testing and want to deploy behind a firewall, so I tried the "cache" route, but it is roughly 50x slower than the non-cache version (1,000 domains: 0.6 sec non-cache vs. 31.3 sec cache).

Is this a known bug, or am I missing something? I also tried specifying the cache file directly, with the same slow speed. I would expect the cache to be a lot faster since it is local, with ample RAM, CPU, and an NVMe SSD.

# extract callable that falls back to the included TLD snapshot, no live HTTP fetching
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=())
no_fetch_extract('http://www.google.com')

john-kurkowski commented 4 months ago

Hi there! Can you share both code samples showing how you're benchmarking the 1,000 domains? That might give me some ideas.

pmitcheli1 commented 4 months ago

Hi John! I read data from a database, pass the URLs through a for loop to a process_url function, and write the results back to the DB (I commented out all other functions to pinpoint the slowness). To double-check, I tried reading 2,000 domains from a TXT file and writing to CSV with the same function, and got 0.04 sec vs. 21.1 sec (cache).

The only thing different between the two runs should be that I want to use the cache, but I might be missing something simple, as a noob.

Code 1: Runtime: 5000 domains in 2.787315 seconds

def process_url(url):
    extracted = tldextract.extract(url)
    full_domain = f"{extracted.domain}.{extracted.suffix}"
    domain = extracted.domain
    return full_domain, domain

Code 2: Runtime: 5000 domains in 115.430461 seconds

def process_url(url):
    extracted = tldextract.TLDExtract(suffix_list_urls=())
    full_domain = f"{extracted(url).domain}.{extracted(url).suffix}"
    domain = extracted(url).domain
    return full_domain, domain

john-kurkowski commented 4 months ago

The function in Code 1 uses tldextract.extract, a global TLDExtract instance with the default settings, and it processes each string only 1x.

The function in Code 2 constructs a new TLDExtract instance every time it is called, then processes the string 3x (3 calls to extracted(…)).

I suggest constructing TLDExtract instances closer to 1x per Python process, not inside the loop, and reducing the extraction calls from 3x to 1x per input string; see the sketch below.
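
For example, a minimal sketch of that shape, keeping your offline suffix_list_urls=() configuration (the instance name is illustrative):

import tldextract

# Construct one instance per process, at module level, not once per URL.
# suffix_list_urls=() keeps it offline, using the bundled suffix snapshot.
extractor = tldextract.TLDExtract(suffix_list_urls=())

def process_url(url):
    extracted = extractor(url)  # 1 extraction call per input string
    full_domain = f"{extracted.domain}.{extracted.suffix}"
    return full_domain, extracted.domain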

Regarding caching, we have a terminology mismatch. Code 1 downloads the newest public suffix definitions 1x per Python process. Code 2 does not; it uses the definitions bundled with the version of the tldextract library you installed. In other words, Code 1 does a little network I/O (independent of the number of input strings) and Code 2 does none. There isn't caching being toggled in your examples per se; at least, not caching that would scale with your input strings. A sketch of the distinction follows below.
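
Side by side, the two configurations look like this (the URL is just an example):

import tldextract

# Code 1 style: the default instance fetches the latest public suffix list
# over HTTP when needed, then caches it on disk for reuse.
live_extract = tldextract.TLDExtract()

# Code 2 style: never fetches; falls back to the suffix snapshot bundled
# with the installed tldextract version.
offline_extract = tldextract.TLDExtract(suffix_list_urls=())

print(live_extract('http://forums.bbc.co.uk').registered_domain)     # 'bbc.co.uk'
print(offline_extract('http://forums.bbc.co.uk').registered_domain)  # 'bbc.co.uk'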

I'm curious about the loop, which you don't show. If my advice so far doesn't close the gap, let me know about the loop.

pmitcheli1 commented 4 months ago

Thank you for the (un)intentional hint! When I read "global", it hit me... I was constructing TLDExtract incorrectly. I moved it to the top of my code, outside all my functions, and everything works like a charm. The bug was between the keyboard and the chair. Thanks!

john-kurkowski commented 4 months ago

You're welcome!