john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

"update" caches public suffix list to wrong directory #257

Closed nicholas-plutoflume closed 2 years ago

nicholas-plutoflume commented 2 years ago

Hi!

First off, I love tldextract, thanks for building it!

The way we use tldextract is slightly unusual, but it used to be fully supported by the public API. Our Docker containers don't have internet access, so when we build them, we cache the latest public suffix list. When our applications use tldextract, we configure it to use that cache, so it never needs an internet connection. After upgrading to any 3.* version of tldextract, I noticed that the cache was no longer being used to look up information from the public suffix list.

Problem reproduction steps

First, run the command:

tldextract --update --private_domains

Then create a basic test file:

import os
from tldextract import TLDExtract

extractor = TLDExtract(cache_dir=os.environ["TLDEXTRACT_CACHE"])
extractor("www.google.com")

Now, create a conditional breakpoint here, where the condition is namespace == "publicsuffix.org-tlds".

Expected behaviour

When running the above program, the breakpoint should be hit, but no KeyError should be raised.

Actual behaviour

The breakpoint is hit once during __call__(…), and a KeyError is raised immediately because the cache file can't be found.

Explanation

The method run_and_cache accepts a namespace, which is used to calculate the cache file path. But when the file is downloaded, it uses the hardcoded namespace "urls", which places the file in the wrong location.
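The mismatch can be sketched roughly as follows. This is not tldextract's actual code: ToyDiskCache, its set/get methods, the "suffix-list" key, and the demo function are hypothetical names made up for illustration. The only assumption carried over from the description above is that the namespace is part of the cache file path, so a value written under "urls" cannot be found under "publicsuffix.org-tlds".

```python
import hashlib
import json
import os
import tempfile


class ToyDiskCache:
    """Hypothetical namespace-keyed file cache (NOT tldextract's real class)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir

    def _path(self, namespace, key):
        # The namespace is part of the path, so writer and reader must agree on it.
        digest = hashlib.sha256(repr(key).encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, namespace, digest + ".json")

    def set(self, namespace, key, value):
        path = self._path(namespace, key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(value, f)

    def get(self, namespace, key):
        path = self._path(namespace, key)
        if not os.path.isfile(path):
            raise KeyError(f"namespace: {namespace} key: {key!r}")
        with open(path, encoding="utf-8") as f:
            return json.load(f)


def demo():
    """Write under one namespace, read under another, and report what happens."""
    with tempfile.TemporaryDirectory() as cache_dir:
        cache = ToyDiskCache(cache_dir)
        key = "suffix-list"  # hypothetical cache key

        # Write side: the value is stored under the hardcoded namespace "urls".
        cache.set("urls", key, ["com", "co.uk"])

        # Read side: the lookup uses a different namespace, so the file is missed.
        try:
            cache.get("publicsuffix.org-tlds", key)
            lookup_failed = False
        except KeyError:
            lookup_failed = True

        # Reading with the namespace used for writing finds the value.
        same_namespace_value = cache.get("urls", key)
    return lookup_failed, same_namespace_value
```

In this toy version, making both sides use the same namespace is enough to make the lookup succeed; whether the real fix takes that shape depends on tldextract's internals.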

I'll write a PR that fixes this problem.