john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License
1.81k stars, 211 forks

Creating new publicsuffix file with filled suffix_list_urls #331

Open AnnaSummer opened 1 month ago

AnnaSummer commented 1 month ago

Hello! I create a TLDExtract object:

extractor = tldextract.TLDExtract(
    suffix_list_urls=["file://" + "/absolute/path/to/json/with/custom/public/suffixes"],
    cache_dir='/absolute/path/to/cache/dir',
    fallback_to_snapshot=False
)

When I call print(extractor("google.ac")), new empty (!) files (a lock file and a JSON file) for the public suffixes are created, identifiable by a new hash in their names, and the empty JSON file is then used as the public suffix file.

In the source code, in cache.py (line 108):

cache_filepath = self._key_to_cachefile_path(namespace, key)

The method _key_to_cachefile_path creates a new hash, which is used as the filename of the new public suffix file.

Is this the correct behavior? I just need to avoid HTTP requests that update the TLD cache (at any time, including the first run).

john-kurkowski commented 1 month ago

When I call print(extractor("google.ac")), new empty (!) files (lock and json) with public suffixes are created

Hello! Are you asking if creating empty cache files is expected behavior? I would say no, that's not expected. Only the files ending in *.lock should be empty, 0 bytes.

Even if this library caches 0 public suffixes, the cache file should be a few bytes, containing the JSON [[], []]. Then calling extractor("google.ac") would raise ValueError: No tlds set. Cannot proceed without tlds..
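To check which of these cases applies, it can help to look at what is actually in the cache directory. The following is only a diagnostic sketch, not part of tldextract: the helper name inspect_cache_dir is hypothetical, and it assumes only what is stated above, namely that lock files end in .lock and that any other cache file should contain JSON.

```python
import json
import os


def inspect_cache_dir(cache_dir):
    """Return (relative_path, size_in_bytes, status) for every file under cache_dir."""
    report = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if name.endswith(".lock"):
                # Lock files are expected to be empty.
                status = "lock file"
            elif size == 0:
                # A non-lock cache file of 0 bytes is the unexpected case.
                status = "empty (unexpected)"
            else:
                try:
                    with open(path, encoding="utf-8") as handle:
                        json.load(handle)
                    status = "valid JSON"
                except ValueError:
                    # Covers both JSON decoding and text decoding errors.
                    status = "not valid JSON"
            report.append((os.path.relpath(path, cache_dir), size, status))
    return sorted(report)
```

Calling inspect_cache_dir('/absolute/path/to/cache/dir') and printing each row shows at a glance whether a non-lock file is really 0 bytes or holds something like [[], []].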

john-kurkowski commented 1 month ago

So I'm not sure what's going on in your case. As an aside, if your file is local anyway, you could try skipping the cache step. The cache won't save much processing for a local file.

import tldextract

# cache_dir=None disables the on-disk cache entirely.
extractor = tldextract.TLDExtract(
    suffix_list_urls=["file://" + "/absolute/path/to/json/with/custom/public/suffixes"],
    cache_dir=None,
    fallback_to_snapshot=False
)

AnnaSummer commented 1 month ago

Thank you for the answer! I need to avoid loading TLDs from the Internet (including on the first run of the library's methods) and load them only from a local file. I tried calling TLDExtract with your parameters, but I get "ValueError: No tlds set. Cannot proceed without tlds."

My local file with public suffixes is not empty (there are tlds in that json).

john-kurkowski commented 1 month ago

there are tlds in that json

That could be the problem. The suffix_list_urls parameter expects URLs that point to plaintext files formatted like the Public Suffix List. Notice that the linked file is not JSON.
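If the custom suffixes only exist as JSON, one option is to convert them once into a plaintext file in the Public Suffix List style: one suffix per line, with comment lines starting with //. The sketch below uses only the standard library. The helper name json_suffixes_to_psl_text is hypothetical, and it assumes the JSON file is a flat array of suffix strings such as ["com", "co.uk"]; the thread does not show the actual structure, so adjust the loading step to match.

```python
import json
from pathlib import Path


def json_suffixes_to_psl_text(json_path, psl_path):
    """Write a JSON array of suffix strings as a PSL-style plaintext file."""
    # Assumption: the JSON file is a flat array of strings.
    suffixes = json.loads(Path(json_path).read_text(encoding="utf-8"))
    lines = ["// Custom suffix list generated from " + Path(json_path).name]
    for suffix in suffixes:
        # Drop surrounding whitespace and any leading dot; skip blanks.
        cleaned = suffix.strip().lstrip(".").lower()
        if cleaned:
            lines.append(cleaned)
    Path(psl_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    # Return a file:// URL for the generated plaintext file.
    return Path(psl_path).resolve().as_uri()
```

The returned file:// URL can then be passed as the single entry of suffix_list_urls, together with cache_dir=None and fallback_to_snapshot=False as in the snippet earlier in this thread, so that no HTTP request is made.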