john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License
1.84k stars 211 forks

setting PUBLIC_SUFFIX_LIST_URLS with environment variable #233

Open x0day opened 3 years ago

x0day commented 3 years ago

PUBLIC_SUFFIX_LIST_URLS can currently only be set through function arguments. Could it also be set through an environment variable?

JohnOmernik commented 3 years ago

This would be extremely helpful for managed environments where https connections to the outside may not be possible. A container could be built with a current copy, and being able to provide this at the command line would be extremely helpful.

john-kurkowski commented 2 years ago

#197 may be helpful in the meantime, although it's CLI args, not an environment variable.

jpmckinney commented 6 days ago

I use Scrapy, which uses tldextract. I'd like to be able to set PUBLIC_SUFFIX_LIST_URLS, via an environment variable, to an empty array, so that it always either uses the cache or the snapshot. As of now, it seems to sometimes try to update the cache, and that request can fail.

john-kurkowski commented 3 days ago

Ok, I see, I'm into this! Maybe add a TLDEXTRACT_PUBLIC_SUFFIX_LIST_URLS environment variable check here, similar to the TLDEXTRACT_CACHE_TIMEOUT read above it. I'm thinking of newline-delimited URLs in that string. I agree the most common use will be to set the environment variable to the empty string.
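For illustration, the newline-delimited parsing suggested above might look roughly like the sketch below. This is not code from tldextract: the helper name `suffix_list_urls_from_env` is hypothetical, and the variable name is only the one proposed in this thread. It assumes an unset variable should mean "use the library defaults" and an empty string should mean "no remote URLs".

```python
import os

# Name proposed in this thread; an assumption, not an existing tldextract setting.
ENV_VAR = "TLDEXTRACT_PUBLIC_SUFFIX_LIST_URLS"


def suffix_list_urls_from_env(environ=os.environ):
    """Parse newline-delimited URLs from the environment (hypothetical helper).

    Returns None when the variable is unset, so the caller can fall back to
    its defaults, and an empty tuple when it is set to the empty string.
    """
    raw = environ.get(ENV_VAR)
    if raw is None:
        return None
    # Ignore blank lines and surrounding whitespace.
    return tuple(line.strip() for line in raw.splitlines() if line.strip())
```

Under these assumptions, setting the variable to the empty string yields `()`, which a caller could pass on as "never fetch, use the cache or the snapshot".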

john-kurkowski commented 2 days ago

Maybe add a TLDEXTRACT_PUBLIC_SUFFIX_LIST_URLS environment variable check here

For parity with the CLI, the parsed env var would specially handle local files. See these lines.
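As a rough sketch of what "specially handle local files" could mean, the snippet below turns entries that point at existing files into `file://` URIs and passes everything else through unchanged. The function name `normalize_suffix_list_sources` is hypothetical, and the exact behavior of the CLI code linked above is assumed rather than copied.

```python
import os
from pathlib import Path


def normalize_suffix_list_sources(sources):
    """Convert local file paths to file:// URIs (hypothetical helper).

    Entries that are not existing files are assumed to already be URLs
    and are returned as-is.
    """
    normalized = []
    for source in sources:
        if os.path.isfile(source):
            # Resolve to an absolute path; as_uri() requires one.
            normalized.append(Path(source).resolve().as_uri())
        else:
            normalized.append(source)
    return normalized
```

With this shape, a container image could ship its own copy of the list and point the variable at that file path, with no outbound HTTPS request needed.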