john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License
1.84k stars 212 forks

Reconsider cache folder #212

Open john-kurkowski opened 4 years ago

john-kurkowski commented 4 years ago

Reconsider caching in the library's install folder. The GitHub issue tracker is rife with confusion about the permission warning (#9), or outright uncaught exceptions (#209). Finally do something about it. 😄

Some example approaches:

brycedrennan commented 4 years ago

We should make sure that the cache isn't shared between different versions of tldextract, or between different Python virtualenvs even if they have the same version.

I'm working on a solution but can't promise a timeline.
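
To make the isolation requirement above concrete, here is a rough sketch of one way to key a cache folder on both the package version and the Python environment. The function name `isolated_cache_dir`, its parameters, and the folder naming scheme are all hypothetical, picked for illustration; this is not necessarily what tldextract does or will do.

```python
import hashlib
import os
import sys


def isolated_cache_dir(base_dir: str, pkg_version: str, prefix: str = sys.prefix) -> str:
    """Return a cache path unique to one package version in one Python environment.

    Hypothetical helper: `prefix` defaults to sys.prefix, which differs
    between virtualenvs, so two environments never resolve to the same folder.
    """
    # Include the interpreter version too, in case an environment is
    # upgraded in place.
    py_version = ".".join(str(part) for part in sys.version_info[:3])
    env_key = f"{prefix}|{py_version}|{pkg_version}"
    digest = hashlib.sha256(env_key.encode("utf-8")).hexdigest()[:12]
    # Keep the package version readable in the folder name for easier debugging.
    return os.path.join(base_dir, f"{pkg_version}__{digest}")
```

With this scheme, upgrading the package or switching virtualenvs yields a fresh folder, at the cost of leaving stale folders behind that something would eventually have to clean up.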

bastbnl commented 4 years ago

Maybe also consider caching the content in a caching engine, like memcached. Or redis. Or elasticsearch?

Better yet: what about making the entire caching engine user-extensible, shipping it with a filesystem-based engine by default while opening it up to advanced users?
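
As a rough illustration of the pluggable idea, the sketch below defines a small backend interface plus a filesystem default. Every name here (`CacheBackend`, `FileCache`, `MemoryCache`, `cached_fetch`) is hypothetical and not part of tldextract; a memcached or redis backend would only need to provide the same two methods.

```python
import hashlib
import os
from typing import Callable, Dict, Optional, Protocol


class CacheBackend(Protocol):
    """Hypothetical interface any user-supplied cache engine would satisfy."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class FileCache:
    """Default engine: one file per key under a directory."""

    def __init__(self, directory: str) -> None:
        self.directory = directory

    def _path(self, key: str) -> str:
        # Hash the key so URLs become safe file names.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, name + ".txt")

    def get(self, key: str) -> Optional[str]:
        try:
            with open(self._path(key), encoding="utf-8") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def set(self, key: str, value: str) -> None:
        os.makedirs(self.directory, exist_ok=True)
        with open(self._path(key), "w", encoding="utf-8") as f:
            f.write(value)


class MemoryCache:
    """Example of an alternative engine an advanced user might plug in."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def cached_fetch(cache: CacheBackend, key: str, fetch: Callable[[], str]) -> str:
    """Return the cached value for `key`, calling `fetch` only on a miss."""
    value = cache.get(key)
    if value is None:
        value = fetch()
        cache.set(key, value)
    return value
```

The library would only ever call `cached_fetch`, so users who never pass a backend would see no extra complexity.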

brycedrennan commented 4 years ago

@bastbnl I'd rather see if there is a way to better hide this complexity from the user entirely. On the other hand, if you saw a way to enable that without adding a lot of complexity that could be interesting.

JohnOmernik commented 3 years ago

Honestly, I want to be able to provide a directory with a cache file that is an exact copy of the content downloaded from the URLs, and just have it work. I tried to pry apart what is happening with caching, but every time I restart my kernel, the first cache attempt tries to use the URLs in the list, and I get an ugly error. I want this to be bootstrappable, so the URL list should be readable from an environment variable, completely hidden from my users. I don't want them to have to figure out how to get the cache file; I will manage that behind our proxy. I also don't want them to have to run tldextract differently, with a different instantiation argument, just to avoid the error. This is way too complex for managed installations where I am handling things for my users.
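
For what it's worth, the behavior described above could look roughly like the sketch below: an administrator points an environment variable at a pre-seeded copy of the suffix list, and the loader reads only that file, never the network. The variable name `SUFFIX_LIST_FILE` and the function `load_suffix_rules` are made up for this example and are not tldextract settings. The only format detail assumed is that Public Suffix List comment lines start with `//`.

```python
import os
from typing import List, Mapping

# Hypothetical variable name, set by an administrator rather than by end users.
ENV_VAR = "SUFFIX_LIST_FILE"


def load_suffix_rules(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Read suffix rules from the file named by ENV_VAR, without any network access."""
    path = environ.get(ENV_VAR)
    if not path:
        raise RuntimeError(f"{ENV_VAR} is not set; no local suffix list configured")
    rules = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and PSL comment lines.
            if line and not line.startswith("//"):
                rules.append(line)
    return rules
```

Because the path comes from the environment, end users would call the library the same way as everyone else, and the managed installation decides where the data lives.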