john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Timeout: The file lock 'some/path/to/8738.tldextract.json.lock' could not be acquired #254

Open rudolfovic opened 2 years ago

rudolfovic commented 2 years ago

I start getting this error when I increase the number of processes / threads to a certain point.

Is there a way to increase the timeout value?

More importantly, why is a lock needed here if tldextract is only reading, not writing anything?

brycedrennan commented 2 years ago

It's fetching and saving the latest version of the top-level domains list.

The lock prevents multiple threads and processes from needlessly requesting the data and then contending as they write it to the same location.

The lock timeout is currently set to 20 seconds: https://github.com/john-kurkowski/tldextract/blob/40205f67df5f59df4b88ce47bbbe98f1eff36230/tldextract/cache.py#L78

I'd suggest either disabling the list update, or running the update once beforehand and then disabling it. See the README for details.
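A minimal sketch of the "disable the list update" option, assuming the `suffix_list_urls` and `cache_dir` constructor arguments described in the README. `make_offline_extractor` is a hypothetical helper name, not part of tldextract. With no URLs to fetch, the extractor is expected to fall back to the suffix list snapshot bundled with the package, and with the on-disk cache turned off it should have no cache file to lock. The trade-off is a possibly stale suffix list.

```python
def make_offline_extractor():
    """Build an extractor that neither fetches the list nor touches the disk cache.

    Hypothetical helper for illustration; only the TLDExtract arguments
    come from the tldextract README.
    """
    # Third-party dependency (pip install tldextract), imported lazily so the
    # helper can be defined even where the package is only present on workers.
    import tldextract

    return tldextract.TLDExtract(
        suffix_list_urls=(),  # no live HTTP request; assumed to use the bundled snapshot
        cache_dir=None,       # None (or False, as in older README examples) disables caching
    )
```

Each thread or process can build its own instance this way and reuse it for all of its lookups, for example `extract = make_offline_extractor()` followed by `extract('http://www.google.com')`.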

rudolfovic commented 2 years ago

I did look at it, but it's not clear to me whether cache_dir=False disables writing to the cache (downloading new info) or reading from the cache (fetching directly from the internet) in these examples:

# extract callable that reads/writes the updated TLD set to a different path
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/your/cache/')
custom_cache_extract('http://www.google.com')

# extract callable that doesn't use caching
no_cache_extract = tldextract.TLDExtract(cache_dir=False)
no_cache_extract('http://www.google.com')

I don't need a custom path, so in which order would you run tldextract.TLDExtract() and tldextract.TLDExtract(cache_dir=False)?

If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first and then use the second instance for my parallel task. Is that it? (or would it be enough to simply create the object to perform the fetch?)

john-kurkowski commented 2 years ago

I have trouble cobbling together when this project fetches and when it caches too. And I wrote much of the project. 😅 The docs could be improved.

> If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first and then use the second instance for my parallel task. Is that it? (or would it be enough to simply create the object to perform the fetch?)

Close. You only need the first instance. After you call the instance once (with say google.com), the list is fetched and cached, and future calls to that same instance should have no contention.
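The "call the instance once, then reuse it" advice can be sketched as follows for a thread pool. `extract_all` and its parameters are hypothetical names for illustration; `extract` is any callable, such as a `tldextract.TLDExtract()` instance. The first URL is processed in the calling thread, so any fetch and cache write happens before the pool starts.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_all(extract, urls, max_workers=8):
    """Warm `extract` with one call, then fan the remaining URLs out to threads.

    Results are returned in the same order as `urls`.
    """
    urls = list(urls)
    if not urls:
        return []

    # Single-threaded warm-up call: with a TLDExtract instance, this is the
    # point where the suffix list would be fetched and cached.
    first = extract(urls[0])

    # All threads share the same, already warmed, callable.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rest = list(pool.map(extract, urls[1:]))

    return [first, *rest]
```

It would be called as `extract_all(tldextract.TLDExtract(), urls)`. This only covers threads sharing one instance in one process; separate processes or machines each build their own instance.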

jordane95 commented 3 months ago

> I have trouble cobbling together when this project fetches and when it caches too. And I wrote much of the project. 😅 The docs could be improved.
>
> > If I understand correctly, I need to fetch some random domain using the first instance (like google.com) first and then use the second instance for my parallel task. Is that it? (or would it be enough to simply create the object to perform the fetch?)
>
> Close. You only need the first instance. After you call the instance once (with say google.com), the list is fetched and cached, and future calls to that same instance should have no contention.

In a distributed setting, each worker still has to initialize its own instance, so I don't think this approach works there. Should the dry run just be put into the source code?
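One possible workaround for the per-worker case, offered as a sketch rather than a confirmed fix: give every worker its own `cache_dir`. The lock file in the error message sits next to the cache file, so separate directories should mean separate locks. The directory layout, `worker_cache_dir`, and `make_worker_extractor` are hypothetical; only the `cache_dir` argument comes from the README excerpt in this thread.

```python
import os
import tempfile


def worker_cache_dir(worker_id, base_dir=None):
    """Return a cache directory private to one worker (hypothetical layout)."""
    base = base_dir or os.path.join(tempfile.gettempdir(), "tldextract-workers")
    return os.path.join(base, f"worker-{worker_id}")


def make_worker_extractor(worker_id, base_dir=None):
    """Build an extractor whose cache directory no other worker shares."""
    # Third-party dependency (pip install tldextract), imported inside the worker.
    import tldextract

    return tldextract.TLDExtract(cache_dir=worker_cache_dir(worker_id, base_dir))
```

The cost is that every worker fetches the list once on its first call, and `worker_id` has to be unique across all workers that share a filesystem (a rank, or hostname plus PID, would do).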