Closed — tbekolay closed 3 years ago
Reasonable. Could you make a PR please?
Hi. I’m trying to solve this.
The `linkcheck` behavior is to start worker threads, then queue every link it encounters in a `queue.Queue`. Threads consume that queue to offer parallelization.
I built a solution based on priority queues, detailed below, but I’m not satisfied with it and would like to increase the scope to use `asyncio`. That has ramifications for the project, and I would like to discuss it before getting too deep into the implementation.
- Replace the work `Queue` with a `queue.PriorityQueue`. The priority is the next timestamp when the link should be checked.
- The priority is `time.time()` for new links.
- When a server replies with a 429, the thread receiving that response computes the next priority:
    - If `max_retries` for that link are reached, bail out. `max_retries` is user controlled, per domain, and can be different from `linkcheck_retries`.
    - Otherwise, the link goes back into the `PriorityQueue`, with the new priority computed as follows:
        - `Retry-After` is an int: `priority = time.time() + retry_after`
        - `Retry-After` is an HTTP date: `priority = http_date_to_unix_timestamp(retry_after)`
        - `Retry-After` absent or invalid: `priority = time.time() + exponential_backoff_starting_at_60`
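For illustration, the three `Retry-After` cases above could be computed like this (a minimal sketch; `next_check_time`, `attempt`, and the 60-second backoff base are illustrative names and choices, not actual Sphinx code):

```python
import time
from email.utils import parsedate_to_datetime


def next_check_time(retry_after, attempt, base=60.0):
    """Compute the next timestamp at which a rate-limited link may be retried.

    `retry_after` is the raw Retry-After header value (or None), and
    `attempt` counts previous 429 responses for this link.
    """
    if retry_after is not None:
        try:
            # Retry-After as an integer number of seconds.
            return time.time() + int(retry_after)
        except ValueError:
            pass
        try:
            # Retry-After as an HTTP date (RFC 7231),
            # e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            return parsedate_to_datetime(retry_after).timestamp()
        except (TypeError, ValueError):
            pass
    # Header absent or invalid: exponential backoff starting at `base` seconds.
    return time.time() + base * 2 ** attempt
```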
Each worker thread pulls from the `PriorityQueue`. If the priority is in the future, requeue the message with the same priority and go to sleep.
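A minimal sketch of that worker loop, assuming entries are `(next_check_at, uri)` tuples (the sentinel value and the sleep cap are my own illustrative choices):

```python
import queue
import time


def worker(links: queue.PriorityQueue, results: list) -> None:
    # Hypothetical worker loop. Queue entries are (next_check_at, uri)
    # tuples; (float("inf"), None) serves as a shutdown sentinel.
    while True:
        next_check_at, uri = links.get()
        if uri is None:
            break  # sentinel: no more links to check
        now = time.time()
        if next_check_at > now:
            # Too early to hit this server again: requeue with the same
            # priority, then sleep a bit so we don't spin on the queue.
            links.put((next_check_at, uri))
            time.sleep(min(next_check_at - now, 1.0))
            continue
        results.append(uri)  # placeholder for the actual HTTP check
```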
With the need to sleep to honor rate limits, threads exhibit sub-optimal behavior: the added scheduling to handle 429 responses is a poor substitute for what an event loop has built-in. For example, `loop.call_at()` would be very convenient.
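To illustrate what `loop.call_at()` buys (a sketch with hypothetical names and delays, not actual linkcheck code): a retry can be scheduled at an absolute loop time, with no requeue-and-sleep dance.

```python
import asyncio


def schedule_retry(uri: str, delay: float, results: list) -> None:
    # loop.call_at() runs a callback at an absolute loop time -- exactly
    # the "retry this link at timestamp T" primitive that the thread-based
    # design emulates with a priority queue.
    loop = asyncio.get_running_loop()
    loop.call_at(loop.time() + delay, results.append, uri)


async def main() -> list:
    results: list = []
    schedule_retry("https://example.com/retry-later", 0.2, results)
    schedule_retry("https://example.com/retry-now", 0.0, results)
    await asyncio.sleep(0.5)  # let the scheduled callbacks run
    return results
```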
The plan for moving to `asyncio`:

- Replace `requests` with `aiohttp`.
- Change the `requests.Response` to use an `aiohttp.ClientResponse`. Both look pretty similar.
- `aiohttp` expects a different input for TLS and authentication. Consider `REQUESTS_CA_BUNDLE`, `tls_cacerts`, and `auth_info` for the `linkcheck_auth` setting.
- Revisit the `linkcheck_workers` setting: replacing `requests` with `aiohttp` changes the concurrency model of `linkcheck` (coroutines could be submitted with `asyncio.run_coroutine_threadsafe()`).
To squeeze out even more performance for `linkcheck`, threads could be introduced again: a `ThreadPoolExecutor` could execute the coroutines. That introduces the complexity of sharing data across threads and is left for future work.
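Such a hybrid could look roughly like this (a sketch, not the proposed implementation: `check` stands in for the actual HTTP request, and a single background thread runs the event loop while callers submit coroutines with `asyncio.run_coroutine_threadsafe()`):

```python
import asyncio
import threading


async def check(uri: str) -> str:
    # Stand-in for an aiohttp-based link check.
    await asyncio.sleep(0.01)
    return uri


def main() -> list:
    # Run an event loop in a dedicated thread; other threads can then
    # submit link-check coroutines to it thread-safely.
    loop = asyncio.new_event_loop()
    t = threading.Thread(target=loop.run_forever, daemon=True)
    t.start()
    futures = [
        asyncio.run_coroutine_threadsafe(check(f"https://example.com/{i}"), loop)
        for i in range(3)
    ]
    results = [f.result(timeout=5) for f in futures]
    loop.call_soon_threadsafe(loop.stop)
    return results
```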
If there’s interest in that plan, I’m happy to break the next steps into separate issues and tackle them.
My largest concern is who will maintain it. I'm not good at asyncio and aiohttp, so it would be nice if you became a maintainer of the new linkcheck builder. What do you think?
Note: we also have to make sure the new one works fine on Windows.
Thanks for the quick feedback. I don’t mind maintaining `linkcheck` and helping out with `aiohttp`. I use `linkcheck` in `factory_boy` and think it’s a very useful extension generally.
My day job is as a web developer (mostly Python and Django). I’ve been doing that for about 7 years, so I’m pretty familiar with the Web and Python.
I’m new to async, but eager to work with it. This change is a good opportunity to grow more familiar with async, and fixing the (hopefully few) issues arising from this change will be a great learning experience.
If not being experienced with async beyond a couple of personal testing projects is a big concern, I’m okay sticking with the multi-threaded solution, the priority queue, and `requests`. It just doesn’t seem like the best way to solve the problem, and async offers exciting possibilities.
Sounds good :-) Let's move to the new architecture!
Is your feature request related to a problem? Please describe.
We link to GitHub PRs/issues for each entry in our changelog, which is included in our documentation. That means we fire off a hundred or so checks to various pages on GitHub. Whenever it's not running smoothly, we end up with timeouts, which slows down our development until GitHub is more reliable again.
Describe the solution you'd like
Allow the linkcheck builder to retry requests a few times (maybe 3, or a configurable amount) with exponential backoff. https://www.peterbe.com/plog/best-practice-with-retries-with-requests gives a quick and easy implementation of how to do this with `requests.get`, which is what linkcheck currently uses to fetch pages.

Describe alternatives you've considered
Another thing that might help would be to collect all the links to check in one step, then sort them so that requests to the same domain happen more sequentially than requests to different domains. I.e., you would do one request to each unique domain first, before trying a request to a domain you've already hit.
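The linked article's approach can be sketched as follows (`session_with_retries` and the status list are illustrative choices, not linkcheck's actual code): mount an `HTTPAdapter` configured with urllib3's `Retry` so that GET requests transparently retry with exponential backoff.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def session_with_retries(retries: int = 3, backoff_factor: float = 0.5) -> requests.Session:
    # A Session whose adapters retry on rate limits and server errors,
    # sleeping backoff_factor * 2**n seconds between attempts.
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

A caller would then use `session_with_retries().get(url)` wherever it previously called `requests.get(url)`.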