sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/
Other
6.35k stars 2.07k forks source link

Enable async capabilities for HTTP requests #8391

Open francoisfreitag opened 3 years ago

francoisfreitag commented 3 years ago

Is your feature request related to a problem? Please describe. From #6629, the linkcheck command is currently using threads to concurrently check the status of links in the documentation. Threads are not the most efficient way to concurrently check links: once all threads are busy waiting, the work queue stops being consumed.

Describe the solution you'd like

Using an event loop allows the URL verifier to yield control to another coroutine until it gets a response. That means a single thread is able to send multiple requests concurrently and process the response as they arrive. It also facilitates handling rate-limiting, because a coroutine can be scheduled to run in the future.

The first step toward using asynchronous concurrency in linkchecker is to replace the requests library uses with an async-compatible HTTP library. The aiohttp library has an API pretty similar to that of requests and is well-established and under active development, it seems like a good choice.

Describe alternatives you've considered

Tried handling rate limits with a PriorityQueue as described in https://github.com/sphinx-doc/sphinx/issues/6629#issuecomment-723440073.

TODO

francoisfreitag commented 3 years ago

I’m hitting an issue with this work: HTTP authentication.

linkcheck_auth documents its auth_info parameter as

anything that is understood by the requests library (see requests Authentication for details).

The Requests library supports advanced authentication schemes, such as Digest auth or potentially even NTLM through https://github.com/requests/requests-ntlm. Both require multiple HTTP request-response cycle for authentication.

aiohttp only supports HTTP basic auth, and the discussion to support more advanced authentication protocols seems to have reached a standstill. It is unclear whether more authentication methods will be supported the future.

Based on the PR that introduced the linkcheck_auth, I believe basic auth was the intended use case. But with the documentation “anything that is understood by the requests library”, I would be surprised if dropping support for the digest auth or NTLM authentication method does not break someone’s doc.

Is dropping support for complex authentication an option? I doubt many users hide their doc behind complex authentication schemes, but someone somewhere certainly does that. FWIW, intersphinx (the only other extension to use Requests) only mentions basic authentication. What do you think?

francoisfreitag commented 3 years ago

An option could be to use aiohttp for most cases, public URLs or URLs protected with basic auth, and make HTTP requests that require a complex authentication scheme synchronously with requests. That means maintaining integration of two similar but slightly different HTTP libraries in the project. I don’t think if it’s worth the complexity, but throwing the idea out there for discussion.

tk0miya commented 3 years ago

Personally, I don't object to drop support them (I'd forgotten we have such a feature!). But I can't say nobody uses some complex authentication. I guess nobody uses it, but...

francoisfreitag commented 3 years ago

Another backward incompatibility: AIOHTTP does not return a requests.Response object. Extensions that expect such objects are likely to break.

A wrapper can be made for the most common bits (status code, url, encoding, content, etc.), but it cannot cover all methods from requests.Response, in particular not chunked reads:

The current plan for synchronous requests with AIOHTTP is to schedule a coroutine that makes the HTTP request on a new event loop, blocks until the event loop completes and returns the result. The AIOHTTP response object is only available during the lifespan of the event loop, that makes chunked reads impossible in a synchronous context.

AA-Turner commented 1 year ago

@jayaddison would you be interested in looking into this? httpx may be a good candidate as I believe it looks to keep the interface of Requests.

Alternativley, we create a new sphinx.util._async_http helper module instead of replacing util.requests.

To François' point above, I don't know how AIO-HTTP has developed w.r.t. authentication in the last three years.

A

francoisfreitag commented 1 year ago

I’ve been using httpx for a while now, and it’s been an easy switch from requests at $DAYJOB. That would probably ease the transition. :+1:

jayaddison commented 1 year ago

@AA-Turner:

@jayaddison would you be interested in looking into this? httpx may be a good candidate as I believe it looks to keep the interface of Requests.

Thanks; I'm tempted - I think I should try to learn more about async programming Python beforehand, though.

I'm also not sure I'm quite there in terms of familiarity with the existing linkchecker. Learning, but not confident that I understand it well enough to make significant changes yet.

@francoisfreitag:

TODO

The first task about expanding test coverage rings true. I'd suggest performance testing (throughput and resource usage) measurement too, so that we learn about any regressions.

What I'm less certain about and still thinking through is what the best path might be, after further coverage is available:

(my sense of my development experience recently is that I tend to lean more towards debugging, maintenance and fixes, and so that'd lead me towards the second option. but I don't know how much time I can commit, and don't want to get in the way of other approaches)

francoisfreitag commented 1 year ago

The test coverage has been expanded quite a bit since I wrote that issue. The goal at the time was to better handle rate limiting (esp. from GitHub), and async provides a nicer way to schedule the next check time than the current threads waking up to verify nothing needs doing, then sleeping some. Refs #6629

I would probably go the second route as well. Because of the my findings 2/3 years ago, I decided to opt for the threaded version, which didn’t need to change requests. It can totally be reconsidered, since httpx seems like a viable alternative that supports async. I’m not sure how much compatibility there is between httpx and requests plugins, and how much efforts should be put in preserving the backward compatibility with complex auth (e.g. NTLM).

Ousret commented 4 months ago

There's another candidate that aim to be a drop-in replacement for requests, see niquests. plugins should work as is and allow a progressive migration. if needed, I can assist.

Ousret commented 4 months ago

The multiplexing aspect of Niquests may come in handy to avoid converting the whole code into async immediately and make a significant gain.