sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/

Add support for custom HTTP headers @ linkcheck #7247

Closed: webknjaz closed this issue 4 years ago

webknjaz commented 4 years ago

Is your feature request related to a problem? Please describe.

Currently, the Accept HTTP header is hardcoded: https://github.com/sphinx-doc/sphinx/blob/dbefc9865d8c2c4006ed52475d1bff865358cd00/sphinx/builders/linkcheck.py#L111. When I hit servers that require custom headers, the only option is to add those URLs to the ignore list, which is what I'd like to avoid.

Describe the solution you'd like

Make HTTP headers configurable.

Describe alternatives you've considered

Adding the affected URL to linkcheck_ignore

Additional context

We have a GitHub Actions badge in our README, which then gets embedded into the Sphinx docs. Running linkcheck used to work, but now it doesn't. After some debugging I discovered that the HTTP request works if it doesn't include an Accept: header, but the header that Sphinx injects causes GitHub's server to respond with HTTP/1.1 406 Not Acceptable. Interestingly, if you open this URL in a browser, it works: https://github.com/cherrypy/cheroot/workflows/Test%20suite/badge.svg. Google Chrome sends the following header: Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9.

$ curl --head -H 'User-Agent: Sphinx/2.4.3 requests/2.23.0 python/3.7.4' https://github.com/cherrypy/cheroot/workflows/Test%20suite/badge.svg
HTTP/1.1 200 OK
date: Tue, 03 Mar 2020 18:53:13 GMT
content-type: image/svg+xml; charset=utf-8
server: GitHub.com
status: 200 OK
vary: X-PJAX, Accept-Encoding, Accept, X-Requested-With
cache-control: max-age=300, private
etag: W/"6e6be7ee648f0c6c3c74f436c281da7e"
strict-transport-security: max-age=31536000; includeSubdomains; preload
x-frame-options: deny
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
expect-ct: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
content-security-policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com www.google-analytics.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com wss://live.github.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com *.githubusercontent.com; manifest-src 'self'; media-src 'none'; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com
Age: 0
Set-Cookie: _gh_sess=p238CMtx5HWH1dro34Ug5297UE6yfWFIdIXjOC%2Fz6c0KFat8kP6FKO%2BpnLDFOrOop4N%2FjA%2FnKLDavWjC6VVQYoPNNbqh%2B4N41map9mUfvFhhx8HMW19Du1h5fn9g2Tv4TZcNSJfwfFV465Xzxq9t213ud1LEQEukuzbcIFn1hNy%2FBbmJ%2BF0MjS6eZk%2BPVQ2kLNdrtaBz%2BJ6RFTwhyu7nrxXLbgh08T2mBKLI8BREu3%2Fh1f7S%2FJ%2BIaQFq5mFItrQ140%2BSDmMgWF7tGKuZqDnHYw%3D%3D--YFLr0%2B3yKMbqGo%2Ff--P2WJDemx1goxFvxleo%2FnsQ%3D%3D; Path=/; HttpOnly; Secure
Set-Cookie: _octo=GH1.1.1438747173.1583261593; Path=/; Domain=github.com; Expires=Wed, 03 Mar 2021 18:53:13 GMT; Secure
Set-Cookie: logged_in=no; Path=/; Domain=github.com; Expires=Wed, 03 Mar 2021 18:53:13 GMT; HttpOnly; Secure
Accept-Ranges: bytes
Content-Length: 2211
X-GitHub-Request-Id: 1C24:16DCA:5FBDEC6:880AF26:5E5EA799
$ curl --head -H 'Accept: text/html,application/xhtml+xml;q=0.9,*/*;q=0.8' -H 'User-Agent: Sphinx/2.4.3 requests/2.23.0 python/3.7.4' https://github.com/cherrypy/cheroot/workflows/Test%20suite/badge.svg
HTTP/1.1 406 Not Acceptable
date: Tue, 03 Mar 2020 18:53:49 GMT
content-type: text/html
server: GitHub.com
status: 406 Not Acceptable
vary: X-PJAX, Accept-Encoding, Accept, X-Requested-With
cache-control: no-cache
strict-transport-security: max-age=31536000; includeSubdomains; preload
x-frame-options: deny
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
expect-ct: max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"
content-security-policy: default-src 'none'; base-uri 'self'; block-all-mixed-content; connect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com www.google-analytics.com github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-manifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com wss://live.github.com; font-src github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.githubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-cloud.s3.amazonaws.com *.githubusercontent.com; manifest-src 'self'; media-src 'none'; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com
Age: 0
Set-Cookie: _gh_sess=cq2fhZutOVFanPybUxb%2F5FN5FRD9j%2FKOq2N5WN83m30t6Xnu8y1Zgcc4kBIw0MiYid9VOJTComfgw5O4jAWg91GLK0peYu9XfNKn2bPmd7GDmjYwak2QE%2FvElg%2BVs8yuL8lMOdtZSxAfQdObkQHyPM9KCs%2FXj7qofetrUASScJ2v%2BBdIw%2BUDANHDp%2FoH0ckbWIY4ouHQD%2BAy1KG00IMLjyRJ%2Fgr0V57JhemCUNk0pqscP7vFagUR%2BicETzEd2%2B%2Fy45pkpTTiwqds%2BFyoPoxn1g%3D%3D--Po2%2Boh3TsKnH2dDk--uLvCvDG7SDRtQP9jQ5%2B3Pw%3D%3D; Path=/; HttpOnly; Secure
Set-Cookie: _octo=GH1.1.1102872677.1583261629; Path=/; Domain=github.com; Expires=Wed, 03 Mar 2021 18:53:49 GMT; Secure
Set-Cookie: logged_in=no; Path=/; Domain=github.com; Expires=Wed, 03 Mar 2021 18:53:49 GMT; HttpOnly; Secure
Content-Length: 0
X-GitHub-Request-Id: 1E08:1FAA7:4596C76:6318A3E:5E5EA7BD
tk0miya commented 4 years ago

Confirmed. It seems better not to send the Accept: header to GitHub. On the other hand, some servers require the header (see #5140). So it would be better to allow customizing it via code or configuration.

Just an idea: a linkcheck_request_header option might be helpful for such cases:

linkcheck_request_header = {
    '*': {'Accept': 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'},
    'https://github.com': {},
    # ...
}
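
To make the idea concrete, here is a minimal sketch of how a mapping in this shape could be resolved to the headers for one URL. It is not Sphinx's implementation: the helper name `headers_for` and the longest-prefix matching rule are assumptions made for illustration, and `linkcheck_request_header` is only the option name proposed in this comment.

```python
from typing import Dict

# Mapping in the shape proposed above: URL prefix -> request headers.
# '*' is treated as the fallback entry (an assumption of this sketch).
linkcheck_request_header: Dict[str, Dict[str, str]] = {
    '*': {'Accept': 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'},
    'https://github.com': {},
}


def headers_for(url: str, mapping: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Hypothetical helper: pick the headers to send for ``url``.

    The entry with the longest key that is a prefix of ``url`` wins;
    if no key matches, the '*' entry is used (or no headers at all).
    Note that plain prefix matching is naive: 'https://github.com'
    would also match 'https://github.community/'.
    """
    best = None
    for prefix in mapping:
        if prefix != '*' and url.startswith(prefix):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is not None:
        return dict(mapping[best])
    return dict(mapping.get('*', {}))
```

With this mapping, the badge URL from the report would be requested with no extra headers, while any other host would still receive the default Accept header.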
webknjaz commented 4 years ago

@tk0miya this looks like a good idea.

tk0miya commented 4 years ago

Oops, I overlooked this issue for the 3.0 release... I've just set the milestone for it now.