readthedocs / readthedocs.org

The source code that powers readthedocs.org
https://readthedocs.org/
MIT License

Requests 403 Client Error #11763

Open bjlittle opened 2 weeks ago

bjlittle commented 2 weeks ago

Details

Expected Result

We've successfully been using pooch to download various external assets required to build the sphinx-gallery of our documentation.

However, we're now getting a 403 Client Error. Has there been a very recent RTD server-side change that may be causing this?

Actual Result

For further details see https://readthedocs.org/projects/geovista/builds/26264004/

humitos commented 2 weeks ago

It seems it's getting 403 when trying to download https://raw.githubusercontent.com/bjlittle/geovista-data/2024.10.2/assets/natural_earth/physical/ne_coastlines_10m.vtk.bz2. However, I'm able to download that file without issues.

I'd say it was a temporary error on the GitHub side. This doesn't seem related to Read the Docs.

bjlittle commented 2 weeks ago

Thanks @humitos for getting back so quickly :100:

I'm able to successfully wget this file and also download it with pooch.

How are you replicating this issue on your side? And what are you getting? 403 also?

humitos commented 2 weeks ago

> How are you replicating this issue on your side?

Just clicking on that link; it works and downloads the file.

bjlittle commented 2 weeks ago

Okay, I think this is a rate-limiting issue on the GH server side when pulling assets from GH to RTD.

I'm just going to close this issue, thanks again @humitos :beers:

kmuehlbauer commented 2 weeks ago

This is a persistent issue and is affecting more packages that build docs on RTD and use pooch to retrieve assets. It's hard to debug if this only happens on RTD and not locally or in other setups. From what I can tell, it already fails on the first fetch of an asset, so I doubt the rate limiting theory.

Here are two more links to very recent issues:

@humitos How would we get to the bottom of this? Do you see a way to debug this from your side? I was assuming some other dependency issue but have not spotted any recent changes so far. One problem for debugging from the user side is the missing mamba environment listing (can we activate that somehow?).

humitos commented 2 weeks ago

@kmuehlbauer I don't have a different way to debug this from my side. I would recommend that you first create a minimal reproducible example that generates the issue outside the environment of your project.

> One problem for debugging from user side is the missing mamba environment listing (can we activate that somehow?).

What is "mamba environment list"? If I understand correctly you refer to the list of packages installed in the environment. If that's correct, you can run mamba list using https://docs.readthedocs.io/en/latest/build-customization.html#extend-the-build-process

kmuehlbauer commented 2 weeks ago

Thanks @humitos, that's helpful. Would you mind opening the issue again, until the root cause is found?

kmuehlbauer commented 2 weeks ago

@humitos I've distilled the issue to just use https://github.com/readthedocs/tutorial-template and requests together with a readthedocs github resource.

Pull Request

https://github.com/kmuehlbauer/pooch_rtd_issue/pulls

Build logs

https://app.readthedocs.org/projects/pooch-rtd-issue/builds/26278637/

Code

import requests

output_file = open("output_file.nc", "w+b")
url = "https://github.com/readthedocs/readthedocs.org/raw/refs/heads/main/docs/dev/code-of-conduct.rst"
print("downloading: ", url)
try:
    response = requests.get(url, timeout=30, allow_redirects=True)
    # Raises requests.HTTPError for 4xx/5xx responses, e.g. the 403 seen in the build.
    response.raise_for_status()
    output_file.write(response.content)
finally:
    # Close the file whether or not the download succeeded.
    output_file.close()

Something is broken between RTD and GitHub and this is as far as I can get. I'd appreciate it if you could sort this out with GitHub, as this seems to be a problem of RTD builds. Thanks!

ericholscher commented 2 weeks ago

Hrm, I just tried to reproduce this on our build servers with a shell, and I got a 200:

docs@build-default-i-00923ec13a48b3b12(org):~/checkouts/readthedocs.org$ python
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> 
>>> output_file = open("output_file.nc", "w+b")
>>> url = "https://github.com/readthedocs/readthedocs.org/raw/refs/heads/main/docs/dev/code-of-conduct.rst"
>>> print("downloading: ", url)
downloading:  https://github.com/readthedocs/readthedocs.org/raw/refs/heads/main/docs/dev/code-of-conduct.rst
>>> try:
...     response = requests.get(url, timeout=30, allow_redirects=True)
...     response.raise_for_status()
...     output_file.write(response.content)
... finally:
...     output_file.close()
... 
4401
>>> print(response.status_code)
200

So it isn't something that's totally broken 🤔

ericholscher commented 2 weeks ago

It looks like other folks have been having a similar issue: https://stackoverflow.com/questions/39907742/github-api-is-responding-with-a-403-when-using-requests-request-function

Is it related to this perhaps?

ericholscher commented 2 weeks ago

I'm wondering if this was a temporary networking issue or something, since I can't seem to reproduce it on our build servers at all. Curious whether it still fails if you rebuild your test repo?

ericholscher commented 2 weeks ago

Hrm, I was able to reproduce it in the build... https://app.readthedocs.org/projects/eric-pooch-rtd-issue/builds/26282150/

ericholscher commented 2 weeks ago

I updated it to print out the error: https://app.readthedocs.org/projects/eric-pooch-rtd-issue/builds/26282177/

<h1>Access to this site has been restricted.</h1>

<p>
<br>
If you believe this is an error,
please contact <a href="https://support.github.com">Support</a>.
</p>

I guess we need to contact GitHub support 🙃
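A minimal sketch of how the body of a failing response could be surfaced in the build log, which is roughly what revealed the HTML message above. The helper names `describe_failure` and `fetch` are hypothetical, and this is not the exact code from the linked build.

```python
import requests


def describe_failure(response, limit=2000):
    """Return a short, printable summary of an unsuccessful response."""
    # Truncate the body so a large error page does not flood the build log.
    body = response.text[:limit]
    return f"{response.status_code} {response.reason} for {response.url}\n{body}"


def fetch(url, timeout=30):
    """Download `url`, printing the server's error page before raising."""
    response = requests.get(url, timeout=timeout, allow_redirects=True)
    if not response.ok:
        # The body often says more than the bare status code does.
        print(describe_failure(response))
    # Still raise requests.HTTPError so the build fails visibly.
    response.raise_for_status()
    return response.content
```

Printing before calling `raise_for_status()` keeps the failure visible while still recording what the server actually sent back.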

jrbourbeau commented 2 weeks ago

👋 I ran into the same problem over in the Dask docs (xref https://github.com/dask/dask/pull/11522). Stack Overflow suggested setting a custom User-Agent in the header (done here: https://github.com/dask/dask-sphinx-theme/pull/91), which seems to have fixed things (the docs build is passing again).

Though fixing things on the GitHub side would be much more convenient : )

ericholscher commented 2 weeks ago

Yea, looks like the issue is the lack of a user agent. When I updated the example to pass a user agent, it works:

https://github.com/ericholscher/pooch_rtd_issue/blob/768198a654f68bedd85d30854a7a7a9af893ba4a/docs/source/conf.py#L42-L50

https://app.readthedocs.org/projects/eric-pooch-rtd-issue/builds/26282216/

Guessing this is GH getting hammered by AI bots, and restricting requests without agents, like the rest of us.
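To illustrate the fix described above, here is a minimal sketch of a requests download that sends an explicit User-Agent header. The `USER_AGENT` value and the `make_session` and `download` helpers are hypothetical names chosen for illustration, not taken from the linked conf.py. Without an explicit header, requests sends its default `python-requests/<version>` agent.

```python
import requests

# Hypothetical identifier; pick something that describes your project.
USER_AGENT = "my-docs-build/1.0"


def make_session(user_agent=USER_AGENT):
    """Create a session whose requests all carry the given User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def download(url, dest, session=None, timeout=30):
    """Fetch `url` into the file `dest` and return the number of bytes written."""
    session = session or make_session()
    response = session.get(url, timeout=timeout, allow_redirects=True)
    # Raise requests.HTTPError on 4xx/5xx responses instead of saving an error page.
    response.raise_for_status()
    with open(dest, "wb") as fh:
        fh.write(response.content)
    return len(response.content)
```

Setting the header on a session means every asset fetched through it carries the same agent, so individual calls do not need to repeat it.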

kmuehlbauer commented 2 weeks ago

Thanks @ericholscher for testing it and @jrbourbeau for the solution. It's a pity that RTD and GitHub have to take these countermeasures, but totally reasonable and understandable.

Thanks a bunch :heart:

kmuehlbauer commented 2 weeks ago

> I updated it to print out the error: https://app.readthedocs.org/projects/eric-pooch-rtd-issue/builds/26282177/
>
> <h1>Access to this site has been restricted.</h1>
>
> <p>
> <br>
> If you believe this is an error,
> please contact <a href="https://support.github.com">Support</a>.
> </p>

@ericholscher Would it make sense to come up with a solution to provide some auth token to the request? At least for those builds which have been triggered/authorized with GitHub this might be possible for requests which reach out to GitHub resources. Any thoughts?

mathause commented 2 weeks ago

For pooch, you have to define a downloader:

downloader = pooch.HTTPDownloader(headers={"User-Agent": "agent"})

return REMOTE_RESSOURCE.fetch(name, downloader=downloader)