psf / requests

Frequently hangs or times out when trying to fetch https://openssl.org/source/ #6755

Closed dschepler closed 4 months ago

dschepler commented 4 months ago

I use requests in a web scraper that tracks software versions. Recently, fetching one particular page, https://openssl.org/source/, has started to hang or time out frequently.

I was able to reproduce this simply by running: python3 -c 'import requests; x = requests.get("https://openssl.org/source/", timeout=30)'

Expected Result

Should finish the request promptly (unless, of course, there's some connectivity issue; however, that doesn't seem to be the case since wget, curl, and Firefox have no issues fetching the same page).

Actual Result

Frequently, the fetch fails due to a timeout. Traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/urllib3/response.py", line 748, in _error_catcher
    yield
  File "/usr/lib/python3.12/site-packages/urllib3/response.py", line 873, in _raw_read
    data = self._fp_read(amt, read1=read1) if not fp_closed else b""
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/urllib3/response.py", line 856, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 479, in read
    s = self.fp.read(amt)
        ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/socket.py", line 708, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/ssl.py", line 1252, in recv_into
    return self.read(nbytes, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/ssl.py", line 1104, in read
    return self._sslobj.read(len, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/requests/models.py", line 820, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/usr/lib/python3.12/site-packages/urllib3/response.py", line 1060, in stream
    data = self.read(amt=amt, decode_content=decode_content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/urllib3/response.py", line 949, in read
    data = self._raw_read(amt)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/urllib3/response.py", line 872, in _raw_read
    with self._error_catcher():
  File "/usr/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/usr/lib/python3.12/site-packages/urllib3/response.py", line 753, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.") from e  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='openssl.org', port=443): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.12/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/requests/sessions.py", line 746, in send
    r.content
  File "/usr/lib/python3.12/site-packages/requests/models.py", line 902, in content
    self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/requests/models.py", line 826, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='openssl.org', port=443): Read timed out.

(If no timeout is specified, then it seems to hang indefinitely -- at least for several hours.)
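For reference, here is a minimal sketch of how a scraper might guard this call. The helper name `fetch`, its default values, and the error strings are hypothetical. It relies on two things: requests accepts a `(connect, read)` tuple for `timeout`, and, as the traceback above shows, a read timeout that occurs while the body is downloading is raised as `requests.exceptions.ConnectionError` rather than as a `Timeout`.

```python
import requests


def fetch(url, connect_timeout=5.0, read_timeout=30.0, session=None):
    """Return (body, error); exactly one of the two is None."""
    http = session if session is not None else requests
    try:
        # A (connect, read) tuple sets the two timeouts separately. The read
        # timeout limits each wait for data, not the whole download.
        resp = http.get(url, timeout=(connect_timeout, read_timeout))
        resp.raise_for_status()
        return resp.content, None
    except requests.exceptions.Timeout as exc:
        # Connecting, or waiting for the response headers, timed out.
        return None, f"timeout: {exc}"
    except requests.exceptions.ConnectionError as exc:
        # A read timeout during the body download surfaces here instead,
        # as in the traceback above.
        return None, f"connection error: {exc}"
    except requests.exceptions.RequestException as exc:
        # Anything else requests can raise, e.g. an HTTP error status.
        return None, f"request failed: {exc}"
```

Calling `fetch("https://openssl.org/source/")` returns the page bytes on success, or `None` plus a short description of the failure, so the scraper can log the error and move on.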

Reproduction Steps

import requests
x = requests.get("https://openssl.org/source/", timeout=30)

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "5.2.0"
  },
  "charset_normalizer": {
    "version": "3.3.2"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "3.7"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.12.4"
  },
  "platform": {
    "release": "6.9.7",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.32.3"
  },
  "system_ssl": {
    "version": "30300010"
  },
  "urllib3": {
    "version": "2.2.2"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": false
}

(Note that I'm not sure whether this is actually an issue with requests, or maybe just the openssl.org website doing something weird. However, I haven't been able to reproduce similar problems using curl, wget, or Firefox.)
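One way to narrow this down is to repeat the fetch with only the Python standard library, which bypasses both requests and urllib3. The sketch below is a hedged diagnostic, not a fix. The function name `fetch_with_stdlib` is hypothetical, and the result only hints at where to look: a stall here too would point at the interpreter's TLS stack or the network path rather than at requests.

```python
import time
import urllib.request


def fetch_with_stdlib(url, timeout=30.0):
    """Fetch url without requests/urllib3; return (status, n_bytes, seconds)."""
    start = time.monotonic()
    # The timeout applies to each blocking socket operation,
    # not to the request as a whole.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()
        status = resp.status
    return status, len(body), time.monotonic() - start
```

Running `fetch_with_stdlib("https://openssl.org/source/")` several times and comparing the timings against the requests call gives a rough same-interpreter baseline.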

nateprewitt commented 4 months ago

Please review the documentation on using Requests with respect to timeouts.

dschepler commented 4 months ago

OK, x = requests.get("https://openssl.org/source/", timeout=30) does seem to time out instead of hanging indefinitely. The question remains, though, why it frequently times out when none of the command line tools or web browsers I've tested have any such issues fetching that page.
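The traceback shows the timeout firing while the body was being read, after the response headers had already arrived. A rough sketch like the one below can show how far a request gets before it stalls. The helper name `time_phases`, the report keys, and the default timeouts are hypothetical; `stream=True` and `iter_content` are standard requests features.

```python
import time

import requests


def time_phases(url, connect_timeout=5.0, read_timeout=30.0, session=None):
    """Report how long the headers took and how much of the body arrived."""
    http = session if session is not None else requests
    report = {"headers_s": None, "body_bytes": 0, "total_s": None, "error": None}
    start = time.monotonic()
    try:
        # stream=True returns as soon as the headers are in;
        # the body is only read by the loop below.
        with http.get(url, stream=True,
                      timeout=(connect_timeout, read_timeout)) as resp:
            report["headers_s"] = time.monotonic() - start
            for chunk in resp.iter_content(chunk_size=8192):
                # Counts bytes after any content decoding (e.g. gzip).
                report["body_bytes"] += len(chunk)
    except requests.exceptions.RequestException as exc:
        report["error"] = repr(exc)
    report["total_s"] = time.monotonic() - start
    return report
```

If `headers_s` is small and `error` is set with only part of the body counted, the server (or something on the path) stopped sending mid-response, which is a different problem from a slow connect.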

sigmavirus24 commented 4 months ago

It never hangs or times out for me. There is no way for us to debug this for you or provide you an answer.