psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

requests raises requests.exceptions.ReadTimeout: HTTPConnectionPool while other libraries work fine #6064

Closed · 5j9 closed this issue 2 years ago

5j9 commented 2 years ago

Consider the following script:

from requests import Session
from time import sleep

print('requests')
session = Session()
url = 'http://tsetmc.com/Loader.aspx?ParTree=15'
r = session.get(url, timeout=5)  # 200 OK
print(r.status_code)

sleep(200)  # if the idle time is greater than ~120 seconds, then the next `session.get` attempt will fail

r = session.get(url, timeout=5)
print(r.status_code)

The above script fails with:

requests
200
Traceback (most recent call last):
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 1374, in getresponse
    response.begin()
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\http\client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\adapters.py", line 440, in send
    resp = conn.urlopen(
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connectionpool.py", line 785, in urlopen
    retries = retries.increment(
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\util\retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\packages\six.py", line 770, in reraise
    raise value
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\urllib3\connectionpool.py", line 340, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='tsetmc.com', port=80): Read timed out. (read timeout=5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\a\AppData\Roaming\JetBrains\PyCharmCE2021.3\scratches\scratch_3.py", line 13, in <module>
    r = session.get(url, timeout=5)
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 542, in get
    return self.request('GET', url, **kwargs)
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\a\AppData\Local\Programs\Python\Python310\lib\site-packages\requests\adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='tsetmc.com', port=80): Read timed out. (read timeout=5)

Process finished with exit code 1

I believe there is some issue with how requests retries connections from the connection pool.

Apparently a similar script works fine when using other libraries. I've tried the following three (aiohttp, urllib3, and httpx):

import aiohttp
import asyncio

print('aiohttp')
url = 'http://tsetmc.com/Loader.aspx?ParTree=15'

async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:  # same URL as in the requests script
            print(resp.status)
        await asyncio.sleep(200)
        async with session.get(url) as resp:
            print(resp.status)

asyncio.run(main())

import urllib3
from time import sleep

print('urllib3')
http = urllib3.PoolManager()
url = 'http://tsetmc.com/Loader.aspx?ParTree=15'

resp = http.request('GET', url)
print(resp.status)

sleep(200)

resp = http.request('GET', url)
print(resp.status)

import httpx
from time import sleep

print('httpx')
client = httpx.Client()
url = 'http://tsetmc.com/Loader.aspx?ParTree=15'

r = client.get(url)  # same URL as in the requests script
print(r.status_code)

sleep(200)

r = client.get(url)
print(r.status_code)

Expected Result

requests should be able to handle this situation (an idle keep-alive connection silently dropped by the server) the way the other libraries do, instead of raising ReadTimeout.

System Information

$ python -m requests.help
{
  "chardet": {
    "version": null
  },
  "charset_normalizer": {
    "version": "2.0.10"
  },
  "cryptography": {
    "version": "36.0.1"
  },
  "idna": {
    "version": "3.3"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.10.2"
  },
  "platform": {
    "release": "10",
    "system": "Windows"
  },
  "pyOpenSSL": {
    "openssl_version": "101010df",
    "version": "22.0.0"
  },
  "requests": {
    "version": "2.27.1"
  },
  "system_ssl": {
    "version": "101010df"
  },
  "urllib3": {
    "version": "1.26.8"
  },
  "using_charset_normalizer": true,
  "using_pyopenssl": true
}
nateprewitt commented 2 years ago

Hi @5j9, Requests uses urllib3 under the hood so this issue appears specific to how the service is handling calls from the Requests user-agent. If you look through closed issues you'll find it's very common practice for web servers to restrict access via Requests due to abusive scraper behavior. This isn't something we provide support for but is widely answered on platforms such as StackOverflow.
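
One quick way to check whether the User-Agent is the deciding factor is to override it on the requests side (a minimal sketch; the browser-style UA string below is just a placeholder):

from time import sleep
from requests import Session

url = 'http://tsetmc.com/Loader.aspx?ParTree=15'
session = Session()
# Replace the default 'python-requests/x.y.z' User-Agent with a
# browser-like string; if the server filters on the UA, the second
# request below should no longer time out.
session.headers['User-Agent'] = 'Mozilla/5.0'

print(session.get(url, timeout=5).status_code)
sleep(200)
print(session.get(url, timeout=5).status_code)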

5j9 commented 2 years ago

Hi @nateprewitt, I don't believe that the User-Agent is the key here. I retested my urllib3 script above with an additional headers={'User-Agent': 'python-requests/2.27.1'} parameter, and it was still able to communicate properly.

import urllib3
from time import sleep

print('urllib3')
http = urllib3.PoolManager()
url = 'http://tsetmc.com/Loader.aspx?ParTree=15'

resp = http.request('GET', url, headers={'User-Agent': 'python-requests/2.27.1'})
print(resp.status)

sleep(200)

resp = http.request('GET', url, headers={'User-Agent': 'python-requests/2.27.1'})
print(resp.status)

# will print 
# 200
# 200

Also, this does not seem to be a case of the server restricting access to requests: if it were, why would the first request succeed and only the second one fail with a timeout? If the server wanted to block requests, it could have done so on the initial attempt.

5j9 commented 2 years ago

I might be wrong, but I think I've found the culprit: https://github.com/psf/requests/blob/95f456733656ed93645ff0250bfa54f6d256f6fe/requests/adapters.py#L117

As you can see, requests sets DEFAULT_RETRIES to 0. I guess all the other libraries retry when they encounter a failed connection from the connection pool:

https://github.com/urllib3/urllib3/blob/f0dffb4e2437cb2da2ba0a6bbea6211f6fd0fa4b/src/urllib3/util/retry.py#L526
https://github.com/encode/httpcore/blob/54567ac1df3761c14f50f2cf55769921f60cd8b3/httpcore/_sync/connection_pool.py#L238
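
The difference is easy to confirm at runtime (a minimal sketch; the attribute locations are those of requests 2.27.1 and urllib3 1.26.8):

import requests.adapters
import urllib3.util.retry

# requests mounts its HTTPAdapter with max_retries=DEFAULT_RETRIES (0),
# while urllib3's own default Retry object allows 3 attempts.
print(requests.adapters.DEFAULT_RETRIES)       # 0
print(urllib3.util.retry.Retry.DEFAULT.total)  # 3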

Mounting an HTTPAdapter with a max_retries value other than 0 fixed the issue for me. All I had to do was:

from requests import Session
from requests.adapters import HTTPAdapter, Retry

session = Session()

retries = Retry(total=1)
session.mount('http://', HTTPAdapter(max_retries=retries))
...
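
Applied to the original script, the whole workaround looks like this (a minimal sketch using the same URL and idle time as above; Retry(total=1) matches the snippet above):

from time import sleep
from requests import Session
from requests.adapters import HTTPAdapter, Retry

url = 'http://tsetmc.com/Loader.aspx?ParTree=15'

session = Session()
# Allow one retry so that a GET failing on a stale pooled connection
# is re-sent instead of immediately raising ReadTimeout.
session.mount('http://', HTTPAdapter(max_retries=Retry(total=1)))

print(session.get(url, timeout=5).status_code)
sleep(200)  # long enough for the server to drop the idle keep-alive connection
print(session.get(url, timeout=5).status_code)
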
5j9 commented 2 years ago

In HTTP/1.1, all connections are considered persistent unless declared otherwise; however, many HTTP servers drop idle connections after a timeout. Since the client has no way of knowing that the server has dropped the connection in such cases, it seems only logical to me for the client to retry a request that fails on an apparently closed connection from the connection pool instead of raising an error. Thus, I think requests should change the default value of 0 for DEFAULT_RETRIES, or implement some other way to retry on closed connections.
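
Until something like that happens, the caller has to handle it, either by mounting an adapter as above or by retrying explicitly (a minimal sketch; get_with_retry is a hypothetical helper, not part of requests, and it retries only once, for idempotent GETs):

from requests import Session
from requests.exceptions import ConnectionError, ReadTimeout

def get_with_retry(session: Session, url: str, **kwargs):
    # Hypothetical helper: re-send a GET once if it fails with an error
    # that typically indicates a stale pooled connection.
    try:
        return session.get(url, **kwargs)
    except (ConnectionError, ReadTimeout):
        return session.get(url, **kwargs)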