psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

requests really long compared to curl #4883

Closed tobiasBora closed 3 years ago

tobiasBora commented 5 years ago

First thanks for the great job.

For some specific requests, requests is very slow (around 8s on average), while curl configured with more or less the same parameters is nearly instantaneous.

Expected Result

I would expect requests to take much less than 8s to process a request. I tried changing Connection: close/keep-alive and Accept-Encoding: identity/gzip/gzip, deflate/...

Reproduction Steps

On one side run:

curl -X 'GET' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0' -H 'Accept-Encoding: identity' -H 'Accept: application/json, text/plain, */*' -H 'Connection: close' -H 'Referer: https://www.oui.sncf/bons-plans/tgvmax' -H 'X-CT-Locale: fr' -H 'X-User-Agent: CaptainTrain/1542980776(web) (Ember 3.4.6)' -H 'content-type: application/json' -H 'x-not-a-bot: i-am-human' 'https://www.trainline.fr/api/v5_1/stations?context=search&q=Paris'

and on the other side:

import requests
from collections import OrderedDict

###### Debug tools
def pretty_string_request(req):
    """Pretty string a request"""
    return '{}\n{}\n{}\n\n{}\n{}'.format(
        '>>>-----------START-----------',
        req.method + ' ' + req.url,
        '\n'.join('{}: {}'.format(k, v) for k, v in req.headers.items()),
        req.body,
        '<<<-----------STOP-----------',
    )

def pretty_string_response(response):
    return pretty_string_request(response.request)

###### Code

# Note: using an OrderedDict here does not actually order anything
headers = OrderedDict([('Accept', 'application/json, text/plain, */*'),
                       ('Accept-Encoding', 'identity'),
                       ('Connection', 'close'),
                       ('Referer', 'https://www.oui.sncf/bons-plans/tgvmax'),
                       ('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0'),
                       ('X-CT-Locale', 'fr'),
                       ('X-User-Agent', 'CaptainTrain/1542980776(web) (Ember 3.4.6)'),
                       ('content-type', 'application/json'),
                       ('x-not-a-bot', 'i-am-human')])

url = 'https://www.trainline.fr/api/v5_1/stations'
payload = {'context':'search',
           'q': 'Paris' }
print("Before request...")
response = requests.get(url, headers=headers, params=payload)
print("After request...")
print(pretty_string_response(response))
print(response.json())

I also took a Wireshark capture (the first packets, up to and including N=52, are from curl; the others are from requests). As far as I can tell, the server shouldn't even be able to distinguish whether curl or requests is on the other side, as they are supposed to send the same headers...

[Screenshot of the Wireshark capture: screenshot_20181128_060532]

EDIT: I also noticed that if I keep only the header 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0', requests is as quick as curl. However, problems arise in two scenarios: when I add either the header 'Connection': 'close' or the header 'content-type': 'application/json', whereas both are fine with curl. Maybe it would be useful to understand why requests does not always behave like curl?
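The header-by-header comparison described in this edit can be made repeatable with a small timing harness. The following is only a sketch under stated assumptions: `time_header_variants` and `fetch_with_requests` are hypothetical helper names that do not appear in the original report, the `repeats` and `timeout` values are arbitrary, and the fetcher is passed in as a callable so the harness itself can be run without network access.

```python
import time


def time_header_variants(fetch, base_headers, extra_headers, repeats=3):
    """Time `fetch` with the base headers, then with each extra header added alone.

    `fetch` is any callable that accepts a headers dict (a hypothetical
    stand-in for the real HTTP call). Returns {label: best elapsed seconds}.
    """
    variants = {"base only": dict(base_headers)}
    for name, value in extra_headers.items():
        variant = dict(base_headers)
        variant[name] = value
        variants["+ " + name] = variant

    timings = {}
    for label, headers in variants.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fetch(headers)
            best = min(best, time.perf_counter() - start)
        timings[label] = best
    return timings


def fetch_with_requests(headers):
    """Example fetcher for the URL from the report; needs the requests package."""
    import requests  # imported lazily so the harness above stays stdlib-only

    return requests.get(
        "https://www.trainline.fr/api/v5_1/stations",
        headers=headers,
        params={"context": "search", "q": "Paris"},
        timeout=30,  # assumption: an arbitrary upper bound, not from the report
    )
```

Calling `time_header_variants(fetch_with_requests, {'User-Agent': ...}, {'Connection': 'close', 'content-type': 'application/json'})` would then show which single added header coincides with the slowdown. Taking the best of several runs is a design choice to reduce the influence of one-off network noise.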

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.3"
  },
  "idna": {
    "version": "2.7"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.6.7"
  },
  "platform": {
    "release": "4.18.0-2-amd64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010008f",
    "version": "18.0.0"
  },
  "requests": {
    "version": "2.20.0"
  },
  "system_ssl": {
    "version": "1010100f"
  },
  "urllib3": {
    "version": "1.24"
  },
  "using_pyopenssl": true
}


rHuggler commented 5 years ago

@kenneth-reitz @timofurrer do you know who can take a look into this?

vido-retake commented 5 years ago

Hi guys, I'm facing a similar issue. requests.get() takes ~520s. It used to be <1s. I don't know what changed to cause this huge delay. I can provide the cProfile output.
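For readers who want to collect comparable cProfile output, here is a minimal sketch using only the standard library. `profile_call` is a hypothetical helper name, and the choice to sort by cumulative time and show the top 15 entries is an assumption, not something specified in this thread.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

With requests installed, something like `response, report = profile_call(requests.get, url)` followed by `print(report)` would show where the time is spent. If most of it is in socket reads, the delay is more likely on the network or server side than in Python code.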

UdayShankar517 commented 4 years ago

@kenneth-reitz, what could be the issue? Please explain the troubleshooting steps that we can follow to fix this.

Lekensteyn commented 4 years ago

I reproduced the high response time, but it does not appear to be an issue with the requests library.

It is probably some anti-crawler protection on the server. Perhaps it performs fingerprinting and recognizes that the exhibited behavior (use of HTTP/1.1) is unlikely to match the Firefox user agent (which uses HTTP/2), and therefore decides to throttle the client.
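The protocol-version part of this hypothesis can be checked from the client side by looking at the version recorded on the underlying urllib3 response, which requests exposes as `response.raw.version` (following the http.client convention: 10 for HTTP/1.0, 11 for HTTP/1.1). The sketch below is illustrative only; `http_version_label` and `negotiated_version` are hypothetical helper names, and the timeout is an arbitrary assumption.

```python
def http_version_label(raw_version):
    """Translate an http.client-style version integer into a readable label."""
    labels = {10: "HTTP/1.0", 11: "HTTP/1.1"}
    return labels.get(raw_version, "unknown ({})".format(raw_version))


def negotiated_version(url, headers=None):
    """Fetch `url` with requests and report the HTTP version that was used."""
    import requests  # imported lazily; requires the requests package

    response = requests.get(url, headers=headers, timeout=30)
    return http_version_label(response.raw.version)
```

For the URL in this report, `negotiated_version(...)` would be expected to print HTTP/1.1 with the requests version listed above, which is consistent with the mismatch described here. It does not by itself prove that the server throttles on that basis.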

This packet capture (trainline-dsb.pcapng.zip) covers three cases and includes the TLS decryption secrets that allow Wireshark to decrypt the result.

To find the responses, use the "json" display filter. To study the whole trace, you can use the "tls" display filter.

I believe this issue should be closed as it is not a defect in the requests library.

nateprewitt commented 3 years ago

@Lekensteyn is correct about the issue. The fact that changing the user-agent resolved the issue is a clear sign that this is a server side defense mechanism. There isn't anything for Requests to do in this case.