psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

requests.get returns a response with the text doubled #3076

Closed: hachterberg closed this issue 8 years ago

hachterberg commented 8 years ago

In a rare case, I made a GET request to a server that should return an XML document. However, the response.text field contains the document twice (simply concatenated, i.e. the first half and the second half of the data are exact duplicates):

>>> resp = requests.get('http://bigr-rad-xnat.erasmusmc.nl/schemas/xnat/xnat.xsd', auth=('username', 'password'))
>>> resp.text[:len(resp.text)//2] == resp.text[len(resp.text)//2:]
True
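
The same check can be wrapped in a small helper that uses integer division, so it runs unchanged on Python 2 and Python 3. This is only a sketch; `is_doubled` is a hypothetical name and not part of requests.

```python
def is_doubled(data):
    """Return True if data consists of two identical halves concatenated."""
    # Hypothetical helper for illustration; works on both str and bytes.
    if not data or len(data) % 2:
        return False
    half = len(data) // 2  # integer division so the result is a valid slice index
    return data[:half] == data[half:]
```

It could be applied to either `resp.text` or `resp.content`.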

Of course, the document could really be like that, but it shouldn't be. I tried the same request using curl, Postman (a Chromium extension), and Chromium, and they all return just half of what requests returns. As a final test I tried it using urllib2:

>>> import urllib2, base64
>>> req = urllib2.Request('http://bigr-rad-xnat.erasmusmc.nl/schemas/xnat/xnat.xsd')
>>> # Build the Basic auth header by hand; [:-1] strips the newline that encodestring appends
>>> base64string = base64.encodestring('%s:%s' % ('username', 'password'))[:-1]
>>> authheader = "Basic %s" % base64string
>>> req.add_header("Authorization", authheader)
>>> handler = urllib2.urlopen(req)
>>> content = handler.read()

And that also gave me a single copy of the document. Hence I conclude something is probably going wrong in requests.
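
For anyone repeating this cross-check on Python 3, where `urllib2` no longer exists, a rough standard-library equivalent is sketched below. The function names `basic_auth_header` and `fetch` are hypothetical, and the credentials passed in are placeholders.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def fetch(url, username, password):
    """Fetch url with Basic auth and return the raw body bytes."""
    req = urllib.request.Request(url)
    req.add_header("Authorization", basic_auth_header(username, password))
    # urllib does not request gzip and does not decompress anything itself,
    # so the bytes returned here are the bytes the server sent as the body.
    with urllib.request.urlopen(req) as handler:
        return handler.read()
```

Calling `fetch(url, 'username', 'password')` would then play the role of `handler.read()` in the session above.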

I am using requests version 2.9.1 on Python 2.7.11 (I also tried 3.5.1) on Debian stretch. The server is a Tomcat server running XNAT (www.xnat.org). The problem is that the server is located behind a hospital firewall, and therefore I cannot give anyone outside access to reproduce or test this, but I can gather additional information.

I looked through the open bugs and did not find a similar report, but I could have missed it; if so, I am sorry.

Lukasa commented 8 years ago

@hachterberg Can you check whether the same problem exists for resp.content?

hachterberg commented 8 years ago

@Lukasa The result is the same for resp.text and resp.content

Lukasa commented 8 years ago

Ok, that's very interesting. Are you familiar with the tool Wireshark? If you are, I'd like to see the differences between the request and response headers for the transaction using requests and the transaction using curl (you'll obviously want to scrub the Authorization header).

hachterberg commented 8 years ago

I have never used Wireshark, but I gave it a go. When I copy the ASCII text, it seems to prepend some binary mess (the TCP part?). Should I clean it up? I left it in for now.

The curl request header:

[binary packet bytes]
GET /schemas/xnat/xnat.xsd HTTP/1.1
Host: bigr-rad-xnat.erasmusmc.nl
Authorization: Basic SCRUB=
User-Agent: curl/7.47.0
Accept: */*

The curl response header:

[binary packet bytes]
HTTP/1.1 200 OK
Date: Tue, 05 Apr 2016 14:28:10 GMT
Set-Cookie: JSESSIONID=SCRUB; Path=/
Set-Cookie: SESSION_EXPIRATION_TIME="1459866490593,900000"; Version=1; Path=/
Accept-Ranges: bytes
ETag: W/"154153-1452780004000"
Last-Modified: Thu, 14 Jan 2016 14:00:04 GMT
Content-Type: text/xml
Content-Length: 154153
Vary: Accept-Encoding
Connection: close

<?xml version="1.0" encoding="UTF-8"?>
[here the rest of the xml document]

The requests request header:

[binary packet bytes]
GET /schemas/xnat/xnat.xsd HTTP/1.1
Host: bigr-rad-xnat.erasmusmc.nl
Connection: keep-alive
Accept: */*
Accept-Encoding: gzip, deflate
Authorization: Basic SCRUB=
User-Agent: python-requests/2.9.1

The requests response header:

[binary packet bytes]
HTTP/1.1 200 OK
Date: Tue, 05 Apr 2016 14:36:17 GMT
Set-Cookie: JSESSIONID=SCRUB; Path=/
Set-Cookie: SESSION_EXPIRATION_TIME="1459866977197,900000"; Version=1; Path=/
Accept-Ranges: bytes
ETag: W/"154153-1452780004000"
Last-Modified: Thu, 14 Jan 2016 14:00:04 GMT
Content-Type: text/xml
Vary: Accept-Encoding
Content-Encoding: gzip
Connection: close
Transfer-Encoding: chunked

3893
[big mess of binary from here on]

Lukasa commented 8 years ago

The problem here seems to be the server. The big difference is that requests sends Accept-Encoding: gzip, deflate, whereas curl does not. That causes the server in this case to dramatically change the response it sends: rather than using a Content-Length framed response, it uses a Transfer-Encoding: chunked header and a gzip-encoded body.

This means that almost certainly the server is getting this wrong.
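
To make the two framings concrete: with Content-Length the body is the document itself, while with Transfer-Encoding: chunked each chunk is preceded by its size in hexadecimal (the 3893 in the capture above would announce a first chunk of 14483 bytes), and Content-Encoding: gzip means the reassembled bytes still have to be decompressed. Below is a minimal standard-library sketch of that decoding, which ignores chunk extensions and trailers. `encode_chunked` and `decode_chunked` are hypothetical helpers written for this illustration, not requests APIs, and the tiny XML document is made up.

```python
import gzip


def encode_chunked(payload, chunk_size=8):
    """Frame payload as an HTTP/1.1 chunked body (used only for the demo)."""
    out = b""
    for i in range(0, len(payload), chunk_size):
        chunk = payload[i:i + chunk_size]
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    return out + b"0\r\n\r\n"  # a zero-length chunk terminates the body


def decode_chunked(raw):
    """Reassemble a chunked body; chunk extensions and trailers are ignored."""
    body, pos = b"", 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol].split(b";")[0], 16)  # chunk size is hexadecimal
        if size == 0:
            return body
        start = eol + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


# Made-up stand-in for the real document.
document = b'<?xml version="1.0" encoding="UTF-8"?><schema/>'
wire = encode_chunked(gzip.compress(document))
decoded = gzip.decompress(decode_chunked(wire))
```

Either framing should yield the same bytes once decoded, so `decoded` equals `document` here.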

You should be able to avoid it by changing your requests call to add headers={'Accept-Encoding': None}. Of course, I recommend you contact the web server operator to get them to fix their behaviour.
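
A sketch of that workaround follows, assuming requests is installed. It prepares the request through a Session without touching the network, so the headers that would be sent can be inspected first. The credentials are placeholders.

```python
import requests

URL = "http://bigr-rad-xnat.erasmusmc.nl/schemas/xnat/xnat.xsd"

# Prepare the request through a Session so the default headers are merged in,
# the same way requests.get does, but without sending anything yet.
session = requests.Session()
request = requests.Request(
    "GET",
    URL,
    auth=("username", "password"),      # placeholder credentials
    headers={"Accept-Encoding": None},  # None drops the session default
)
prepared = session.prepare_request(request)
sent_headers = dict(prepared.headers)

# To actually send it:
#   resp = session.send(prepared)
# or, in one call:
#   resp = requests.get(URL, auth=("username", "password"),
#                       headers={"Accept-Encoding": None})
```

After this, `sent_headers` has no Accept-Encoding entry. The underlying HTTP library may still add Accept-Encoding: identity on the wire, which also asks the server not to compress.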