psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

should a malformed non-gzip yet marked gzip response be handled in requests? #3840

Closed: jvanasco closed this issue 7 years ago

jvanasco commented 7 years ago

While working on a PR, I discovered that a decently sized CDN will usually block the requests library via its User-Agent, but does so with a malformed response that raises an error. (Hitting the CDN from an IP address with a non-blocked User-Agent string appears to whitelist that IP for 120 seconds, which is why this took forever to figure out.)

Their 403 response is malformed, as the header indicates a gzipped encoding

{'Content-Length': '345', 'Content-Encoding': 'gzip', 'Content-Type': '*/*'}

however the payload is uncompressed plain-text:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>403 - Forbidden</title>
    </head>
    <body>
        <h1>403 - Forbidden</h1>
    </body>
</html>

This raises a DecodeError/ContentDecodingError in Response.iter_content.

I can provide a test case. I just don't want to name the CDN or a client as some project maintainers have (un)official policies on stuff like that.

This is definitely a "bad server". Aside from sending the malformed response, they don't respect 'Accept-Encoding' either.

With this particular CDN, the payload is not chunked and reading the stream with decode_content=False will work.

I'm not sure how or if this should be handled. It might be nice to have a fallback where a failure to decode the compressed data triggers an attempt to read it uncompressed (something like the sketch below).
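
For illustration, here is a minimal sketch of that workaround. The URL is just a placeholder for the misbehaving endpoint, and the fallback logic is only one possible approach:

import requests

# placeholder URL standing in for the misbehaving CDN endpoint
url = "http://example.com/blocked-resource"

# stream=True defers body handling, so nothing is decoded at request time
r = requests.get(url, stream=True)

# r.raw is the underlying urllib3 response; decode_content=False skips the
# gzip decoding that the bogus Content-Encoding header would otherwise trigger
body = r.raw.read(decode_content=False)
print(r.status_code, body[:80])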

Lukasa commented 7 years ago

I don't think this should be handled directly. This is strictly an error and the CDN needs to sort out their mess. I have no problems naming names on the Requests issue tracker, but if you'd rather not you should reach out to me directly via email and let me know who it is: the odds are good that either I know someone who works there or that I know someone who knows someone. Bonus points for a repro script!

JordanP commented 7 years ago

Yeah, stacking workarounds on bad server behaviors doesn't seem like a good approach. The maintenance burden will bite eventually.

jvanasco commented 7 years ago

This appears to affect several domains on Verizon's "Enterprise Content Delivery" network -- which was EdgeCast before the acquisition. Their servers identify themselves in the response headers as ecd ({release_id}), and they are spread across a decently sized IP block.

I was testing a PR candidate for redirects against the first short link in my Twitter feed, and this issue came up. I dug in and found several domains this happens on, but a good example is the first one I found -- The Huffington Post.

Here's a streamlined reproduction that should work:

import requests

# this will raise a ContentDecodingError
url = "http://www.huffingtonpost.com/entry/senate-democrats-boycott_us_5890ae26e4b0c90eff001c3c?v35c9pbty5f5vobt9"
r = requests.get(url)
print(r)
print(r.content)

There is a lot of extended behavior going on. I currently think the bit of the querystring I left intact above is some sort of auth payload, and that I wasn't dealing with IP whitelisting but with cache expiration. It appears that if you can load a URL into their CDN cache (which depends on the User-Agent and/or the URL's querystring; "base" URLs without a querystring always pass), most representations of that URL then become available to all clients in the region for 2 minutes. I wrote about 30 lines of text and some code illustrating that, but I'll save you the boredom ;)

Lukasa commented 7 years ago

Yeah, so this is a bit weird: I can't reproduce that here. Presumably this is somewhat specific to certain caches. Want to try using a few different source address blocks to see if it is consistently reproducible?

JordanP commented 7 years ago

FWIW, the reproducer worked for me; I got a requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing: incorrect header check',))

jvanasco commented 7 years ago

Can you two share your global regions? I tested this from 3 servers in the USA: NYC, New Jersey and Los Angeles.

Lukasa commented 7 years ago

I'm London based, so almost certainly using different DCs.

JordanP commented 7 years ago

I am in Paris. From work (also in Paris, but with a different network provider) the above code snippet raised an exception. From home, it returned a properly formatted 403 error...

L2501 commented 7 years ago

This seems to be a popular method of blocking non-browsers:

requests.get('http://www.bcast.site/stream.php')

azotlikid commented 7 years ago

Hello, I get these errors with some URLs.

Examples:

 requests.get('http://superuser.com/robots.txt')
 requests.get('http://askubuntu.com/robots.txt')
 requests.get('http://stackoverflow.com/robots.txt')

but fetching with urllib3 alone doesn't fail:

>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET','http://superuser.com/robots.txt')
>>> r.status
200
>>> r.data
b"User-Agent: *\r\nDisallow:     ....

Lukasa commented 7 years ago

Ok folks, let's please try to restrict the scope of this issue. User-Agent spoofing is usually enough to solve this problem.
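
A minimal sketch of that approach (the User-Agent value here is only illustrative):

import requests

# any browser-like User-Agent string will do; this one is purely illustrative
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}

r = requests.get("http://superuser.com/robots.txt", headers=headers)
print(r.status_code)
print(r.text[:200])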

azotlikid commented 7 years ago

I'm sorry, but agent spoofing does not solve this problem. You can try any headers you want; it will lead to the same error. Also, urllib3 (which requests is built on) can fetch the data without any issue. Hence this is a priori a bug. If it is not, can you explain why, please?

Lukasa commented 7 years ago

@azotlikid Well, let's be clear: it's not a priori a bug, because it works fine for me using the current Requests build.

However, if your problem really is the same as the one discussed in this issue then there is a bug, but it is server side: it is sending invalid gzipped data. The server is at fault, and if the data is invalid then there is little scope for us to avoid barfing on it.

jvanasco commented 7 years ago

FWIW, @azotlikid's examples work fine for me now too. They did not work the other day from this same location. I had looked up the IP blocks at the time, and they were hosted on the Fastly CDN -- and still are. They may have corrected things on their system. I'll reach out to someone at StackExchange I know through a project to see if I can find out anything.

It would still be nice if a future branch had some sort of design provision so that bad servers could be more cleanly caught and handled by developers, or so that the errors could bubble up a bit better.

Having gone through the code recently to address the "gzip header when not really gzip" case, I think trying to handle this automatically would honestly create more problems than it solves -- so I don't think it would be worth attempting.

However... it would be swell if the raised exceptions could be more constructive for developers, allowing them to do something with the bad request.

Going back to my example, the exception raised is requests.exceptions.ContentDecodingError. IIRC, it bubbles up from this block: https://github.com/kennethreitz/requests/blob/master/requests/models.py#L715-L733

Perhaps these exceptions could be extended to include the Response object itself. That would allow a developer to examine the Response in question and then act upon the .stream attribute. Ultimately, that would let someone inspect the exception and determine how/why it is a "bad server" -- instead of using a hunt-and-peck method of guessing what might or might not work.

Lukasa commented 7 years ago

Perhaps these exceptions could be extended to include the self response object.

@jvanasco That shouldn't really be necessary. If the user sets stream=True, that exception will only fire when accessing the body content, at which point the response object will already be in their hands. It's not very hard to arrange a situation where any problems processing the body occur at a controlled and defined time.
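
Roughly, the pattern is something like this (placeholder URL, just a sketch):

import requests
from requests.exceptions import ContentDecodingError

# placeholder URL for a server that mislabels plain text as gzip
url = "http://example.com/bad-gzip"

r = requests.get(url, stream=True)   # headers only; the body is not decoded yet
try:
    body = r.content                 # decoding is attempted here and may raise
except ContentDecodingError:
    # the Response object is already in hand, so its status and headers can be
    # inspected to decide how to treat this particular "bad server"
    print(r.status_code, r.headers.get("Content-Encoding"))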

azotlikid commented 7 years ago

Thanks for the reply, @jvanasco. Inexplicably, this error hasn't happened again since yesterday... Obviously it was a bug in the space-time continuum... And I still don't understand why there was no problem with urllib3, but that will remain a mystery. @Lukasa the problem is that I use a library which uses requests, and I can't catch this exception without a global catch.

jvanasco commented 7 years ago

at which point the response object will already be in their hands.

@Lukasa you are 100% correct. I've been staring at the inner workings for too long. My apologies!

Lukasa commented 7 years ago

;) no need for apologies @jvanasco, this is why we work in groups: it's easy for any one of us to miss the wood for the trees.