psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

Truncated HTTP request #5887

Closed dwt closed 3 years ago

dwt commented 3 years ago

When requesting a development website locally, I get different results when using cURL vs httpie/requests. Namely, requests gives me a shorter result. Further debugging seems to indicate that the gzip-encoded response may be to blame.

Expected Result

I'd like the full server response, just as with curl.

Actual Result

The output is shortened.

Reproduction Steps

>>> import requests
>>> from pprint import pprint
>>> r = requests.get('http://localhost:8000/Forschung/Forschungsprofil.html?c=14')
>>> pprint(dict(r.headers))
{'Cache-Control': 'public',
 'Connection': 'keep-alive',
 'Content-Encoding': 'gzip',
 'Content-Generator': 'EGOTEC ECM',
 'Content-Security-Policy': "default-src 'self' "
                            'https://www.uniklinikum-jena.de '
                            'https://www.youtube.com https://dr-flex.de '
                            'https://www.yumpu.com https://www.google.com '
                            'https://vimeo.com https://player.vimeo.com; '
                            "img-src 'self' https://www.uniklinikum-jena.de "
                            "www.krz.uni-jena.de; script-src 'self' "
                            "'unsafe-inline' 'unsafe-eval' "
                            'https://www.uniklinikum-jena.de '
                            "www.krz.uni-jena.de; style-src 'self' "
                            "'unsafe-inline'; font-src 'self'; object-src "
                            "'none'",
 'Content-Type': 'text/html; charset=utf-8',
 'Date': 'Tue, 27 Jul 2021 07:35:23 GMT',
 'ETag': 'W/"medweb2016/de/folder.c1e7bb70509f15aa29d5402206a12138.0e9304b1aef96489c56d827c1451d3db.9bb97a957c625ae7dabe79aa95c06302.1627369626"',
 'Expect-CT': 'enforce, max-age=36000',
 'Expires': 'Tue, 27 Jul 2021 08:35:23 GMT',
 'Last-Modified': 'Mon, 18 May 2020 08:44:34 GMT',
 'Pragma': 'no-cache',
 'Server': 'nginx/1.19.7',
 'Set-Cookie': 'EGOTEC=be0s8uitorpvhgvcchidiosp62; path=/; '
               'domain=uniklinikum-jena.de; HttpOnly',
 'Transfer-Encoding': 'chunked',
 'X-Cache-Status': 'MISS',
 'X-Content-Type-Options': 'nosniff',
 'X-Frame-Options': 'sameorigin',
 'X-XSS-Protection': '1; mode=block'}
>>> r.text[-200:]
"449');\r\n\r\n\t}\r\n\t, mouseleave : function() {\r\n\t\tanode.css('background', initcolor);\r\n\t\tinode.css('background', initarrowcolor);\r\n\t}\r\n\t}\r\n\t, '.subnavigation a, .subnavigation i');\r\n\r\n</script>\n\t\n</html>\n"

It seems to work correctly when I send an empty 'Accept-Encoding' header (or remove it from the request entirely).

>>> r = requests.get('http://localhost:8000/Forschung/Forschungsprofil.html?c=14', headers={'Accept-Encoding':''})
>>> pprint(dict(r.headers))
{'Cache-Control': 'public',
 'Connection': 'keep-alive',
 'Content-Generator': 'EGOTEC ECM',
 'Content-Security-Policy': "default-src 'self' "
                            'https://www.uniklinikum-jena.de '
                            'https://www.youtube.com https://dr-flex.de '
                            'https://www.yumpu.com https://www.google.com '
                            'https://vimeo.com https://player.vimeo.com; '
                            "img-src 'self' https://www.uniklinikum-jena.de "
                            "www.krz.uni-jena.de; script-src 'self' "
                            "'unsafe-inline' 'unsafe-eval' "
                            'https://www.uniklinikum-jena.de '
                            "www.krz.uni-jena.de; style-src 'self' "
                            "'unsafe-inline'; font-src 'self'; object-src "
                            "'none'",
 'Content-Type': 'text/html; charset=utf-8',
 'Date': 'Tue, 27 Jul 2021 07:47:15 GMT',
 'ETag': 'W/"medweb2016/de/folder.c1e7bb70509f15aa29d5402206a12138.0e9304b1aef96489c56d827c1451d3db.9bb97a957c625ae7dabe79aa95c06302.1627369626"',
 'Expect-CT': 'enforce, max-age=36000',
 'Expires': 'Tue, 27 Jul 2021 08:35:23 GMT',
 'Last-Modified': 'Mon, 18 May 2020 08:44:34 GMT',
 'Pragma': 'no-cache',
 'Server': 'nginx/1.19.7',
 'Set-Cookie': 'EGOTEC=pevpss1qf4de0uvd064gippn43; path=/; '
               'domain=uniklinikum-jena.de; HttpOnly',
 'Transfer-Encoding': 'chunked',
 'X-Cache-Status': 'MISS',
 'X-Content-Type-Options': 'nosniff',
 'X-Frame-Options': 'sameorigin',
 'X-XSS-Protection': '1; mode=block',
 'X-cached': '1'}
>>> r.text[-200:]
'or it\'s own uses in client pages.\n    window._proxy_jquery = window.$.noConflict()\n</script>\n<script src="/proxy_static/added.js"></script>\n\n<style>\n    @import url(/proxy_static/added.css);\n</style>\n'

What I'm doing server-side is appending content to responses using Nginx's add_after_body directive. I'm not entirely sure what the problem is or how to debug it further; here's the Nginx config I've used to track this down so far:

# daemon off;

include conf.d/error_logging.include;
include conf.d/workers.include;

http {
    include conf.d/access_logging.include;
    include conf.d/mime_types.include;

    upstream auth {
        server auth:8003;
    }

    upstream zms {
        server zms:8080;
    }

    server {
        listen 8000;
        server_name localhost;

        # include conf.d/proxy.include;

        # error_page 401 = @error401;
        location @error401 {
            # Let zms handle all authentication errors
            # TODO consider to redirect to a pre-determined url?
            proxy_pass http://zms;
        }

        # error_page 404 = @fallback;
        # location @fallback {
        #     proxy_pass http://zms;
        # }

        # include conf.d/auth_locations.include;

        location /proxy_static {
            alias /srv/www/static;
            add_after_body ""; # disable add after body to these locations
        }
        add_after_body /proxy_static/custom_css_and_js.html;

        location / {
            # include conf.d/proxy.include;
            # include conf.d/auth_authenticate.include;
            proxy_pass http://zms;
        }

        # If needed, additional namespaces can be added here that also get authenticated via auth_request
        # zms is special in that it handles authentication failures for all services, so no unauthenticated request
        # ever gets through to any third party service added here.

        resolver 8.8.8.8 ipv6=off;

        location /Forschung {
            # check with something like this
            # while true;  do http http://localhost:8000/Forschung/Forschungsprofil.html\?cache_breaker=$(date +%s) | tail; sleep 1; done
            # while true;  do http http://localhost:8000/Forschung/Forschungsprofil_missing.html\?cache_breaker=$(date +%s) | tail; sleep 1; done

            # Not sure why, but this doesn't seem to work with external proxy
            # add_after_body /proxy_static/custom_css_and_js.html;

            # serve static files to ease debugging
            # alias /srv/www/static/Forschung;
            # mitmproxy - to help debug what is happening in proxied requests
            # proxy_pass http://host.docker.internal:8001;
            # proxy_pass https://www.uniklinikum-jena.de;
            proxy_pass http://localhost:8010;

            # Fixes the problem that the upstream server doesn't resolve the url/host correctly
            # proxy_ssl_server_name on;
            # this also fixed the resolution issue - but I think proxy_ssl_server_name is the far better solution
            # proxy_set_header Host www.uniklinikum-jena.de;

            # proxy_redirect https://www.uniklinikum-jena.de/Forschung /Forschung/;
            add_after_body /debugging/custom_css_and_js2.html;
        }

        location /debugging/ {
            proxy_pass http://host.docker.internal:8100;
        }

        # sub_filter '</html>' 'fnord</html>';
        # sub_filter_once off;

    }

    server {
        listen 8010;
        server_name localhost;
        # include conf.d/proxy.include;

        location / {
            proxy_pass https://www.uniklinikum-jena.de;
            proxy_ssl_server_name on;
            add_after_body /proxy_static/custom_css_and_js.html;
        }
        add_after_body /proxy_static/custom_css_and_js.html;
    }
}

Sorry, not really minimised…

System Information

$ python -m requests.help
{
  "chardet": {
    "version": null
  },
  "charset_normalizer": {
    "version": "2.0.3"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "3.2"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.6"
  },
  "platform": {
    "release": "20.6.0",
    "system": "Darwin"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.26.0"
  },
  "system_ssl": {
    "version": "101010bf"
  },
  "urllib3": {
    "version": "1.26.6"
  },
  "using_charset_normalizer": true,
  "using_pyopenssl": false
}

macOS 11.5 (20G71), Python 2.7 and 3.9 (same result)

% curl --version
curl 7.64.1 (x86_64-apple-darwin20.0) libcurl/7.64.1 (SecureTransport) LibreSSL/2.8.3 zlib/1.2.11 nghttp2/1.41.0
Release-Date: 2019-03-27
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS GSS-API HTTP2 HTTPS-proxy IPv6 Kerberos Largefile libz MultiSSL NTLM NTLM_WB SPNEGO SSL UnixSockets

% nginx -v
nginx version: nginx/1.19.7
dwt commented 3 years ago

Looking into this more, it's starting to make sense what is going on:

% curl -vH "Accept-Encoding: gzip"  http://localhost:8000/Forschung/Forschungsprofil.html\?cache_breaker=$(date +%s) --output /dev/stdout| tail
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8000 (#0)
> GET /Forschung/Forschungsprofil.html?cache_breaker=1627372645 HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.64.1
> Accept: */*
> Accept-Encoding: gzip
> 
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0< HTTP/1.1 200 OK
< Server: nginx/1.19.7
< Date: Tue, 27 Jul 2021 07:57:26 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: EGOTEC=7dpipsc9o7hfimool3friup4q1; path=/; domain=uniklinikum-jena.de; HttpOnly
< Pragma: no-cache
< Content-Generator: EGOTEC ECM
< ETag: W/"medweb2016/de/folder.c1e7bb70509f15aa29d5402206a12138.0e9304b1aef96489c56d827c1451d3db.693071056e9af8f2e740022f26561fb9.1627369626"
< Expires: Tue, 27 Jul 2021 08:57:27 GMT
< Cache-Control: public
< Last-Modified: Mon, 18 May 2020 08:44:34 GMT
< Content-Encoding: gzip
< X-Frame-Options: sameorigin
< X-XSS-Protection: 1; mode=block
< X-Content-Type-Options: nosniff
< Expect-CT: enforce, max-age=36000
< Content-Security-Policy: default-src 'self' https://www.uniklinikum-jena.de https://www.youtube.com https://dr-flex.de https://www.yumpu.com https://www.google.com https://vimeo.com https://player.vimeo.com; img-src 'self' https://www.uniklinikum-jena.de www.krz.uni-jena.de; script-src 'self' 'unsafe-inline' 'unsafe-eval' https://www.uniklinikum-jena.de www.krz.uni-jena.de; style-src 'self' 'unsafe-inline'; font-src 'self'; object-src 'none'
< X-Cache-Status: MISS
< 
{ [7047 bytes data]
100 18807    0 18807    0     0   4750      0 --:--:--  0:00:03 --:--:--  4749
* Connection #0 to host localhost left intact
* Closing connection 0
M?`/i??5Vݓ\???  ?U%sT;????%?????$+??x??????b6ZU?Z?h|??d',B??J?B?$9S?R$*22?ђ???5F?QI?r?FR??"މ[?Gt??Pݥ??D AP?ˣ?#?HC[y?-''jx???
                                                                                                                            $*
                                                                                                                              ?I7'I???3B ???14$%?=9rA"<ET?x?z1-?IFC9XR???A???E?S?%=v?????ewb$?˸??y Rpk6{?5?T??CAQ?f???.B??\U)? ?—V??c?O?M??>\??<Î??!?PL??"?{?r?c????m?¢??9d?FO????IW?[??Gq??NmoL??X)5>???   <?'??x?x%c
                                                                                                     4(????"?|?]x????:??<script src="/proxy_static/jquery-3.6.0.slim.min.js"></script>
<script>
    // this ensures there is no colision with anything that uses the $ variable for it's own uses in client pages.
    window._proxy_jquery = window.$.noConflict()
</script>
<script src="/proxy_static/added.js"></script>

<style>
    @import url(/proxy_static/added.css);
</style>

As far as I can tell, my nginx configuration is appending uncompressed content verbatim to a gzipped response body, and when decoding, requests discards the un-gzipped part.
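
To illustrate the mechanism, here is a minimal standard-library sketch with a made-up payload (not requests' actual decoder): a gzip stream with plain bytes appended decodes to only the gzip member, and the trailing bytes are left over.

import gzip
import zlib

# Hypothetical body: a gzip-compressed page with an uncompressed snippet
# appended afterwards, mimicking what add_after_body produced here.
body = gzip.compress(b"<html>original page</html>\n") + b"<script>appended snippet</script>\n"

# Decode it the way a gzip-aware client does: the decoder stops at the end
# of the gzip stream, and whatever follows ends up in unused_data.
decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
print(decoder.decompress(body))  # b'<html>original page</html>\n'
print(decoder.unused_data)       # b'<script>appended snippet</script>\n' (the part that gets dropped)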

Does that make sense? Is that what is expected of a request handler?

dwt commented 3 years ago

Found a workaround for me: adding

http {
  gzip off;
  gunzip on;
  …
}

to my nginx config forced it to (correctly?) decompress upstream proxied content before concatenating it with uncompressed local content in add_after_body.

Though I'm still not sure whether there is some kind of bug in requests here?

At least, I'd like to suggest that it would have been really helpful to get some kind of warning that there was discarded content after the gzipped response; that would have saved me (and, I'm sure, countless other people) days of debugging.

dwt commented 3 years ago

See also: https://trac.nginx.org/nginx/ticket/2222#ticket

sigmavirus24 commented 3 years ago

@dwt Great find! That looks like it was hard and frustrating to find.

Yes, what you've discovered is exactly how we deal with junk. This is not a bug in requests. There's no way for us to know what to do with data that cannot be decompressed. It could be absolute junk, so what we do with it is fine. We could also throw an exception outright, but that wouldn't be as user-friendly, as users would then need to figure out how to extract potentially useful data from the response.

We already get harassed on social media when we add warnings that users ask for. Adding one here, I think, might be far noisier and not remotely as useful to the user as you think. Most people aren't using requests against their own server, but instead against someone else's over which they have no control or influence. Further, the raw bytes can be downloaded, and this escape hatch is documented, so if someone did see odd behaviour, the docs call out being able to use:

import requests

r = requests.get(url, stream=True)  # stream=True leaves the body undecoded
raw_bytes = r.raw.read()            # the bytes exactly as received from the server

(Or something along those lines)
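
For example, under the assumption that the development proxy is still reachable at the URL from the report, the leftover bytes could be inspected manually (a sketch, not the library's own decoding path):

import zlib
import requests

# stream=True plus r.raw.read() yields the body undecoded, still gzip-compressed.
r = requests.get('http://localhost:8000/Forschung/Forschungsprofil.html?c=14', stream=True)
raw = r.raw.read()

decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
page = decoder.decompress(raw)
print(len(page), 'decoded bytes')
print(len(decoder.unused_data), 'trailing bytes after the gzip stream')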

To be clear, I think the configuration options you have in Nginx are a footgun that Nginx needs to fix. It's too easy to send invalid response bodies that way. I've never seen anyone else encounter this on Stack Overflow or our issue tracker in the 8? years I've worked on this project. If there are countless other persons who have had to debug this, they've been incredibly quiet.

dwt commented 3 years ago

Regarding a warning: I was using requests through httpie, and of course tried the verbose switch to better understand what was happening. What would have really helped me would have been a warning that was visible in verbose mode.

Same when using requests directly. It would have been immensely helpful if there were a way to get a warning / log event, or just any indication that this has happened, when asking for it.

Some things I could imagine:

* An attribute on the response that lists 'anomalies' (and is perhaps printed with repr()) that allows one to find out stuff like this - not essential in the usual use cases, but something that one has a chance to discover, especially interactively.
* A `logging.debug()` or `logging.info()` level warning / log event.

I can understand that people don't like their apps to throw lots of warnings when trying to get junk from junk servers on the junk internet - but a way to look for problems and then find out what is happening is something I would think should be really useful?

What do you think?

dwt commented 3 years ago

Just for reference, it turns out that

gzip off;
gunzip on;

didn't actually work around the issue. I must've been too starstruck to notice. :-(

What actually did it was to stop requesting compressed pages from upstream with

proxy_set_header Accept-Encoding "";
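
(A quick way to confirm the fix from the requests side might be a check like the following; the URL and the added.js marker are taken from the earlier output, so this is just a sketch.)

import requests

r = requests.get('http://localhost:8000/Forschung/Forschungsprofil.html?c=14')
# The injected snippet references /proxy_static/added.js, so it should now
# survive decoding and appear at the end of the decoded body.
assert '/proxy_static/added.js' in r.text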

sigmavirus24 commented 3 years ago

> An attribute on the response that lists 'anomalies' (and is perhaps printed with repr()) that allows one to find out stuff like this - not essential in the usual use cases, but something that one has a chance to discover, especially interactively.

We don't actually control the decompression ourselves, so there's no way for us to present that to you.

> A `logging.debug()` or `logging.info()` level warning / log event.

Someone tore out all of our logging when they released 2.0 and it hasn't been added back. Plus, "useful logging" is hard for a library as widely used as ours to get right for everyone that uses it. When we had more log messages, we got tonnes of issues saying we were too noisy (and I think we had approximately 10 calls to a logger), because we end up in the critical, repeatedly called path of a lot of things.

> a way to look for problems and then find out what is happening is something I would think should be really useful?

That would be useful. Unfortunately, we sit on a lot of other stacks (also known as the shoulders of giants): urllib3, which uses Python's httplib (whose state machine just drops entire responses sometimes - see also 100 Continue), on top of Python's socket libraries (which bind to the OS), which ignore all of the OpenSSL layers when using HTTPS. We just don't have access to the things that would be most useful in a situation like this one, sadly. There are some ways to reconstruct some of this smartly - https://toolbelt.readthedocs.io/en/latest/dumputils.html - but they're only so good and don't help in this case.
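
For reference, the dump utilities linked above are used roughly like this (a sketch; dump_all works from the already-decoded response, which is why it would not have surfaced the discarded bytes in this particular case):

import requests
from requests_toolbelt.utils import dump  # pip install requests-toolbelt

r = requests.get('http://localhost:8000/Forschung/Forschungsprofil.html?c=14')
# Reconstruct the request/response exchange (headers and bodies) from what
# requests still has access to after decoding, and print it.
print(dump.dump_all(r).decode('utf-8', errors='replace'))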