Closed dwt closed 3 years ago
Looking into this more it starts to make sense what is going on:
% curl -vH "Accept-Encoding: gzip" http://localhost:8000/Forschung/Forschungsprofil.html\?cache_breaker=$(date +%s) --output /dev/stdout| tail
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8000 (#0)
> GET /Forschung/Forschungsprofil.html?cache_breaker=1627372645 HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.64.1
> Accept: */*
> Accept-Encoding: gzip
>
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
< HTTP/1.1 200 OK
< Server: nginx/1.19.7
< Date: Tue, 27 Jul 2021 07:57:26 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: EGOTEC=7dpipsc9o7hfimool3friup4q1; path=/; domain=uniklinikum-jena.de; HttpOnly
< Pragma: no-cache
< Content-Generator: EGOTEC ECM
< ETag: W/"medweb2016/de/folder.c1e7bb70509f15aa29d5402206a12138.0e9304b1aef96489c56d827c1451d3db.693071056e9af8f2e740022f26561fb9.1627369626"
< Expires: Tue, 27 Jul 2021 08:57:27 GMT
< Cache-Control: public
< Last-Modified: Mon, 18 May 2020 08:44:34 GMT
< Content-Encoding: gzip
< X-Frame-Options: sameorigin
< X-XSS-Protection: 1; mode=block
< X-Content-Type-Options: nosniff
< Expect-CT: enforce, max-age=36000
< Content-Security-Policy: default-src 'self' https://www.uniklinikum-jena.de https://www.youtube.com https://dr-flex.de https://www.yumpu.com https://www.google.com https://vimeo.com https://player.vimeo.com; img-src 'self' https://www.uniklinikum-jena.de www.krz.uni-jena.de; script-src 'self' 'unsafe-inline' 'unsafe-eval' https://www.uniklinikum-jena.de www.krz.uni-jena.de; style-src 'self' 'unsafe-inline'; font-src 'self'; object-src 'none'
< X-Cache-Status: MISS
<
{ [7047 bytes data]
100 18807 0 18807 0 0 4750 0 --:--:-- 0:00:03 --:--:-- 4749
* Connection #0 to host localhost left intact
* Closing connection 0
M?`/i??5Vݓ\??? ?U%sT;????%?????$+??x??????b6ZU?Z?h|??d',B??J?B?$9S?R$*22?ђ???5F?QI?r?FR??"މ[?Gt??Pݥ??D AP?ˣ?#?HC[y?-''jx???
$*
?I7'I???3B ???14$%?=9rA"<ET?x?z1-?IFC9XR???A???E?S?%=v?????ewb$?˸??y Rpk6{?5?T??CAQ?f???.B??\U)? ?V??c?O?M??>\??<Î??!?PL??"?{?r?c????m?¢??9d?FO????IW?[??Gq??NmoL??X)5>??? <?'??x?x%c
4(????"?|?]x????:??<script src="/proxy_static/jquery-3.6.0.slim.min.js"></script>
<script>
// This ensures there is no collision with anything that uses the $ variable for its own purposes in client pages.
window._proxy_jquery = window.$.noConflict()
</script>
<script src="/proxy_static/added.js"></script>
<style>
@import url(/proxy_static/added.css);
</style>
As far as I can tell, my nginx configuration is appending an uncompressed response verbatim to a gzipped one - and when decoding, requests discards the un-gzipped part.
Does that make sense? Is that what is expected of a request handler?
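The discard can be reproduced locally with Python's `zlib` module, which is roughly what sits underneath requests' gzip handling (a minimal sketch; the payloads here are made up to mimic what the misconfigured nginx produced):

```python
import gzip
import zlib

# Simulate the misconfigured server: a complete gzip member followed by
# verbatim uncompressed bytes appended by add_after_body.
body = gzip.compress(b"<html>upstream page</html>") \
    + b'<script src="/proxy_static/added.js"></script>'

# A streaming gzip decoder stops at the end of the gzip member; anything
# after it lands in unused_data, which is silently dropped downstream.
d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 = expect a gzip header
decoded = d.decompress(body)
print(decoded)        # b'<html>upstream page</html>'
print(d.unused_data)  # b'<script src="/proxy_static/added.js"></script>'
```

This matches the symptom above: the gzip part decodes cleanly, and the concatenated plaintext is treated as trailing junk.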
Found a workaround for me: adding

http {
    gzip off;
    gunzip on;
    …
}

to my nginx config forced it to (correctly?) decompress upstream proxied content before concatenating it with uncompressed local content in add_after_body.
Though I'm still not sure whether there is some kind of bug in requests?
At least, I'd like to suggest that it would have been really helpful to get some kind of warning that there was discarded content after the gzipped response - that would have saved me (and I'm sure countless other people) days of debugging.
@dwt Great find! That looks like it was hard and frustrating to find.
Yes, what you've discovered is exactly how we deal with junk. This is not a bug in requests. There's no way for us to know what to do with data that cannot be decompressed. It could be absolute junk, so what we do with it is fine. We could also throw an exception flat out, but that wouldn't be as user-friendly, as users would then need to figure out how to extract potentially useful data from the response.
We already get harassed on social media when we add warnings that users ask for. Adding one here I think might be far noisier, and not remotely as useful to the user as you think. Most people aren't using requests against their own server but instead against someone else's over which they have no control or influence. Further, the raw bytes can be downloaded and this escape hatch is documented so if someone did see odd behaviour the docs call out being able to use
r = requests.get(url, stream=True)
raw_bytes = r.raw.read()
(Or something along those lines)
To be clear, I think the configuration options you have in Nginx are a footgun that Nginx needs to fix. It's too easy to send invalid response bodies that way. I've never seen anyone else encounter this on StackOverflow or our issue tracker in the 8? years I've worked on this project. If there are countless other persons that have had to debug this, they've been incredibly quiet.
Regarding a warning: I was using requests through httpie, and of course tried the verbose switch to understand better what was happening. What would have really helped me would have been a warning that was visible in verbose mode.
Same when using requests directly. It would have been immensely helpful if there were a way to get a warning / log event or just any indication that this has happened when asking for it.
Some things I could imagine:

* A `logging.debug()` or `logging.info()` level warning / log event, so there is a chance to find this, but also nothing that needs to show by default. Ideally, this would be enabled by default when using `http -v` (httpie).
* An attribute on the response that lists 'anomalies' (and is perhaps printed with `repr()`) that allows one to find out stuff like this - not essential in the usual use cases, but something that one has a chance to discover, especially interactively.

I can understand that people don't like their apps to throw lots of warnings when trying to get junk from junk servers on the junk internet - but a way to look for problems and then find out what is happening is something I would think should be really useful?
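The opt-in check suggested above could be sketched outside of requests like this (a hypothetical helper built on `zlib`'s `unused_data`, not an existing API in requests or httpie):

```python
import warnings
import zlib

def decode_gzip_reporting_trailer(raw: bytes) -> bytes:
    """Decompress a gzip body, warning if trailing bytes would be dropped."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 = gzip framing
    out = d.decompress(raw)
    if d.unused_data:
        # This is the indication asked for above: surface the discard
        # instead of silently throwing the trailing bytes away.
        warnings.warn(
            f"{len(d.unused_data)} bytes of undecodable data after the gzip stream"
        )
    return out
```

A library-level version would presumably sit where the response body is decoded, but the shape of the check would be the same.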
What do you think?
Just for reference, turns out that
gzip off;
gunzip on;
didn't actually work around the issue. I must've been star struck not to notice. :-(
What actually did it was disabling requesting compressed pages from upstream with
proxy_set_header Accept-Encoding "";
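In context, that directive belongs in the proxy location block, so the upstream never sends a compressed body for `add_after_body` to concatenate onto. A minimal sketch (the upstream address and footer path are hypothetical):

```nginx
location / {
    # Ask the upstream for an uncompressed body so that
    # add_after_body concatenates like with like.
    proxy_set_header Accept-Encoding "";
    proxy_pass http://upstream-backend;
    add_after_body /proxy_static/footer.html;
}
```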
> An attribute on the response that lists 'anomalies' (and is perhaps printed with `repr()`) that allows one to find out stuff like this - not essential in the usual use cases, but something that one has a chance to discover, especially interactively.
We don't actually control the decompression ourselves, so there's no way for us to present that to you.
> * A `logging.debug()` or `logging.info()` level warning / log event
Someone tore out all of our logging when they released 2.0 and it hasn't been added back. Plus "useful logging" is hard for a library as widely used as ours to get right for everyone that uses it. When we had more log messages we got tonnes of complaints that we were too noisy (and I think we had approximately 10 calls to a logger), because we end up in the critical, repeatedly called path of a lot of things.
> a way to look for problems and then find out what is happening is something I would think should be really useful?
That would be useful. Unfortunately we sit on a lot of other stacks (also known as the shoulders of giants): urllib3, which uses Python's httplib (whose state machine just drops entire responses sometimes - see also 100 Continue), on top of Python's socket libraries (which bind to the OS), which ignore all of the OpenSSL layers when using HTTPS. We just don't have access to the things that would be most useful in a situation like this one, sadly. There are some ways to reconstruct some of this smartly - https://toolbelt.readthedocs.io/en/latest/dumputils.html - but they're only so good and don't help in this case.
When requesting a development website locally, I get different results when using cURL vs httpie/requests. Namely, requests gives me a shorter result. Further debugging seems to indicate that the gzip-encoded response may be to blame.
Expected Result
I'd like the full server response - just as with curl.
Actual Result
The output is shortened.
Reproduction Steps
It seems to work correctly when I give an empty / remove the 'Accept-Encoding' header from the request.
What I'm doing server-side is adding stuff to responses using Nginx's add_after_body directive. I'm not entirely sure what the problem is and how to further debug this; here's the Nginx config I've used to track this down so far. Sorry, not really minimised…
System Information
MacOS 11.5 (20G71), Python 2.7 and 3.9 (same result)