I agree, we should pick one of those two options. =)
So commenting here as I should have done >6 hours ago:
If requests insists on doing chunked transfer encoding, it should disregard the content length and actually chunk the content (as it does when no Content-Length header is given).
You're implying that in this instance we didn't actually chunk the data. Do you have actual evidence of that? The output alone is not nearly enough to demonstrate that.
Furthermore, running your example above on 2.0.0 does not work: I get this traceback:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/icordasc/virtualenv/vcr/lib/python2.7/site-packages/requests/api.py", line 88, in post
    return request('post', url, data=data, **kwargs)
  File "/Users/icordasc/virtualenv/vcr/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/icordasc/virtualenv/vcr/lib/python2.7/site-packages/requests/sessions.py", line 357, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/icordasc/virtualenv/vcr/lib/python2.7/site-packages/requests/sessions.py", line 460, in send
    r = adapter.send(request, **kwargs)
  File "/Users/icordasc/virtualenv/vcr/lib/python2.7/site-packages/requests/adapters.py", line 319, in send
    timeout=timeout
  File "/Users/icordasc/virtualenv/vcr/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 541, in urlopen
    body=body, headers=headers)
  File "/Users/icordasc/virtualenv/vcr/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 366, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 955, in request
    self._send_request(method, url, body, headers)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 989, in _send_request
    self.endheaders(body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 815, in _send_output
    self.send(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 787, in send
    self.sock.sendall(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
TypeError: must be string or buffer, not generator
This isn't even an actual issue as far as I can tell, at least not on 2.0.0, which is the only currently supported version of requests around.
Did not expect "Transfer-Encoding: chunked" since I provided the Content-Length
What gave you the impression that this would be the case?
What gave me the impression was the fact that Requests did not chunk (even though it set the header) once I set Content-Length.
The example works with Python 3.3.
Trying it on Python 3.3, it does not blow up; on the other hand, it did chunk the content. The chunking is not platform dependent, so I am thoroughly confused by that.
POST / HTTP/1.1
Host: 127.0.0.1:8801
Content-Length: 6
User-Agent: python-requests/1.2.3 CPython/3.3.2 Darwin/12.5.0
Transfer-Encoding: chunked
Accept: */*
Accept-Encoding: gzip, deflate, compress
3
lol
3
man
0
Secondly, I'd still like to know what gave you the impression that setting the Content-Length header would prevent chunked encoding of data.
@sigmavirus24 You're not using Requests 2.0.0 on your Python 3.3 installation.
@ysangkok good point. :-) With 2.0 it's definitely broken. Regardless, my opinion is definitely in the court of us removing the Content-Length header.
You're probably right @sigmavirus24, we should probably strip the Content-Length header.
Hm, I'm not entirely convinced of my own argument anymore. I went looking for the old discussions surrounding users setting the Content-Length header themselves and I must be remembering old IRC conversations.
I'm still of the opinion that users should really not be setting those headers themselves and that we should remove them. However, I think this issue points out a far more important problem: the vastly different behaviour of requests on two different versions of Python.
On 2.7 (as I demonstrated) setting the Content-Length header does nothing to change how requests uploads the data. On 3.3, however, @ysangkok is correct that setting it sends everything as soon as it can (it still uses the generator but does not send it in an actually chunked manner).
One easy way to fix this is to remove the header when using a generator (or always, to provide consistent behaviour); the flaw is that this is backwards-incompatible behaviour.
The other easy way is to break the consistency of the API by not always using chunked transfer encoding with a generator. @Lukasa this definitely needs some deeper thought since I can't find the old conversations where users were admonished for setting that header themselves.
To be honest, though, I would never expect that setting a header would change the behaviour of using a generator.
This certainly is an extremely sticky situation.
Bleh. I'll ponder.
Also, this isn't a breaking API change that I would hesitate to make, because I have to wonder how many people are actually relying on this behaviour.
I've run into a situation where the length of a generator (for example, a streamed file from another server, or a really large file on disk) is known. For example:
import requests

response = requests.get('http://example.com/mybig.iso', stream=True)
length = response.headers.get('Content-Length')

def exhaust(response):
    while True:
        out = response.raw.read(1024 * 1024)
        if not out:
            break
        yield out

response = requests.post('http://example.com/upload',
                         data=exhaust(response),
                         headers={'Content-Length': length})
That might be a valid reason to keep Content-Length around. The only workaround right now (that I know of) is to exhaust the generator in memory and pass the bytes directly in.
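For comparison, the in-memory workaround mentioned above might look like the following sketch (reusing the exhaust generator and URL from the example; this is only viable when the body fits in RAM):

data = b''.join(exhaust(response))  # exhausts the generator in memory
# With bytes, requests measures the length itself: no chunking occurs.
response = requests.post('http://example.com/upload', data=data)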
@bryanhelmig does the server you're uploading to require you to send the Content-Length header?
@bryanhelmig did you see the comments in the linked pull request?
Anyway, I don't understand why Transfer-Encoding isn't just a flag. There's no need to delete any headers (or do any other kind of hand-holding); it was never a question of the Content-Length being right or wrong. The actual issue is that the presence of Content-Length semi-disables Transfer-Encoding, which makes no sense at all. But just making Requests ignore Content-Length does not solve the real problem, which is that Requests uses Transfer-Encoding when it feels like it (sounds like that is supposed to be when reading from a generator), even though many web servers don't even support it.
Ignoring Content-Length will confuse people who supply it. If you (@sigmavirus24) insist on not transmitting it, why not just throw an exception? As you said, this functionality is probably not used widely.
In the pull request, you said "The use case in the issue is just an example of the behaviour. There is no justification there for why you're doing what you're doing." I disagree: I think the original code in this issue is perfectly normal behaviour. In fact, I think that streaming POST data is a huge use case, and it's ridiculous if one is forced to use Transfer-Encoding or resort to lower-level libraries when streaming/using generators.
So to summarize: Transfer-Encoding should be a flag, illegal parameter combinations should provoke exceptions, and user-supplied flags should be sent when possible. And of course, it shouldn't be possible to semi-disable Transfer-Encoding.
Stop stop stop.
Everyone take a breather.
@ysangkok You can do streaming uploads without generators just fine. Provide Requests a file-like object in the data parameter and that will work. Yes, it's not as simple as using a generator, but that's ok because it's still not very hard.
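As a sketch of that file-like approach (the filename and URL are illustrative): a binary file object has a measurable length, so Requests streams it with a Content-Length header instead of chunking it.

import requests

with open('mybig.iso', 'rb') as f:
    # Requests can measure the file's length, so it streams the body
    # with Content-Length rather than Transfer-Encoding: chunked.
    response = requests.post('http://example.com/upload', data=f)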
In the meantime, Requests should not suggest that it is chunking data when it does not. We are all agreed on that. The question is what we should do in your specific case: namely, providing a generator and a Content-Length. You and @sigmavirus24 legitimately disagree on this issue, which is fine. However, can we all please acknowledge that both camps have rational reasons to expect their position?
@ysangkok You've said that "Ignoring Content-Length will confuse people who supply it." @sigmavirus24 contends that ignoring the very clear documentation when provided with a generator will confuse people who do that. You are both right.
(As a side note, the fact that many servers don't understand Transfer-Encoding is just a wild assertion based on no evidence that I've seen. Until any evidence is provided, I'm choosing to ignore it.)
One way or another we're going to have to pick what we do here. It's possible that the correct decision is to throw an exception when both a generator and Content-Length are provided. That's viable. It doesn't even make @bryanhelmig's case worse, because he should just be passing response.raw straight through rather than wrapping it in a decorator (for your benefit, Bryan).
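A minimal sketch of passing response.raw straight through, reusing the URLs from @bryanhelmig's example:

import requests

source = requests.get('http://example.com/mybig.iso', stream=True)
# response.raw is a file-like object that can be handed to data directly,
# with no wrapping generator required.
upload = requests.post('http://example.com/upload', data=source.raw)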
I'm naturally inclined to sit on the fence here and throw a YoureACrazyPerson exception, but I can see why both of you believe what you believe. In particular, making decisions based on user-supplied headers is lame and confusing, and we should try not to do that. The following, however, are hard lines:
Transfer-Encoding will not be a flag. Not now, not ever. Requests does not do tiny special case flags like this.

does the server you're uploading to require you to send the Content-Length header?
@sigmavirus24 it does. :-( 411 for attempts without a Content-Length. Quite annoying IMO.
did you see the comments in the linked pull request?
@ysangkok I did.
he should just be passing response.raw straight through rather than wrapping it in a decorator
@Lukasa That was my original edit actually, but I'm not sure what that nets us in this case besides a chance to link to the docs. Unless I am mistaken, a file object isn't guaranteed to give you a length; generators are just guaranteed to not give a length. I noticed in testing that small files didn't get chunked, while a generator forced it.
Thanks for all the thorough replies everyone. Really though, it is fine if users in exotic situations have to swap to a different lib for one out of 100 requests. Life is much easier for the 99 other cases.
Really though, it is fine if users in exotic situations have to swap to a different lib for one out of 100 requests.
I'd prefer Requests Make Easy Things Easy & Hard Things Possible :)
@piotr-dobrogost Agreed, but if making a hard thing possible requires making an easy thing harder, we'd rather leave the easy thing easy. =)
A few things:
As far as I'm concerned, this has been (and will continue to be, until we come to a decision) undefined behaviour.
As a side note, the fact that many servers don't understand Transfer-Encoding is just a wild assertion based on no evidence that I've seen. Until any evidence is provided, I'm choosing to ignore it.
There's a difference between servers not understanding a chunked Transfer-Encoding and servers not wanting to respect it. I suspect the latter is the case in @bryanhelmig's case. Alternatively, the application could have been written by someone who doesn't understand or know about Transfer-Encoding and so requires a Content-Length.
It's possible that the correct decision is to throw an exception when both a generator and Content-Length are provided. That's viable.
Exceptions might be too extreme in this case. We don't generally raise exceptions for anything other than invalid URLs. The way we process the data and files parameters can raise exceptions but we don't special case anything there. That said, we've been talking about how poor an idea it is when users specify their own Content-Length and Host headers (among others which I'm probably forgetting). Since these aren't technically invalid practices, but rather practices we advise against, I suggest that we instead trigger a warning and then do the right thing in certain well documented situations.
Using warnings is a more gentle way to inform the user what they're doing is not advisable. It also covers the fact that so many of our users do not seem to bother to read the documentation and so the library becomes self-documenting of sorts. This also gives us the ability to change the behaviour in the future. And even better, if users want to disable the warnings, they can because python provides a way of silencing warnings.
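As an illustration of that idea (the warning message and category here are hypothetical, not anything requests actually emits):

import warnings

# What the library might emit when discarding a user-supplied header:
warnings.warn(
    "The user-supplied Content-Length header was removed because the "
    "body is a generator and will be sent with chunked transfer encoding.",
    UserWarning,
)

# Users who accept the behaviour can silence it:
warnings.simplefilter('ignore', UserWarning)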
There's a difference between servers not understanding a chunked Transfer-Encoding and servers not wanting to respect it. I suspect the latter is the case in @bryanhelmig's case. Alternatively, the application could have been written by someone who doesn't understand or know about Transfer-Encoding and so requires a Content-Length.
I actually think this is very likely the case in most of my examples. We (@zapier) attempt to upload files to over a dozen different APIs and the few of them that require Content-Length (seem to) timeout with chunked Transfer-Encoding.
As a side note, the fact that many servers don't understand Transfer-Encoding is just a wild assertion based on no evidence that I've seen. Until any evidence is provided, I'm choosing to ignore it.
I could put together a test suite of how various services respond to Content-Length/Transfer-Encoding, but I feel like even if it is incorrectly implemented APIs, that shouldn't really advise the design of python-requests. Easier still, I could just name names based on my experience fighting this for the last week, but again, if they are API/server bugs, what is the use of such information to python-requests?
I suggest that we instead trigger a warning and then do the right thing in certain well documented situations.
Agree on default behavior, but sometimes reality trumps the "right thing" (especially when you have no control over a potentially broken server). It might be nice to document a technique to override (even if it advocates for the user to do a lot of work, like writing a custom Adapter).
Just to clarify: by "the right thing" I do not necessarily mean the RFC thing. (In case that's what you think I meant)
I could put together a test suite of how various services respond to Content-Length/Transfer-Encoding, but I feel like even if it is incorrectly implemented APIs, that shouldn't really advise the design of python-requests.
I agree that misbehaving servers shouldn't inform our design.
Easier still, I could just name names based on my experience fighting this for the last week, but again, if they are API/server bugs, what is the use of such information to python-requests?
We already know how broken the web is. Regardless, knowing whether we should delete the header after warning the user or leave it would be where that data could be useful.
Bleh. This makes me sad.
Ok, I think @sigmavirus24's plan is the best one here, at least for now.
A few things that came up are distracting me from getting you guys some detailed data, but here is a brain dump of what I've seen:
The biggest offenders I've seen are API endpoints that require multipart upload: they usually need a Content-Length or they freak out on Transfer-Encoding: chunked (I'm not sure which). That isn't the case everywhere, but here are the bits I can pull out of our source with a quick grep:
The reason this is fresh in my mind is that we've built a custom multipart body generator that works with "lazy" files and can also calculate the length; this happened to bubble up a lot of the issues we're talking about here. Right now we're just cheating and doing getvalue() for the failing upload endpoints. We'll likely revisit it someday.
The other examples are harder to grep for, but the file companies (Box, Dropbox, SugarSync, etc...) all get it and work perfectly fine with chunked encoding, so no worries there.
Hopefully this shines a little more light on our real world use case. I wish I could get you more information to confirm which headers commonly cause which errors (they are usually timeouts).
This should be easier to test now that we have a few APIs that we can reproduce against, namely Twitter.
I have the same exact use case: uploading data to a server that does not support chunked encoding. The data length is known in advance, but it does not come from a file (it comes from a generator). For now I manage to work around this problem by adding an additional 'len' attribute to my generator object, so that requests' utils.super_len() returns something and chunked encoding is not chosen by requests: this is ugly and brilliant. As a unix user (and like @piotr-dobrogost), I would expect that the programmer knows what he does, and the library should obey: providing a Content-Length header is a clear indication that chunked encoding is not to be used. But this conflicts with your sentence above: "I would never expect that setting a header would change the behaviour of using a generator though". Well, if it is clearly documented I don't see the point. It wouldn't break the API, would it?
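A minimal sketch of that workaround (the class name is illustrative): requests' utils.super_len() falls back to a len attribute, so the body is streamed with a Content-Length instead of being chunked.

class LengthHintedIterable(object):
    def __init__(self, iterator, size):
        self._iterator = iterator
        self.len = size  # picked up by requests' utils.super_len()

    def __iter__(self):
        return iter(self._iterator)

# Hypothetical usage, with a generator whose total size is known:
# requests.post(url, data=LengthHintedIterable(my_generator(), known_size))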
@netheosgithub I'd argue that changing this behaviour absolutely constitutes a change in the API. Consider the current state of the API, from the documentation:
Requests also supports Chunked transfer encoding for outgoing and incoming requests. To send a chunk-encoded request, simply provide a generator (or any iterator without a length) for your body.
Note that the documentation doesn't say "To send a chunk-encoded request, simply provide a generator and do not set the Content-Length header." This makes this an API change in my book. Not just any API change: a bad one. The idea that setting a header changes the body of the request makes me feel deeply uncomfortable. Consider if someone asked us for a similar feature whereby, if they set the header Content-Type: application/json, we would JSON-encode the data parameter instead of form-encoding it! I'd throw that request out in a heartbeat.
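(For reference, a minimal reproduction of the documented behaviour, matching the capture earlier in this thread: a generator body produces a chunked request.)

import requests

def gen():
    yield b'lol'
    yield b'man'

# No length can be measured, so requests sends Transfer-Encoding: chunked.
requests.post('http://127.0.0.1:8801/', data=gen())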
I think we should try to address the root of the problem: why are people using generators in this situation?
Currently, providing a generator and specifying a Content-Length is an error (it generates an invalid HTTP request), so this combination should not be used by anybody. That's why I was thinking it would not break your users' programs. Why generators? Data is not always provided by a file-like object. For example, I'd like to watch upload progress by yielding data chunk by chunk, etc. (otherwise I'd have to override read() methods).
If only your 'invalid HTTP request' argument were valid. I wish I lived in that world. However, many, many servers will happily accept such a combination.
RFC 2616 provides this illuminating section on determining the message length:
The transfer-length of a message is the length of the message-body as it appears in the message; that is, after any transfer-codings have been applied. When a message-body is included with a message, the transfer-length of that body is determined by one of the following (in order of precedence):
- Any response message which "MUST NOT" include a message-body (such as the 1xx, 204, and 304 responses and any response to a HEAD request) is always terminated by the first empty line after the header fields, regardless of the entity-header fields present in the message.
- If a Transfer-Encoding header field (Transfer-Encoding) is present and has any value other than "identity", then the transfer-length is defined by use of the "chunked" transfer-coding (Transfer Codings), unless the message is terminated by closing the connection.
- If a Content-Length header field (Content-Length) is present, its decimal value in OCTETs represents both the entity-length and the transfer-length. The Content-Length header field MUST NOT be sent if these two lengths are different (i.e., if a Transfer-Encoding header field is present). If a message is received with both a Transfer-Encoding header field and a Content-Length header field, the latter MUST be ignored. ... Messages MUST NOT include both a Content-Length header field and a non-identity transfer-coding. If the message does include a non-identity transfer-coding, the Content-Length MUST be ignored.
To clarify: we are in agreement that Requests is doing the wrong thing. However, we are not in agreement that setting a Content-Length header should turn chunking off. That's the bit that is causing contention.
I should also point out that any server that fails with chunked encoding is also in violation of RFC 2616:
All HTTP/1.1 applications MUST be able to receive and decode the "chunked" transfer-coding, and MUST ignore chunk-extension extensions they do not understand.
This is not an attempt to say that Requests shouldn't fix this problem, it's simply to point out that everyone involved is violating RFC 2616.
I think we should try to address the root of the problem: why are people using generators in this situation?
One of our use cases was taking a possibly very large file with a known size from one service and uploading it to a second service, except the binary data was simply one of the parts of a multipart POST (with JSON metadata as the other part) that the second service required.
So we wrapped the multipart builder in a generator, one we could know the size of. It seemed like a very obvious and useful solution, at least until this requests feature/bug stopped us (I believe we maintain a fork now, but the super_len trick might help undo that!).
We now have other use cases of generators with lens, but I'd have to dig them out.
@bryanhelmig Mm, yes. I think the correct solution there is from now on to use the requests toolbelt's streaming multipart encoder.
Right now though, it's unclear to me why there is such strong resistance to using file-like objects here.
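For the record, a sketch of the toolbelt approach (field names, filename, and URL are illustrative):

import requests
from requests_toolbelt import MultipartEncoder

encoder = MultipartEncoder(fields={
    'metadata': '{"name": "mybig.iso"}',
    'file': ('mybig.iso', open('mybig.iso', 'rb'), 'application/octet-stream'),
})
# The encoder knows its total length, so the multipart body is streamed
# with a Content-Length header rather than chunked.
requests.post('http://example.com/upload', data=encoder,
              headers={'Content-Type': encoder.content_type})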
In issue #1895 I encountered this issue even with a file-like object. The request was a PUT, and the file-like object happened to be sys.stdin, which is in fact so file-like that it triggered chunked encoding like it does for a regular file. Providing a Content-Length (which I learned via outside means) in this situation caused HTTPAdapter.send to not do chunking like one might expect, but PreparedRequest.prepare_body still noticed that the file-like object was iterable and added a Transfer-Encoding header anyway. The server (Amazon S3) choked on that header even though the request would have otherwise worked.
@gholms Nope. sys.stdin is so not file-like that it triggered chunked encoding. If you pass us a file object we will stream it, not chunk it. The problem with sys.stdin is that it has no length, which falls under the description provided in the docs as mentioned above: "any iterator without a length".
The general expectation is that users will never explicitly set the Content-Length header. This is because Requests may end up messing with how the request body is set up.
Finally, the idea that 'one might expect' that providing a Content-Length header will disable chunking isn't true either: I'd expect us to remove the header. =)
Anyway, this discussion has gone on long enough. When provided with a lengthless iterator and a user-supplied Content-Length header, we have these options:

1. Remove the user-supplied Content-Length header and chunk the body anyway.
2. Raise an exception when both are provided.
3. Honour the Content-Length header and send the body without chunking it.

Any of these three options brings us into compliance with the RFC. It is clear that everyone raising this issue prefers (3). @sigmavirus24?
I've always been of the opinion that we should blow away the header. In my experience the user thinks they know what they're doing but rarely know enough to actually know what they're doing. I generally like to let the user shoot themselves in the foot, and we've done that until now and it hasn't gotten us much. In fact, we had the opportunity to blow away that header before and specifically didn't do that. That has led to exactly this issue. Let's be logical about this:
Our API and documentation have always maintained that passing an iterable with no way of measuring length means we will use chunked transfer encoding. By the spec, in this case we should be blowing away the Content-Length header because some servers do not obey it. It should be ignored. We're not doing anything wrong by sending it regardless of the encoding; the server is doing the wrong thing by not ignoring it. To protect against bad servers like this we should be deleting it.
If in 3.0.0 we decide to change the API such that providing the header does not chunk, that's a different beast. We're not considering 3.0.0 though, and that's the real issue here: what can we do to solve this issue now? What we can do is follow the spec.
I'm inclined to agree with you @sigmavirus24. I'll see if I can get a few minutes with Kenneth at some point soon to talk this over with him.
I'm in the same boat as @bryanhelmig and @netheosgithub: I have a generator where I know in advance what size the combined chunks will have, and a server that does not support chunked uploads (a WSGI app; according to my research, WSGI does not support chunked encoding at all). The data from the generator is too big to fit into RAM, so combining the chunks beforehand and passing them to requests is out of the question. Has there been any new development regarding this issue?
@jbaiter then what you need is to pass an object that behaves like a file with a __len__ defined. Take, for example, the toolbelt's MultipartEncoder. You want streaming, so in general all you need is something like this which has a read method and a way of determining its length (preferably by implementing __len__). Your object's API could look something like:
class JBaiterStreamer(object):
    def __init__(self, size, iterator):
        self.size = size
        self.iterator = iterator

    def __len__(self):
        return self.size

    def read(self, *args):
        try:
            return next(self.iterator)
        except StopIteration:
            return b''
Granted I haven't tried to see if that code will work, but something along those lines should.
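Hypothetical usage of that sketch, assuming the total size of the chunks is known up front:

import requests

def chunk_generator():  # placeholder for the real data source
    yield b'first chunk'
    yield b'second chunk'

total_size = 23  # sum of the chunk lengths above
body = JBaiterStreamer(total_size, chunk_generator())
requests.post('http://example.com/upload', data=body)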
Thanks @sigmavirus24 :+1:, I did precisely that. I was just wondering if by now there was a more elegant way to go about it.
Perhaps there's a good way of providing this via the toolbelt, eh @Lukasa ?
@sigmavirus24 Absolutely. =)
@Lukasa any feedback on https://gitlab.com/sigmavirus24/toolbelt/merge_requests/2
@sigmavirus24 "We're not doing anything wrong by sending it regardless of the encoding, the server is doing the wrong thing by not ignoring it." is actually now wrong:
RFC 7230: "A sender MUST NOT send a Content-Length header field in any message that contains a Transfer-Encoding header field."
@ztane this is a different case than what that stack overflow question is referring to. In the future, please closely read a discussion before generating notifications for all of its participants.
@sigmavirus24 that was caused by "Transfer-Encoding: chunked" with Content-Length set => body not chunked. And RFC 7230 forbids setting Content-Length with Transfer-Encoding.
@ztane thank you for continuing to not read this discussion. This particular bug is about someone setting a Content-Length header by hand and expecting requests to not use a chunked upload when they provide a generator (which always triggers a chunked upload with requests). The stackoverflow link you provided and what you're talking about is a different bug that is being handled elsewhere. You're muddying this discussion with irrelevant details because we allow users to shoot themselves in the feet. For example, we allow users to specify an incorrect content-length header or invalid host header, or basically whatever they please. We don't police everything nor will we. Please focus your conversation on the correct issue in the future.
I stumbled into this when using boto3 to do a PUT from a stream against AWS S3 (https://github.com/boto/botocore/issues/911). In my case, the size of the stream is known -- it is an object from another provider. When using S3 version 2 signatures, the "Transfer-Encoding: chunked" is not supported and S3 returns the 501 error. The boto3 documentation seems to imply that setting the Content-Length would get around this issue, as they state the following:
ContentLength (integer) -- Size of the body in bytes. This parameter is useful when the size of the body cannot be determined automatically. [1]
While boto3 should do two things -- fix their documentation and ensure that "Transfer-Encoding: chunked" is not set -- for the indirect consumer of requests this behavior is hard to debug. If the Content-Length header is ignored, raising an exception would at least make it obvious what is going on. As is, it was obfuscated by the fact that S3 only returns a 501 error with an explanation that one of the headers is not supported (but not specifying the header).
The suggested workaround to wrap it and provide the length works, but seems ugly. Would expanding the API to allow for toggling chunked encoding (while keeping the current behavior as default) be a palatable way forward (as opposed to using the Content-Length header as the flag)?
[1] http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.put_object
@timuralp I remain opposed to adding a flag for this. It's really just unacceptable to have a HTTP/1.1 implementation that cannot handle chunked transfer encoding in 2016. It's been a specification requirement for so long that the first specification that required it is nearly old enough to vote in the United States of America: I don't think we can keep cutting entities slack for not doing it.
From my perspective the bug here remains that we may incorrectly emit both Content-Length and Transfer-Encoding. Of course, my perspective is non-binding. ;)
Further, if boto3 wants to forcibly override the Content-Length header, they should utilize the prepared request flow so they can remove the Transfer-Encoding header.
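A sketch of that prepared-request flow (the URL and body here are illustrative):

import requests

def body_generator():  # placeholder body with a known total size
    yield b'part one'
    yield b'part two'

req = requests.Request('PUT', 'https://example.com/upload', data=body_generator())
prepped = req.prepare()  # prepare() adds Transfer-Encoding: chunked for a generator
prepped.headers['Content-Length'] = '16'  # sum of the chunk lengths above
prepped.headers.pop('Transfer-Encoding', None)

with requests.Session() as session:
    response = session.send(prepped)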
@Lukasa @sigmavirus24 fair enough -- thanks for the prompt reply. I'll continue to look to fix the boto issue in that project.
Test script
Actual result
Received on the server:
Expected result
Did not expect "Transfer-Encoding: chunked" since I provided the Content-Length. If requests insists on doing chunked transfer encoding, it should disregard the content length and actually chunk the content (as it does when no Content-Length header is given).