Hi,
HTTPConnection.putrequest does not support the "chunked" Transfer-Encoding for sending data.
For example, I can't do a PUT request with chunked transfer.
Regards, Stephane
What's the use case? Do you have an iterable that yields data whose size is unknown?
AFAIK, most web servers don't even support chunked uploads.
(Removing Python 2.7 from versions as this is clearly a feature request.)
I use http.client in a WebDAV client.
The Mac OS X Finder WebDAV client performs all of its requests in "chunked" mode: PUT and GET.
Here, I use http.client to simulate the Mac OS X Finder WebDAV client.
Regards, Stephane
harobed wrote:
I use http.client in a WebDAV client.
The Mac OS X Finder WebDAV client performs all of its requests in "chunked" mode: PUT and GET.
Here, I use http.client to simulate the Mac OS X Finder WebDAV client.
Now I'm confused. Per the HTTP specification, GET requests don't have a body, so "Transfer-Encoding: chunked" doesn't apply to them.
Are you sure you don't confuse with the response that the server sends? In responses, "Transfer-Encoding: chunked" is very common.
Now I'm confused. Per the HTTP specification, GET requests don't have a body, so "Transfer-Encoding: chunked" doesn't apply to them.
Are you sure you don't confuse with the response that the server sends? In responses, "Transfer-Encoding: chunked" is very common.
Sorry, yes: GET requests have "Transfer-Encoding: chunked" in the server's response. PUT requests can send body data with chunked transfer encoding.
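For illustration (hypothetical host and path), here is what a chunked PUT looks like on the wire; every line ends with CRLF, and each chunk is prefixed by its size in hexadecimal:

```
PUT /dest/file HTTP/1.1
Host: example.com
Transfer-Encoding: chunked

5
hello
6
 world
0

```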
We recently added support for chunked transfer encoding for the POST method, which is exposed via a urllib2 wrapper function. PUT is not exposed via urllib2, so users should use httplib. Chunked transfer could be added to PUT by taking the body of the message as an iterable.
We recently added support for chunked transfer encoding for the POST method, which is exposed via a urllib2 wrapper function.
I couldn't find what you're talking about. If I look at AbstractHTTPHandler.do_request_, it actually mandates a Content-Length header for POST data (hence no chunked encoding).
I'd like to support the request. I have a use case where I definitely need this feature: I maintain a Python client for a scientific metadata catalogue, see 1 for details. The client also features the upload of the data files. The files may come in as a data stream from another process, so my client takes a file-like object as input. The files may be large (several GB), so buffering them is not an option; they must be streamed to the server as they come in. Therefore, I have no way of knowing the Content-Length of the upload beforehand.
I implemented chunked transfer encoding in a custom module that monkey-patches the library; see the attached file. This works fine, but of course it's an awkward hack, as I must meddle deeply in the internals of urllib and http.client to get this working. This module is tested to work with Python 2.7, 3.1, 3.2, 3.3, and 3.4 (for Python 3 you need to pass it through 2to3 first). I really would like to see this feature in the standard library in order to get rid of this hack in my package. I would be happy to transform my module into a patch to the library if such a patch would have a chance of being accepted.
@Rotkraut
The truth is that http in the stdlib is dead. Your best option is to use 3rd-party libs like requests or urllib3. The authors of these libs plan to get rid of httplib entirely; see "Moving away from httplib" (https://github.com/shazow/urllib3/issues/58)
Thanks for the notice! As far as I read from the link you cite, getting rid of the current httplib in urllib3 is planned but far from being done. Furthermore, I don't like packages with too many 3rd party dependencies. Since my package is working fine with the current standard lib, even though using an ugly hack in one place, I'd consider switching to urllib3 as soon as the latter makes it into the standard lib.
I still believe that adding chunked transfer encoding to http.client and urllib in the current standard lib would only require a rather small change that can easily be done such that the lib remains fully compatible with existing code. Still waiting for feedback if such a patch is welcome.
One interesting question is how to convey data to the chunked encoder. There are two sets of options in my mind, a pull interface:
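For example (hypothetical method names, just to show the shape):

```python
# Pull style: the connection pulls the data out of an object from the caller.
conn.send_chunked(chunks)         # hypothetical: any iterable yielding bytes
conn.send_chunked(binary_reader)  # hypothetical: any object with a read() method
```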
and a push interface:
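For example (again hypothetical, but close to the encoder sketched further down):

```python
# Push style: the caller pushes data into an encoder that wraps the connection.
encoder = ChunkedEncoder(conn.send)
encoder.write(b"first chunk")
encoder.write(b"second chunk")
encoder.close()  # emits the terminating zero-length chunk
```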
The advantage of the push interface is that you could fairly easily feed data from an iterable or file reader into it simply by just doing shutil.copyfileobj() or equivalent. But to adapt the pull interface to a push interface would require “asyncio” support or a separate thread or something to invert the flow of control. So I think building the encoder with a push interface would be best. Rolf’s ChunkedHTTPConnectionMixin class appears to only support the pull interface (iterables and stream readers).
I would welcome support for chunked uploading in Python’s “http.client” module, especially with push or stream writer support. I don’t think overwriting _send_request should be necessary; just call putrequest(), putheader() etc manually, and then call send() for each chunk. Perhaps there is scope for sharing the code with the “http.server” module (for encoding chunked responses).
The design goal for my implementation was compatibility. My version can be used as a drop-in replacement for the existing urllib HTTPHandler. The only thing that needs to be changed in the calling code is that it must call build_opener() to select the chunked handler in place of the default HTTPHandler. After this, the calling code can use the returned opener in the very same way as usual.
I guess, switching to a push interface would require major changes in the calling code.
In principle, you could certainly support both schemes at the same time: you could change the internal design to a push interface and then wrap this with a pull interface for compatibility with existing code. But I'm not sure whether this would be worth the effort. If, as Piotr suggests, the current urllib is going to be replaced by urllib3, then I guess it's questionable whether it makes sense to add major design changes that are incompatible with existing code to the current standard lib.
I've attached a patch that implements full Transfer-Encoding support for requests as specified in RFC 7230.
I hit "submit" a little too soon.
The intention of the patch is to adhere to all aspects of Transfer-Encoding as specified in the RFC and to make best guesses as to the encoding that should be used based on the data type of the given body.
This will break backwards compatibility for cases where users are manually chunking the request bodies prior to passing them in and explicitly setting the Transfer-Encoding header. Additionally, if Transfer-Encoding was previously specified, but not chunked, the patch will automatically chunk the body.
Otherwise, the patch should only be additive.
Also note that this patch includes the changes in bpo-23350 as it's directly relevant.
FWIW, so far I've tested this change against:
+ cherrypy 3.6.0
+ uwsgi 2.0.9 (--http-raw-body)
+ nginx 1.6.2 (chunked_transfer_encoding on; proxy_buffering off;) + uwsgi 2.0.9 (--http-raw-body)
The chunked body works as expected. Unfortunately, all implementations seem to be ignorant of the trailer part. So it seems that although RFC-compliant (and I can definitely see the use case for it), the trailer implementation may not be overly practical. I still think that it's worthwhile keeping it, but perhaps adding a note that it may not be supported at this point.
Relevant gists: https://gist.github.com/demianbrecht/3fd60994eceeb3da8f13
After sleeping on this, I think that the best route to go would be to drop the trailer implementation (if it's not practical it doesn't belong in the standard library).
Also, to better preserve backwards compatibility it may be better to circumvent the automatic chunking if transfer-encoding headers are present in the request call. That way, no changes would need to be made to existing code that already supports it at a higher level.
Updated patch changes the following:
+ Removes support for trailers in requests as they're not supported
+ If Transfer-Encoding is explicitly set by the client, it's assumed that the caller will handle all encoding (backwards compatibility)
+ Fixed a bug where the chunk size was being sent as decimal instead of hex
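For reference, RFC 7230 (section 4.1) requires the chunk-size line to be hexadecimal; a minimal sketch of the corrected formatting:

```python
def encode_chunk(data: bytes) -> bytes:
    # Each chunk is "<size in hex>CRLF<data>CRLF"; the size is hex, not decimal.
    return "{:X}\r\n".format(len(data)).encode("ascii") + data + b"\r\n"

encode_chunk(b"hello, world!")  # b'D\r\nhello, world!\r\n' (13 bytes -> 0xD)
```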
I left a few comments on Rietveld, mainly about the documentation and API design.
However I understand Rolf specifically wanted chunked encoding to work with the existing urlopen() framework, at least after constructing a separate opener object. I think that should be practical with the existing HTTPConnection implementation. Here is some pseudocode of how I might write a urlopen() handler class, and an encoder class that should be usable for both clients and servers:
class ChunkedHandler(...):

    def http_request(...):
        # Like AbstractHTTPHandler, but don’t force Content-Length

    def default_open(...):
        # Like AbstractHTTPHandler, but instead of calling h.request():
        encoder = ChunkedEncoder(h.send)
        h.putrequest(req.get_method(), req.selector)
        for item in headers:
            h.putheader(*item)
        h.putheader("Transfer-Encoding", encoder.encoding)
        h.endheaders()
        shutil.copyfileobj(req.data, encoder)
        encoder.close()

class ChunkedEncoder(io.BufferedIOBase):
    # Hook output() up to either http.client.HTTPConnection.send()
    # or http.server.BaseHTTPRequestHandler.wfile.write()
    encoding = "chunked"

    def __init__(self, output):
        self.output = output

    def write(self, b):
        self.output("{:X}\r\n".format(len(b)).encode("ascii"))
        self.output(b)
        self.output(b"\r\n")

    def close(self):
        self.output(b"0\r\n\r\n")
Thanks for the review Martin.
However I understand Rolf specifically wanted chunked encoding to work with the existing urlopen() framework, at least after constructing a separate opener object. I think that should be practical with the existing HTTPConnection implementation.
The original issue was that http.client doesn't support chunked encoding. With this patch, chunked encoding should more or less come for free with urlopen. There's absolutely no reason why HTTPConnection should not support transfer encoding out of the box, given it's part of the HTTP/1.1 spec. I do understand that some modifications are needed in urllib.request in order to support the changes here, but I didn't include those so as not to conflate the patch. I think that it's also reasonable to open a new issue to address the feature in urllib.request rather than piggy-backing on this one.
Perhaps you should make a table of some potential body object types, and figure out what the behaviour should be for request() with and without relevant headers, and endheaders() and send() with and without encode_chunked=True:
Potential body types:
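For instance, naming only the types that come up in this thread:
+ bytes, bytearray, and other bytes-like objects such as memoryview
+ str, which needs encoding to bytes first
+ binary file objects, including pipes and custom objects that only have a read() method
+ text file objects such as io.StringIO, with or without a mode attribute
+ GzipFile objects, which can seek to the end but may not seek backwards
+ mmap objects, which are both bytes-like and file-like
+ generators and other iterables yielding bytes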
This list could go on and on. I would rather have a couple of simple rules, or explicit APIs for the various modes so you don’t have to guess which mode your particular body type will trigger.
Agreed. However, I'm wondering if that should belong in a new issue geared towards further clarifying the behaviour of request body types. The patch introduces the behaviour this specific issue was looking for, with the largest change being that iterators may now result in chunked transfer encoding for the data currently handled by the library. I'd rather move forward with incremental improvements than keep building on each issue before the work's merged (there's still a /good/ amount of work to be done in this module).
The incremental improvement thing sounds good. Here are some things which I think are orthogonal to sensible chunked encoding:
What's the status on this one? It looks like some review comments need addressing.
What's the status on this one? It looks like some review comments need addressing.
That's about it. Unfortunately I've been pretty tied up over the last month and a bit. I'll try to get to it and hopefully close this out over the next few days.
Latest patch should address all outstanding comments.
BTW, thanks for the reviews Martin and the nudge Antoine!
I left some new comments.
However I remain concerned at how complicated and overloaded the API is becoming. It certainly makes it hard to review for correctness. I could easily have missed some corner case that is broken by the changes. There are a lot of odd body objects apparently permitted, e.g. GzipFile objects can always seek to the end but may not be able to go backwards, mmap() objects are both bytes-like and file-like.
Hi again,
first of all, sorry for not contributing to the discussion for such a long time. I was quite busy lately.
I tested the patch with Python 3.5.0a3. It works nicely for my use case. Thanks a lot!
I have one issue though: urllib's HTTPHandler and HTTPSHandler still try to enforce a Content-Length header (in AbstractHTTPHandler.do_request_()). But for chunked transfer encoding, the Content-Length must not be set. With this patch, HTTPConnection also checks for the Content-Length in _send_request() and sets it if missing. AFAICS, HTTPConnection now does a far better job of checking this than HTTPHandler and, most importantly, does it in a way that is compatible with chunked transfer encoding. So I would suggest that there is no need for HTTPHandler to care about the content length and that it should just leave this header alone.
E.g., I suggest that the "if not request.has_header('Content-length'): [...]" statement be removed completely from AbstractHTTPHandler.do_request_() in urllib/request.py.
Thanks for the report Rolf. I'll look into your suggestion for this patch.
Antoine: Given beta is slated for the 24th and Martin's legitimate concerns, I think it might be a little hasty to get this in before feature freeze. Knocking it back to 3.6 (obviously feel free to change if you think it should remain for any reason).
This bug and Demian’s patch were originally about changing the low-level http.client.HTTPConnection class. I suggest opening a separate issue for changing urllib.request.urlopen() or similar, to avoid confusion.
It should actually be possible to support chunking in urlopen() without any new support from HTTPConnection; just use the lower-level putrequest(), putheader(), endheader(), and send() methods. However, consideration should be given to how it interacts with handlers like HTTPBasicAuthHandler that resend POST data after getting 401 Unauthorized (similar scenario to bpo-5038).
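For illustration, a sketch of that approach against the unmodified HTTPConnection API (hypothetical host and path, no error handling):

```python
import http.client

conn = http.client.HTTPConnection("example.com")
conn.putrequest("PUT", "/upload")
conn.putheader("Transfer-Encoding", "chunked")
conn.endheaders()
for chunk in (b"hello ", b"world"):
    # Each chunk: hex size line, payload, CRLF (RFC 7230, section 4.1).
    conn.send("{:X}\r\n".format(len(chunk)).encode("ascii"))
    conn.send(chunk)
    conn.send(b"\r\n")
conn.send(b"0\r\n\r\n")  # terminating zero-length chunk
response = conn.getresponse()
```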
I disagree. I believe that the suggested modification of AbstractHTTPHandler.do_request_() belongs in this change set, for the following reasons:
"This module [http.client] defines classes which implement the client side of the HTTP and HTTPS protocols. It is normally not used directly — the module urllib.request uses it to handle URLs that use HTTP and HTTPS." Quote from the http.client documentation. urllib.request is the high level API for HTTP requests. Both modules must fit together. Since urllib's HTTPHandler directly calls HTTPConnection, it can and should rely on the abilities of HTTPConnection.
The code in AbstractHTTPHandler is based on the assumption that each HTTP request with a non-empty body must have a Content-Length header set. The truth is that an HTTP request must either have a Content-Length header or use chunked transfer encoding (and then must not have a Content-Length header). As long as the underlying low-level module did not support chunked transfer encoding anyway, this assumption might have been acceptable. Now that this change set introduces support for chunked transfer encoding, this assumption is plain wrong and the resulting code simply faulty.
This change set introduces a sophisticated determination of the correct content length, covering several different cases, including file-like objects and iterables. There is no need any more for the high-level API to care about the content length if this is already done in the low-level method. But even worse, all the effort of HTTPConnection to determine the proper content length is essentially overridden by the rather blunt method in the high-level API, which gets priority and essentially insists that the body be a buffer-like object. This is strange.
The very purpose of this change set is to implement chunked transfer encoding. But this is essentially disabled by a high-level API that insists on a Content-Length header being set. This is plain silly.
Just to illustrate the current situation, I attach the modified version of my old chunkedhttp.py module, adapted to Demian's patch, that I have used to test it. It shows how a user would need to monkey-patch the high-level API in order to be able to use the feature that is implemented by this change in the low-level module.
I wouldn't mind filing another issue against urllib.request.HTTPHandler if this makes things easier. But what I really would like to avoid is having any Python version in the wild that has this current issue fixed, but HTTPHandler still broken. Having to support a wide range of different Python versions is difficult enough for third-party library authors like me. Adding a switch to distinguish between 3.6 (I can use the standard lib right away) and older (I need to replace it with my own old chunkedhttp implementation) is OK. But having to support a third case in between (the low-level HTTPConnection module would work, but I need to monkey-patch the high-level API in order to be able to use it) would make things awkward. That's why I would prefer to see the fix for HTTPHandler in the same change set.
FWIW, I was intending to address the issues Rolf raised. Also, I agree that a patch shouldn't knowingly have negative knock-on effects to dependents.
Okay perhaps a new issue at this stage isn’t the best idea, especially if you want both modules updated at the same time.
With the urlopen() API, it currently has no explicit support for file objects. So you either have to make sure they get treated like any other iterable, or add in explicit support.
Python 3.6.0a2 was released this week. I guess we need to speed up a bit. I really would like to see this issue fixed in 3.6.0 final.
I updated Demian's patch to match the current Python sources; see the attachment. I also addressed the issue in urllib.request. It turned out that this was a little more involved than just removing the corresponding if statement from AbstractHTTPHandler.do_request_() as suggested above: if the Content-Length has been added by the lib, a caller of urllib might expect to find that header set in the Request object afterwards. At least, some tests do.
But I still believe urllib.request should refrain from trying to calculate the content length itself and should rather rely on http.client for this, because it is crucial that the interpretation of the various body representations (buffer, file, iterable) while calculating this length matches the code in http.client that reads the body when sending it to the server.
So, I amended Demian's patch as follows:
+ change the helper _get_content_length() in http.client (see the sketch below) to a static method of HTTPConnection, in order to allow urllib.request to call it.
+ review _get_content_length(); there were a few issues in special cases.
+ add an optional argument encode_chunked to http.client.HTTPConnection.request(). If set to True, HTTPConnection will do the chunk encoding even if the Transfer-Encoding header is already set. This is needed because urllib.request now sets all required headers in the request, including Transfer-Encoding if needed, but does not handle the chunk encoding itself.
+ if the request has a non-empty body and Content-Length is not set, urllib.request.AbstractHTTPHandler calls http.client.HTTPConnection._get_content_length() to calculate the content length. If this returns None, it sets the Transfer-Encoding: chunked header.
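A minimal sketch of how that helper classifies bodies, assuming the rules discussed in this thread (the real patch has more detail):

```python
_METHODS_EXPECTING_BODY = {'PATCH', 'POST', 'PUT'}

def _get_content_length(body, method):
    # Return the body size in bytes when it is cheaply determinable,
    # else None, meaning: fall back to Transfer-Encoding: chunked.
    if body is None:
        # Some methods are expected to carry a body, even an empty one.
        return 0 if method.upper() in _METHODS_EXPECTING_BODY else None
    if hasattr(body, 'read'):
        return None  # file-like object: size not known in general
    try:
        # Bytes-like objects (bytes, bytearray, memoryview, mmap, ...).
        return memoryview(body).nbytes
    except TypeError:
        pass
    if isinstance(body, str):
        return len(body)  # http.client encodes str as latin-1, one byte per char
    return None  # e.g. an iterable of bytes: length unknown
```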
FYI instead of changing the helper into a static method, I think you could have just called http.client._get_content_length().
I don’t understand why we need the new encode_chunked=True flag. Can’t urllib.request leave the Transfer-Encoding field up to http.client? Yes it does set Content-Length in the Request object, but it does not set Accept-Encoding nor Connection. Also, it looks like you can still get chunked encoding if you set encode_chunked=False, which is confusing.
I left some review comments, and I think some of my earlier comments still apply. I still find the variety of objects accepted, and the different levels that the APIs work on, confusing. Hopefully it will become less confusing if we can figure out all the corner cases. I do think unifying the data types accepted by urlopen() and HTTPConnection is good, though I am not sure allowing text in urlopen() is the right direction.
Martin, thank you for your review! I have to work through the list of comments and will get back to you when I'm done. Please note that I didn't get your earlier reviews, so I don't know which comments are new and which are left over from earlier reviews.
Of course, I could also have called http.client._get_content_length(). But I believe it would be bad coding style to call an internal helper function from outside the module. Furthermore, _get_content_length() is closely related to HTTPConnection.send(), because it's vital that both methods share the same interpretation of the various body representations. That is why I felt this should rather be a method of HTTPConnection. But I do not have a strong opinion about this and would be happy to change it back if you prefer.
Concerning the new encode_chunked=True flag in HTTPConnection.request(), I guess there is no ideal solution. I can see the following alternatives, all of them having advantages and problems:
1. Leave it to http.client. Do not add the Content-Length or Transfer-Encoding header in the Request object in urllib.request. Pro: simple & clean, no extra flag. Con: changes current behavior, possibly breaks existing code.
2. Add Content-Length, but not Transfer-Encoding, in urllib.request. Pro: no extra flag. Con: inconsistent; Content-Length and Transfer-Encoding are closely related and serve the same purpose, so it's illogical to treat them differently in urllib.
3. Care about both headers in urllib.request, adding either Content-Length or Transfer-Encoding. Always do the chunk encoding in http.client, even if the Transfer-Encoding header is already set. Pro: the most consistent solution, no extra flag. Con: changes current behavior, possibly breaks existing code.
4. Care about both headers in urllib.request, adding either Content-Length or Transfer-Encoding. Add the extra flag to HTTPConnection.request(). Do the chunk encoding in http.client if either the Transfer-Encoding header is not set or the extra flag is set. Pro: consistent & compatible. Con: extra flag.
In this situation, my patch implements option 4, also because I believe it's most important to keep the high-level API clean and consistent. But I would also be happy to switch to option 2.
And yes, the encode_chunked flag in HTTPConnection.request() is only relevant if the Transfer-Encoding header is set. If Content-Length is set, chunked encoding is illegal anyway. If neither Content-Length nor Transfer-Encoding is set, http.client must decide on its own to set either the one or the other; and if it takes this decision, it must also implement it. In either of those cases, the question of whether http.client should do the chunk encoding does not arise.
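A condensed sketch of that decision (illustrative names; _get_content_length as sketched earlier):

```python
def _should_chunk(headers, body, method, encode_chunked):
    # Mirrors the behaviour described above; not the actual library code.
    names = frozenset(k.lower() for k in headers)
    if 'content-length' in names:
        return False           # explicit length: chunked encoding is illegal
    if 'transfer-encoding' in names:
        return encode_chunked  # caller's header; the flag says who encodes
    # Neither header set: http.client decides on its own, and implements it.
    return body is not None and _get_content_length(body, method) is None
```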
Concerning the variety of body types, it is essentially just bytes-like objects, file-like objects, and iterables of bytes-like objects. All three have important use cases, so we need to support them. By text, I guess you mean str? Yes, I agree, disallowing str would make things much cleaner, so I would be happy to drop it. But str is currently explicitly supported in http.client, so I guess we can't get rid of it.
Martin, I'm almost through with your comments. There is however one item that I need some feedback on: you asked for more documentation on the new EncodingError exception. This was introduced by Demian. It is raised when the Transfer-Encoding header set by the caller is invalid. I'm not sure about this. What is the purpose of the HTTPException tree of exceptions in http.client? Is this meant to be a collection of all kinds of exceptions raised in http.client, or should a (child of) HTTPException indicate that an exception occurred while talking to the HTTP server? In the latter case, the EncodingError class as introduced now would be just wrong. Maybe we should raise a ValueError instead?
Here comes a new version of the patch. I believe I addressed all review comments made on issue12319_7.patch, with the exception of the EncodingError mentioned in the last post.
Most relevant changes compared to last patch:
Another question is: what is the best method to test whether a file-like object is text or binary? For the moment, I kept the original code, which essentially does:
try:
    isText = "b" not in readable.mode
except AttributeError:
    isText = False
This logic fails for StringIO and possibly others. Alternatives could be:
isText = isinstance(readable, io.TextIOBase)
isText = hasattr(readable, 'encoding')
or
isText = isinstance(readable.read(0), str)
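For comparison, here is how the candidate tests behave on io.StringIO, a text stream without a mode attribute:

```python
import io

s = io.StringIO("some text")
hasattr(s, "mode")            # False: the mode-based check misclassifies StringIO
isinstance(s, io.TextIOBase)  # True: StringIO derives from TextIOBase
isinstance(s.read(0), str)    # True: read(0) returns '' without consuming data
```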
What do you guys suggest?
*ping*
It's more than four weeks since I revised my patch, and it's still waiting for review. I also still need feedback on the question of whether the new EncodingError exception introduced by Demian should be kept or whether I should change it to a ValueError.
Thanks for your review! Here comes a new version of the patch. I believe I addressed all review comments.
Since I didn't get any feedback on my questions raised above, I resolved them as follows:
I dropped the EncodingError exception introduced by an earlier version of the patch by Demian. This error was raised when the caller set an illegal combination of headers. Since the error does not occur as a result of talking to the HTTP server, I think an HTTPException or a child thereof is just wrong. A ValueError is more appropriate here, and that is what the code raises now.
I changed the test of whether a file-like object is a text stream to isText = isinstance(readable.read(0), str). This relies only on the read() method and is the most generic test. The original code would fail for StringIO and other text streams that do not have the mode attribute. Note that I needed to fix a test that violated the file object protocol with a read() method that ignored the size parameter.
Yes when I say “text” I mean str objects, as opposed to byte strings, etc.
I wonder if it would be safer to test for TextIOBase. With the read(0) hack, the zero size may be unexpected. A file object may need special handling to generate an empty result, or to not interpret it as an EOF indicator. See e.g. bpo-23804, about zero-sized SSL reads, where both of these problems happened. As long as we document that the text file logic has changed, I think it would be okay. IMO it is better to keep things simple and robust for byte files than add more hacks for text files.
A common base class like HTTPException is useful when a caller may need to handle a set of exceptions in a common way. But a problem with Content-Length or Transfer-Encoding seems like a programming error, so ValueError seems fine IMO.
However I still don’t see why the checks are worthwhile. Sure having both is a technical violation of the current HTTP protocols. But if a caller is likely to end up in one of these situations, it may indicate a problem with the API that we should fix instead. If the checks are unlikely to ever fail, then drop them. Otherwise, I don’t think any of the bugs I pointed out have been addressed.
I don’t like converting bytes-like to bytes. I understand one of the main purposes of accepting bytes-like objects is to avoid copying them, but converting to bytes defeats that. Also, the copying is inconsistent with the lack of copying in _read_iterable().
I am wondering if it would be best to keep new logic out of send(); just put it into _send_output() instead. Once upon a time, send() was a simple wrapper around socket.sendall() without much bloat. But with the current patch, in order to send a single buffer, we have to jump through a bunch of checks, create a dummy lambda and singleton tuple, loop over the singleton, call the lambda, realize there is no chunked encoding, etc. IMO the existing support for iterables should have been put in _send_output() too (when that was an option). There is no need to use send() to send the whole body when endheaders() already supports that.
Martin,
I wonder if it would be safer to test for TextIOBase. With the read(0) hack, the zero size may be unexpected. A file object may need special handling to generate an empty result, or to not interpret it as an EOF indicator. See e.g. bpo-23804, about zero-sized SSL reads, where both of these problems happened. As long as we document that the text file logic has changed, I think it would be okay. IMO it is better to keep things simple and robust for byte files than add more hacks for text files.
The problem is that there is no accepted definition of what a file-like object is. As a consequence, each test method will fail in some cases. Not all file-like objects are derived from the base classes in the io module. It is common practice to simply define an object with a read() method and to consider this a file-like. This pattern can even be found in the official Python test suite. Testing for TextIOBase will fail in these cases.
There seems to be at least a minimal consensus that each file-like object (that supports reading) must have a read() method implementing the interface defined by the io base classes. The read(0) test (I wouldn't call it a hack, as it only uses this documented API) has the advantage of making no other assumptions than this. And I would not even accept bpo-23804 as a counter-example, as that was clearly a bug in the ssl module.
The current test using the mode attribute is certainly the worst one as it even fails for standard classes from the io module.
Anyway. Even though I would prefer the read(0) test, I am happy to accept any other method. Please tell me what I should use.
However I still don’t see why the checks are worthwhile. Sure having both is a technical violation of the current HTTP protocols. But if a caller is likely to end up in one of these situations, it may indicate a problem with the API that we should fix instead. If the checks are unlikely to ever fail, then drop them.
The question is whether we should apply sanity checks to the caller's input or whether we should rather send bullshit out and wait for the request to fail. I'd rather give the caller a somewhat helpful error message than a less helpful "400 Bad Request" from the server. But if you really insist, I can also remove these checks.
I don't understand your remark on the API and what the likelihood of rubbish header values set by the caller has to do with it, sorry.
Otherwise, I don’t think any of the bugs I pointed out have been addressed.
Could you elaborate on what bugs you are talking about? I really don't know, sorry.
I don’t like converting bytes-like to bytes. I understand one of the main purposes of accepting bytes-like objects is to avoid copying them, but converting to bytes defeats that. Also, the copying is inconsistent with the lack of copying in _read_iterable().
I added this conversion because you mentioned bpo-27340 and the problem of sending bytes-like objects with the ssl module. I agree to leave it to bpo-27340 to find a better solution and will remove the conversion.
I am wondering if it would be best to keep new logic out of send(); just put it into _send_output() instead. Once upon a time, send() was a simple wrapper around socket.sendall() without much bloat. But with the current patch, in order to send a single buffer, we have to jump through a bunch of checks, create a dummy lambda and singleton tuple, loop over the singleton, call the lambda, realize there is no chunked encoding, etc. IMO the existing support for iterables should have been put in _send_output() too (when that was an option). There is no need to use send() to send the whole body when endheaders() already supports that.
No problem, I can move all this to _send_output(). The only problem I see then is that the current library version of send() already accepts file-like objects (both text and byte streams), bytes, and iterables. So we will end up with two different methods accepting this variety of data types and behaving differently in the respective cases.
Maybe, as a consequence, one should restrict send() to accept only bytes in order to tighten consistency. But this is not in the scope of this issue and will not be covered by my patch.
Please note that I will be on holiday starting next Monday. We should try to get this done during this week.
Ok, here comes yet another version of the patch, addressing the review comments.
As indicated in the last post and requested by Martin, I moved the logic dealing with the different body types out of send() into _send_output(). I also removed the conversion from bytes-like objects to bytes.
I had some problems with some comments, mainly on the documentation of urllib.request, asking for stuff unrelated to chunked transfer encoding and the present issue. I have the impression this is starting to get out of hand. Time is running out, and I suggest that we keep the focus on the present issue.
There were also requests to drop support for stuff that works in the current library, although undocumented, and that is explicitly covered by the test suite. I'm fine with dropping that and have also adapted the tests accordingly. But please be sure that what you are asking for is really what you want.
Please note that this patch is not yet complete. I still need your decision on the two issues raised in the last post: the test method to use for text streams and whether I should drop the sanity checks for the transfer-encoding header.
I just wrote:
There were also requests to drop support for stuff that works in the current library, although undocumented, and that is explicitly covered by the test suite. I'm fine with dropping that and have also adapted the tests accordingly. But please be sure that what you are asking for is really what you want.
Sorry, please disregard this comment. I just realized that these tests were added by previous versions of the patch by Demian.
(Assigning this to Martin)
I really like this patch, and I support that we land this before the feature freeze of 3.6.
As long as the API interface presented to users is simple, the changes internal to _send_output should not be a concern. Yeah, I agree that multiple things need to be taken care of, but it is alright.
Also, +1 to validating Transfer-Encoding usage as the patch does. This is along the lines of the RFC recommendation, and it is better to fail early than to let the server implementation be the decider.
I would like to know the best suggestion for "testing text" too. (Adding haypo to steer us here.) Although, I would point out that the static method has a generic name, _is_textIO, while relying on the .read attribute, which is checked outside the static method.
Reading the last patch, I noticed a mutable as the default value of the HTTPConnection.request method. I have reported this in a separate issue with an attached patch: http://bugs.python.org/issue27728
_is_textIO(): I’m sorry but I still prefer the TextIOBase check over read(0). Each option has a disadvantage, but with TextIOBase, the disadvantage only affects text files, not byte files.
Maybe another option is to stick with the current checking of the “mode” attribute, or are its failings significant for chunked encoding? It seems to me that the most important use cases would be reading bytes from a pipe file, custom bytes file object reader, generator, etc, but not a text file without a “mode” attribute.
Checking Content-Length and Transfer-Encoding: I would prefer to remove this, unless there is a “good” reason to keep them. I want to understand why the three specific checks were added, and what made them more special than other potential checks (e.g. the format of the Content-Length value, or that the resulting body matches it). If the reason is something like “it is too easy for the caller to accidentally trigger the problem”, then there may be another way to fix it. Or maybe it is a mitigation for a security problem? But at the moment, it just seems to me like code being too smart for its own good.
I mentioned a few possible bugs with parsing Transfer-Encoding at https://bugs.python.org/review/12319/diff/14900/Lib/http/client.py#newcode1210. The format of the Transfer-Encoding value is specified at https://tools.ietf.org/html/rfc7230#section-3.3.1 and https://tools.ietf.org/html/rfc7230#section-4. In bpo-23498, I identified some parts of Python that parse header values like this, although it looks like http.cookiejar.split_header_words() may be the only one usable for Transfer-Encoding.
Some of the header field validation was included in Demian’s first patch, raising HTTPEncodingError. But I don’t know why he included it. Maybe it was more appropriate then; it looks like that patch took a different approach than the current encode_chunked=True mechanism.
Martin,
_is_textIO(): I’m sorry but I still prefer the TextIOBase check over read(0). Each option has a disadvantage, but with TextIOBase, the disadvantage only affects text files, not byte files.
Ok, this is a valid argument.
Maybe another option is to stick with the current checking of the “mode” attribute, or are its failings significant for chunked encoding? It seems to me that the most important use cases would be reading bytes from a pipe file, custom bytes file object reader, generator, etc, but not a text file without a “mode” attribute.
No, this is not significant for chunked encoding. As I said, I am happy to accept any method.
But I would say it's a fairly safe bet that any file-like object that has the mode attribute is in fact derived from the base classes in the io module. If this assumption holds, then the TextIOBase check will always work in all cases where the mode attribute check would. The inverse is not true. So I'll use the TextIOBase check; I hope you agree.
Checking Content-Length and Transfer-Encoding: I would prefer to remove this, unless there is a “good” reason to keep them. I want to understand why the three specific checks were added, and what made them more special than other potential checks (e.g. the format of the Content-Length value, or that the resulting body matches it). If the reason is something like “it is too easy for the caller to accidentally trigger the problem”, then there may be another way to fix it. Or maybe it is a mitigation for a security problem? But at the moment, it just seems to me like code being too smart for its own good.
Ok, I'll drop them.
I mentioned a few possible bugs with parsing Transfer-Encoding at https://bugs.python.org/review/12319/diff/14900/Lib/http/client.py#newcode1210. The format of the Transfer-Encoding value is specified at https://tools.ietf.org/html/rfc7230#section-3.3.1 and https://tools.ietf.org/html/rfc7230#section-4. In bpo-23498, I identified some parts of Python that parse header values like this, although it looks like http.cookiejar.split_header_words() may be the only one usable for Transfer-Encoding.
I admit I didn't read all the comments on Demian's versions of the patch. Since I'll drop the checks, I guess these problems are gone now.
Some of the header field validation was included in Demian’s first patch, raising HTTPEncodingError. But I don’t know why he included it. Maybe it was more appropriate then; it looks like that patch took a different approach than the current encode_chunked=True mechanism.
I don't think this is in the current versions of the patch any more.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/vadmium'
closed_at =
created_at =
labels = ['type-feature', 'library']
title = '[http.client] HTTPConnection.request not support "chunked" Transfer-Encoding to send data'
updated_at =
user = 'https://bugs.python.org/harobed'
```
bugs.python.org fields:
```python
activity =
actor = 'martin.panter'
assignee = 'martin.panter'
closed = True
closed_date =
closer = 'martin.panter'
components = ['Library (Lib)']
creation =
creator = 'harobed'
dependencies = []
files = ['36493', '38351', '38352', '38528', '38670', '38772', '38797', '39447', '39466', '43398', '43579', '43973', '44057', '44076', '44133', '44217']
hgrepos = []
issue_num = 12319
keywords = ['patch', 'needs review']
message_count = 63.0
messages = ['138203', '138242', '138258', '138296', '138357', '138691', '171268', '226012', '226018', '226024', '236024', '236037', '237309', '237312', '237313', '237427', '237509', '238314', '239119', '239151', '239769', '239771', '239793', '243518', '243601', '243730', '243731', '243747', '243751', '243762', '243800', '243823', '243837', '243878', '268611', '268715', '268894', '269430', '269497', '271771', '271815', '272121', '272164', '272239', '272241', '272322', '272354', '272394', '272395', '272413', '272438', '272442', '272937', '273532', '273533', '273534', '273536', '273538', '273547', '273548', '273628', '273751', '273758']
nosy_count = 13.0
nosy_names = ['orsenthil', 'pitrou', 'vstinner', 'harobed', 'python-dev', 'petri.lehtinen', 'martin.panter', 'piotr.dobrogost', 'eryksun', 'demian.brecht', 'matrixise', 'whitemice', 'Rotkraut']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue12319'
versions = ['Python 3.6']
```