puppetlabs / clj-http-client

HTTP client library wrapping Apache HttpAsyncClient
Apache License 2.0

(TK-145) fix application of default charset #34

Closed. KevinCorcoran closed this pull request 9 years ago.

KevinCorcoran commented 9 years ago

Fix a bug in which a default charset of UTF-8 would be added to the Content-Type header for certain requests. This change restricts that behavior to cases in which the body is a string and there is no charset provided by the caller.
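A minimal sketch of the guard this change describes. The method name and signature here are hypothetical, not the library's actual API; it just illustrates "string body + no caller-supplied charset" as the only case that gets the default:

```java
public class CharsetDefaulting {
    // Hypothetical sketch: append the UTF-8 default only when the body is a
    // String AND the caller's Content-Type carries no explicit charset.
    static String contentTypeFor(Object body, String contentType) {
        boolean hasCharset = contentType.toLowerCase().contains("charset=");
        if (body instanceof String && !hasCharset) {
            return contentType + "; charset=utf-8";
        }
        return contentType;  // otherwise, pass the header through untouched
    }

    public static void main(String[] args) {
        // String body, no charset -> default appended
        System.out.println(contentTypeFor("hello", "application/json"));
        // Caller already specified a charset -> preserved as-is
        System.out.println(contentTypeFor("hello", "text/plain; charset=iso-8859-1"));
        // Non-string body (e.g. a stream) -> header preserved as-is
        System.out.println(contentTypeFor(new byte[0], "application/octet-stream"));
    }
}
```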

@cprice404 @camlow325 If you two fine gentlemen would please review this, that'd be swell.

cprice404 commented 9 years ago

Travis tests were failing due to the JDK SSLv3 security release, which @nwolfe fixed in #35 . This needs a rebase now.

cprice404 commented 9 years ago

The code looks reasonable. I wish I had a bit more recollection of why we added that stuff in the first place, and what failures it was causing, just to make sure we don't think this will cause any regressions. But I trust that you did the archeology on that, and the logic here seems reasonable.

KevinCorcoran commented 9 years ago

> I wish I had a bit more recollection of why we added that stuff in the first place, and what failures it was causing, just to make sure we don't think this will cause any regressions. But I trust that you did the archeology on that

I did. To shed a little light: IIUC, there were two separate but related issues. The first was this issue with the default charset (UTF-8 vs. ISO-8859-1). The other, and seemingly much more important, issue was that, according to https://github.com/puppetlabs/clj-http-client/commit/f83d7821adde096dbc25d3ab48c4a2e70f837dac ...

> the request body string was always encoded to ISO-8859-1 and was not necessarily in sync with the value of the Content-Type header

These changes shouldn't cause the content-type header to get out of sync with the body, and I trust that some tests would fail if such a regression were introduced.
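The invariant being preserved can be sketched like this (hypothetical helper, not the library's actual code): whatever charset the Content-Type header advertises is the same charset used to produce the body bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SyncInvariant {
    // Hypothetical sketch: encode the body with exactly the charset that the
    // header advertises, so the two can never drift out of sync.
    static byte[] encodeBody(String body, Charset cs) {
        return body.getBytes(cs);
    }

    public static void main(String[] args) {
        Charset advertised = StandardCharsets.UTF_8;
        String header = "application/json; charset=" + advertised.name().toLowerCase();
        byte[] wire = encodeBody("{\"city\": \"Zürich\"}", advertised);
        // Round-trips cleanly because header and body agree:
        System.out.println(header);
        System.out.println(new String(wire, advertised));
    }
}
```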

camlow325 commented 9 years ago

The important thing that the original change addressed was the ambiguity in how the client would present the encoded body for the request.

Consider a request submitted to clj-http-client which included the following header:

Content-Type: application/json

Behavior before f83d782:

With no charset in the original Content-Type, the clj-http-client library ended up picking a default, ISO-8859-1, for the encoding.

The receiver, not seeing a charset in the header, would pick its own default for decoding, in this case UTF-8 (probably because application/json's inherent default is UTF-8).
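The mismatch is easy to demonstrate. In this standalone sketch, the sender encodes with ISO-8859-1 (the old default) while the receiver guesses UTF-8; any non-ASCII character gets mangled:

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    public static void main(String[] args) {
        String original = "résumé";
        // Sender encodes with ISO-8859-1 (the old library default) ...
        byte[] wire = original.getBytes(StandardCharsets.ISO_8859_1);
        // ... receiver, seeing no charset in the header, guesses UTF-8.
        String decoded = new String(wire, StandardCharsets.UTF_8);
        // The é bytes (0xE9) are not valid UTF-8, so they decode to
        // replacement characters and the round-trip fails.
        System.out.println(decoded);
        System.out.println(original.equals(decoded));  // false
    }
}
```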

Behavior with f83d782:

With no charset in the original Content-Type, the clj-http-client library again picked a default, although we changed the default to be UTF-8. More importantly, the charset was specifically appended to the Content-Type header, resulting in the following being sent on the wire:

Content-Type: application/json; charset=utf-8

Because the encoding that was performed on the request body always agreed with the charset in the Content-Type header, the receiver would be able to reliably use the charset value in the Content-Type header to decode the body and not have to resort to using its own default.
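On the receiving side, that reliability looks roughly like this (a hypothetical, simplified parser for illustration; real receivers use a proper header parser):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReceiverDecode {
    // Hypothetical receiver-side sketch: trust the charset attribute when the
    // sender includes one, instead of guessing a default. (Naive parsing:
    // assumes charset is the last parameter in the header.)
    static String decode(byte[] body, String contentType) {
        Charset cs = StandardCharsets.UTF_8;  // fallback only if absent
        int i = contentType.toLowerCase().indexOf("charset=");
        if (i >= 0) {
            cs = Charset.forName(contentType.substring(i + "charset=".length()).trim());
        }
        return new String(body, cs);
    }

    public static void main(String[] args) {
        byte[] wire = "{\"name\": \"café\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(wire, "application/json; charset=utf-8"));
    }
}
```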

Behavior with this PR:

For "string" type request bodies, the behavior should be exactly the same as with f83d782. The only change this PR effectively introduces is that for InputStream type request bodies, clj-http-client won't add a charset to the Content-Type header. That seems totally appropriate to me: the library isn't actually encoding anything for this kind of request, just passing through the raw bytes. So a Content-Type of "application/octet-stream" passed into the clj-http-client request will be preserved as "application/octet-stream" rather than having a bogus "charset" attribute appended, which would be inappropriate and potentially confusing to the receiver.
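The stream case can be sketched as a pure pass-through (hypothetical helper name; the point is only that no character encoding happens, so there is no charset the library could truthfully advertise):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class StreamPassThrough {
    // Hypothetical sketch: an InputStream body is forwarded byte-for-byte.
    // No charset is applied, so none should be appended to Content-Type.
    static byte[] drain(InputStream body) throws IOException {
        return body.readAllBytes();  // raw bytes, untouched
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {(byte) 0x89, 0x50, 0x4E, 0x47};  // e.g. PNG magic bytes
        byte[] sent = drain(new ByteArrayInputStream(payload));
        System.out.println(Arrays.equals(payload, sent));  // true: byte-for-byte
    }
}
```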

KevinCorcoran commented 9 years ago

@camlow325 @cprice404 I added 95a88286ce62444113719008f5b1b0eca8b92535 to this PR, which applies the same change to the Java client.

And thanks for the good explanation, @camlow325.

camlow325 commented 9 years ago

Maybe the only long-term problem with this approach is the case where the inherent default for a MIME type is not UTF-8 and the MIME type doesn't technically support specifying a charset, i.e., it is "hardcoded" to always be something other than UTF-8. In that case, the receiver might ignore the charset attribute, decode with something other than UTF-8, and end up with bad data.

I'm not sure whether there are any specific MIME types that fall into that camp. If we ever encounter one, we might need a solution with special handling for specific MIME types. I'd prefer to wait on that, though, until we know we have a problem.

KevinCorcoran commented 9 years ago

@camlow325 I think that issue is valid, although I don't know of any MIME types that fall into the bucket you described. However, the patches in this PR don't change the library's behavior in that case; they don't make things any worse than they may already be. So I'd prefer to treat that as a separate issue and not hold up this PR for it.

camlow325 commented 9 years ago

@KevinCorcoran Yep, I agree, and I wouldn't hold up the PR for that case anyway.

cprice404 commented 9 years ago

Would be nice to capture the meat of the above explanation in the git commit messages and/or the JIRA tickets. It'll be easier to find there in the future than if it's only in the PR commentary.

I'm +1 on this whenever @camlow325 is.

camlow325 commented 9 years ago

:+1: