urllib3 / urllib3

urllib3 is a user-friendly HTTP client library for Python
https://urllib3.readthedocs.io
MIT License
3.77k stars 1.14k forks source link

UnicodeDecodeError in format_header_params #793

Closed jonathan-s closed 8 years ago

jonathan-s commented 8 years ago

This issue was discussed here: https://github.com/kennethreitz/requests/issues/2639 and it seemed like the consensus was that this should be fixed in urllib3.

sigmavirus24 commented 8 years ago

@jonathan-s did you test that this is still a problem?

jonathan-s commented 8 years ago

Yup, it still occurrs.

haikuginger commented 8 years ago

I'd be glad to handle this - relevant code:

https://github.com/shazow/urllib3/blob/571fd737863fa0c60df24bce1fe4972719fd7ed2/urllib3/fields.py#L22-L47

Recapping the use case: this is generally coming into play during multipart/form-data uploads of files with Unicode names when using Python 2.

This isn't coming into play in Python 3; the flow there is that we're only encountering a caught UnicodeEncodeError at line 39, and no UnicodeDecodeError; this is because Python 3 strings are native unicode and only the encode operation is necessary. From there, we're skipping over lines 41 through 44, and just passing the Unicode string to encode_rfc_2231(), which builds RFC-2231-compliant ASCII strings from Unicode.

In comparison, at line 38 in Python 2, we encounter a UnicodeDecodeError. It seems like str.encode() is trying to load the bytes as an ASCII string before dumping the string to ASCII-encoded bytes. If it weren't for this, we'd also encounter a UnicodeEncodeError. If we pass both error types, then we'll encounter another UnicodeDecodeError at line 44, because we'll be again attempting the str.encode operation, inclusive of trying to decode the bytes as ASCII (not possible).

How I want to handle this:

First, I want to catch both error types at line 39.

Second, I don't actually think lines 43 and 44 are necessary - as US-ASCII is a strict subset of UTF-8, and str == bytes in Python 2, the operation isn't actually doing anything. This removes the possibility for the second UnicodeDecodeError, and lets the string pass through encode_rfc_2231(), as is the current behavior in Python 3 (treat the bytes as UTF-8, and escape them appropriately for transmission over ASCII).

Thoughts?

EDIT: Neglected to realize that some form of lines 43 and 44 are necessary to handle Python2 Unicode objects.

shazow commented 8 years ago

@haikuginger Would you be up for making a quick PR for feedback? Visualizing all the changes through written word is painful. :)