psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0
52.02k stars 9.3k forks source link

Traceback when filename contains non-ascii characters #2639

Closed zhaoguixu closed 8 years ago

zhaoguixu commented 9 years ago

A simple test case:

request = requests.Request(method='GET', url='https://httpbin.org/get')
request.files = {'f': ('føø', '\xff')}
prepared = request.prepare()
requests.Session().send(prepared)
    prepared = request.prepare()
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 249, in prepare
    hooks=self.hooks,
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 296, in prepare
    self.prepare_body(data, files)
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 434, in prepare_body
    (body, content_type) = self._encode_files(files, data)
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 148, in _encode_files
    rf.make_multipart(content_type=ft)
  File "/usr/lib/python2.7/dist-packages/urllib3/fields.py", line 176, in make_multipart
    self.headers['Content-Disposition'] += '; '.join(['', self._render_parts((('name', self._name), ('filename', self._filename)))])
  File "/usr/lib/python2.7/dist-packages/urllib3/fields.py", line 139, in _render_parts
    parts.append(self._render_part(name, value))
  File "/usr/lib/python2.7/dist-packages/urllib3/fields.py", line 119, in _render_part
    return format_header_param(name, value)
  File "/usr/lib/python2.7/dist-packages/urllib3/fields.py", line 44, in format_header_param
    result.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

The terrible unicode feather of python2.x will implicitly cast the byte-string to unicode. results contains non-ascii characters which will be casted to unicode first before calling encode method. However the catch in format_header_param UnicodeEnocdeError is not enough. The api specification in this function says value is a unicode string. Actually its a so low-level function. Requests seems ignore this and allow user-decided non-ascii builtin-string flows in, resulting this traceback.

def format_header_param(name, value):
    """
    Helper function to format and quote a single header parameter.

    Particularly useful for header parameters which might contain
    non-ASCII values, like file names. This follows RFC 2231, as
    suggested by RFC 2388 Section 4.4.

    :param name:
        The name of the parameter, a string expected to be ASCII only.
    :param value:
        The value of the parameter, provided as a unicode string.
    """
    if not any(ch in value for ch in '"\\\r\n'):
        result = '%s="%s"' % (name, value)
        try:
            print result
            result.encode('ascii')
        except UnicodeEncodeError:
            pass
        else:
            return result
    if not six.PY3:  # Python 2:
        value = value.encode('utf-8')
    value = email.utils.encode_rfc2231(value, 'utf-8')
    value = '%s*=%s' % (name, value)
    return value

Thanks,

Lukasa commented 9 years ago

Hmm, this one is really tricky.

I think the best fix here is actually in urllib3: we should catch UnicodeDecodeError. However, if we catch it, we should return result. The rationale is that the user has already provided us with a bytestring, so they presumably think they know what they're doing. @shazow, thoughts?

shazow commented 9 years ago

Does this work if the filename is properly encoded in unicode to begin with? If there is no pre-formatted input that works, then yes that should be fixed in urllib3.

zhaoguixu commented 9 years ago

Yes, it works when request.files = {'f': (u'føø', u'\xff')}

shazow commented 9 years ago

Hmmm, so my general rule for urllib3, as a library-level module, is garbage-in-garbage-out. I try to avoid converting between bytes/unicode whenever it's direct input, though sometimes it's necessary if it's indirect. I can be convinced otherwise though.

Lukasa commented 9 years ago

I try to avoid converting between bytes/unicode whenever it's direct input

The problem is that's not true of this function. This function explicitly does exactly that. My proposal is that when it can't, it should just let things keep going.

shazow commented 9 years ago

Oh woops didn't even realize format_header_param was in urllib3, thought it was a requests thing. In that case, yesh +1.

sigmavirus24 commented 8 years ago

Will be fixed when we next pull in a new version of urllib3.