psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0

Can't use post to upload a file with Chinese characters in its name. #2313

Closed sbarba closed 9 years ago

sbarba commented 10 years ago

This code:

requests.post(url, files={"file": open(u"漢字.o8d", "r")})

will return a 200, but the file is never uploaded.

I can upload that file by posting in the browser so this doesn't seem to be a server-side issue. Also, if I change the name of the file to "bob" or something ASCII it works perfectly.

Lukasa commented 10 years ago

Are you sure?

$ echo "file file file.\n" >> 漢字.o8d
$ ls
漢字.o8d
>>> import requests
>>> r = requests.post('http://httpbin.org/post', files={'file': open(u'漢字.o8d', 'r')})
>>> print r.content
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "file": "file file file.\n"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connect-Time": "2", 
    "Connection": "close", 
    "Content-Length": "180", 
    "Content-Type": "multipart/form-data; boundary=3491ae0e5b6d465aaebb7bd63c9c750c", 
    "Host": "httpbin.org", 
    "Total-Route-Time": "0", 
    "User-Agent": "python-requests/2.4.0 CPython/2.7.8 Darwin/14.0.0", 
    "Via": "1.1 vegur", 
    "X-Request-Id": "f05915c9-279e-4187-8425-f0b06fc64ea2"
  }, 
  "json": null, 
  "origin": "77.99.146.203", 
  "url": "http://httpbin.org/post"
}

Seems like httpbin doesn't have a problem. Can you confirm what version of requests you're using?

Lukasa commented 10 years ago

Oh hang on. Interestingly, httpbin sees it as a form field, not a file object. Hmm.

Lukasa commented 10 years ago

Oh, yes, I remember now.

POSTing files with unicode filenames is awkward, because you didn't say what text encoding you want us to use. There's a spec for this, which we implement, but relatively few others do it and many servers don't understand it.
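For readers unfamiliar with the spec mentioned here (identified later in this thread as RFC 2231), the following is a minimal Python 3 sketch of what the encoded parameter value looks like for this filename and how a receiver could reverse it. It uses only the standard library. The helper name decode_ext_value is hypothetical, and none of this is requests' own code.

```python
import email.utils
import urllib.parse

filename = "漢字.o8d"

# Sender side: charset, an empty language tag, then the percent-encoded
# UTF-8 bytes of the filename.
encoded = email.utils.encode_rfc2231(filename, "utf-8")
print(encoded)  # utf-8''%E6%BC%A2%E5%AD%97.o8d


def decode_ext_value(ext_value):
    """Hypothetical helper: reverse the encoding above on the receiving side."""
    # decode_rfc2231 only splits on the two apostrophes; it does not
    # percent-decode, so that step is done explicitly here.
    charset, _language, pct = email.utils.decode_rfc2231(ext_value)
    return urllib.parse.unquote(pct, encoding=charset or "ascii")


print(decode_ext_value(encoded))
```

A server that does not understand this form sees no plain filename parameter at all, which would be consistent with httpbin treating the part as a form field.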

My suggested workaround would be to set the filename yourself using whatever encoding you choose. Unfortunately, that doesn't work:

Traceback (most recent call last):
  File "testy.py", line 4, in <module>
    r = requests.post('http://httpbin.org/post', files={'file': (u'漢字.o8d'.encode('utf-8'), open(u'漢字.o8d', 'r'))})
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 88, in post
    return request('post', url, data=data, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 434, in request
    prep = self.prepare_request(req)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 372, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/usr/local/lib/python2.7/site-packages/requests/models.py", line 299, in prepare
    self.prepare_body(data, files)
  File "/usr/local/lib/python2.7/site-packages/requests/models.py", line 434, in prepare_body
    (body, content_type) = self._encode_files(files, data)
  File "/usr/local/lib/python2.7/site-packages/requests/models.py", line 151, in _encode_files
    rf.make_multipart(content_type=ft)
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.py", line 173, in make_multipart
    (('name', self._name), ('filename', self._filename))
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.py", line 133, in _render_parts
    parts.append(self._render_part(name, value))
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.py", line 113, in _render_part
    return format_header_param(name, value)
  File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.py", line 37, in format_header_param
    result.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 10: ordinal not in range(128)

The problem seems to be the result.encode('ascii') line at the bottom of the traceback. On Python 2, this unconditional call to encode on a byte string triggers an implicit str.decode first, which fails for non-ASCII characters. @shazow, are you prepared to consider that a bug?

sigmavirus24 commented 10 years ago

Django now supports this and was appreciative of the bug report. The fact that httpbin doesn't parse this correctly is a flask/werkzeug bug I think.

sbarba commented 10 years ago

Just discovered that 漢字 is Japanese Kanji and means "Chinese Characters". Enjoyed that, but the bug still stands. For now I'm able to automate testing of such filenames with Selenium, but it'd be nice to do it with requests too.

sigmavirus24 commented 10 years ago

Except it's a bug in the server you're trying to upload to for not supporting a 10 year old RFC

kampde commented 9 years ago

Is there any other workaround different than changing the file name or changing the server backend?

sigmavirus24 commented 9 years ago

I think someone percent-encoded the file name because whatever server they were communicating with understood that. That behaviour is not defined anywhere, though, so it depends on the server you're using doing something incredibly bad and horribly wrong.

sigmavirus24 commented 9 years ago

And @kampde, thanks for searching for prior issues and for not opening a new issue.

kampde commented 9 years ago

The aforementioned RFC is RFC 5987, right?

sigmavirus24 commented 9 years ago

I don't believe so, no. That's for HTTP headers, not for MIME headers.

kampde commented 9 years ago

Looks like RFC 2231 then.

sigmavirus24 commented 9 years ago

@kampde after a quick skim, that is the correct RFC. As you can see it is 18 years old.

zhangchunlin commented 9 years ago

I think in https://github.com/kennethreitz/requests/blob/master/requests/packages/urllib3/fields.py#L37

        try:
            result.encode('ascii')
        except UnicodeEncodeError:
            pass
        else:
            return result

Changing this to "result.encode('utf8')" would be better, because most servers can handle UTF-8, but many of them do not support the "email.utils.encode_rfc2231(value, 'utf-8')" style.
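To make the two options under discussion concrete, here is a rough Python 3 sketch of the decision: quote the value if it is pure ASCII, otherwise fall back to the RFC 2231 form, compared with always sending raw UTF-8 bytes as proposed. Both function names are hypothetical and this is not a copy of urllib3's code; the fallback branch simply follows the email.utils.encode_rfc2231(value, 'utf-8') call mentioned in the comment.

```python
import email.utils


def render_param_rfc2231(name, value):
    """Sketch of the existing approach: ASCII fast path, RFC 2231 otherwise."""
    result = '%s="%s"' % (name, value)
    try:
        result.encode("ascii")
    except UnicodeEncodeError:
        pass
    else:
        return result
    # Non-ASCII value: emit name*=charset''percent-encoded-bytes.
    return "%s*=%s" % (name, email.utils.encode_rfc2231(value, "utf-8"))


def render_param_raw_utf8(name, value):
    """Sketch of the proposed alternative: always quote, send UTF-8 bytes."""
    return ('%s="%s"' % (name, value)).encode("utf-8")


print(render_param_rfc2231("filename", "bob"))
print(render_param_rfc2231("filename", "漢字.o8d"))
print(render_param_raw_utf8("filename", "漢字.o8d"))
```

The first form stays pure ASCII on the wire, while the second puts the raw UTF-8 bytes of the filename inside the quoted value.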

Lukasa commented 9 years ago

@zhangchunlin What does 'most servers' mean? Which servers? Which versions of those servers? Why don't they implement RFC 2231?

sigmavirus24 commented 9 years ago

@zhangchunlin if those servers do not implement a standard that is 18 years old, I fail to see why we should be forced to violate the standard.

zhangchunlin commented 9 years ago

@Lukasa OK, I didn't test very much, so my statement may be wrong. I just found that the behavior of requests wasn't the same as a browser's (for example Chrome), and I assumed the method Chrome uses is workable.

@sigmavirus24 I will try to clarify this and submit issues to those servers if needed.

WishCow commented 9 years ago

It seems PHP is also affected by this. If you try to upload a file named 'fårikål.txt' to a server running PHP, it throws a warning: "PHP Warning: File Upload Mime headers garbled in Unknown on line 0".

This is PHP 5.6.14.

sigmavirus24 commented 9 years ago

@WishCow I'm not certain what result you expect to see if you're filing a PHP bug against another project. It seems frameworks in Perl, Ruby, and Python all appropriately support RFC 2231. If PHP 5.6.14 doesn't support an 18 year old standard, you should file a bug with PHP.

WishCow commented 9 years ago

Just leaving a note here in case other people encounter this issue. It took me a long time to find the cause.

sigmavirus24 commented 9 years ago

@WishCow you'll probably have a better time putting together some minimal bit of PHP code and filing a bug with PHP. This comment will help others, but filing a bug to get this fixed in PHP would help a lot more people.

WishCow commented 9 years ago

Actually I was about to do that, and I whipped up a quick example of the upload with curl, but that seems to work. Now I'm confused, is there another RFC that describes how filenames should be handled, that curl (and PHP) might be implementing?

So this:

curl -v -F får.txt=@/tmp/test.txt http://myserver.local

Does produce the correct output from the handling PHP script.

sigmavirus24 commented 9 years ago

Run netcat locally and send the curl request to that.

Curl might be violating the RFC because support for the spec has lagged behind.
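For anyone without netcat to hand, a rough stand-in can be sketched with Python's standard socket module. The function names and the idle timeout are assumptions made for illustration. The sketch only records the raw bytes and never sends an HTTP response, so the client may report an error once it has finished uploading.

```python
import socket


def open_listener(port=0):
    """Listen on localhost; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(1)
    return sock


def capture_one_request(server_sock, idle_timeout=2.0):
    """Accept one connection and return every byte it sends.

    Reading stops when the client closes the connection or stays
    silent for idle_timeout seconds (an arbitrary choice).
    """
    conn, _addr = server_sock.accept()
    chunks = []
    with conn:
        conn.settimeout(idle_timeout)
        try:
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            pass
    return b"".join(chunks)


# Example use, left commented out because accept() blocks until a client connects:
# listener = open_listener(14511)
# print(capture_one_request(listener).decode("utf-8", "replace"))
```

Pointing the client at the chosen port then shows exactly which Content-Disposition form it puts on the wire.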

WishCow commented 9 years ago

The command

curl -F får='@/tmp/test.txt;filename=får.txt' localhost:14511

Results in the netcat output:

POST / HTTP/1.1
Host: localhost:14511
User-Agent: curl/7.45.0
Accept: */*
Content-Length: 198
Expect: 100-continue
Content-Type: multipart/form-data; boundary=------------------------fb94c2e958ada9f0

--------------------------fb94c2e958ada9f0
Content-Disposition: form-data; name="får"; filename="får.txt"
Content-Type: text/plain

hello world

--------------------------fb94c2e958ada9f0--

So curl indeed does not seem to use the *= format that the RFC is describing.

sigmavirus24 commented 9 years ago

Yeah, so you can use httpie to produce a cURL-like command that will probably trigger this for you.

sigmavirus24 commented 9 years ago

You could also write some PHP that uses RFC 2231.

WishCow commented 9 years ago

The SO post describes how to send files with the correct encoding, but I need to receive files, for which there doesn't seem to be a way, since the $_FILES superglobal gets populated before the userland script runs.

Thanks for the help though, in case someone else wants to track this in PHP: https://bugs.php.net/bug.php?id=70794

sigmavirus24 commented 9 years ago

@WishCow right, that's what I meant (instead of using curl use PHP).

Robbt commented 5 years ago

I ran into this issue with a PHP server running Zend 1. The solution I came up with was to import urllib and encode the filename, like so: files = {'file': (urllib.pathname2url(event.pathname), 'rb')}. It solved the problem for me. Just adding this in case it helps someone else who runs into this.

Robbt commented 5 years ago

That fix proved to introduce new problems because it changed the filenames in weird ways. Instead, I'm working on getting a urllib3 PR reopened that would use HTML5 encoding rather than RFC 2231 by default. Hopefully this will allow the problem to be fixed for requests as well. I rewrote my request to use a patched version of urllib3 based on the currently closed PR, and it worked.