psf / requests

A simple, yet elegant, HTTP library.
https://requests.readthedocs.io/en/latest/
Apache License 2.0
52.02k stars 9.3k forks source link

Multipart files unicode filename #2117

Closed homm closed 10 years ago

homm commented 10 years ago

Starting from 2.0 requests does not send filename attribute of Content-Disposition header for multipart files with unicode names. Instead of this attribute with name filename* is sent.

# requests 1.2.3
>>> requests.post('http://ya.ru', files={'file': (u'файл', '123')}).request.body
'--db7a9522a6344e26a4ca2933aecad887\r\nContent-Disposition: form-data; name="file"; filename="\xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb"\r\nContent-Type: application/octet-stream\r\n\r\n123\r\n--db7a9522a6344e26a4ca2933aecad887--\r\n'

# requests 2.0
>>> requests.post('http://ya.ru', files={'file': (u'файл', '123')}).request.body
'--a9f0de2871da46df86140bc5b72fc722\r\nContent-Disposition: form-data; name="file"; filename*=utf-8\'\'%D1%84%D0%B0%D0%B9%D0%BB\r\n\r\n123\r\n--a9f0de2871da46df86140bc5b72fc722--\r\n'

And this is a big problem, because looks like some systems does not recognize such fields as files. At least we encountered a problem with Django. Django places entire file's content in request.POST instead of request.FILES. It is clear from sources: https://github.com/django/django/blob/1.7c1/django/http/multipartparser.py#L599-L601

sigmavirus24 commented 10 years ago

This is actually a bug in Django then. Using this syntax is what we're supposed to be using. We have to indicate to the server that we're sending a field whose content is not ASCII or Latin-1 (ISO-8859-1). The proper way to do so is the syntax you see there. It is defined in Section 3.2.1 of RFC 5987. This bug should be filed against Django instead for not conforming to the proper handling of that value. Edit Note specifically that parameter is redefined as reg-parameter or ext-parameter where ext-parameter is defined as parmname (e.g., filename) concatenated with a * character followed by = and the ext-value. This confirms that this is the proper handling of those header values. I believe this RFC also is applied to MIME header values which are what you're using in your multipart/form-data upload.

homm commented 10 years ago

I test php 5.5.9 server, and it also does not understand form-data with filename* attribute as files. I think requests should be close to real world then to rfc, especially when compromise is possible. How about both attributes?

Lukasa commented 10 years ago

It's unclear what the compromise should be. What text encoding should we use?

homm commented 10 years ago

Compromise is using both attributes. Encoding is not so important. It can be even filename=blah.bin, this will be enough for server to recognize files.

Lukasa commented 10 years ago

Filename is hugely important. If we just make random choices, I guarantee that people will simply raise a new bug report claiming that we mangled their filenames.

If you don't like the way we do it, encode the filename yourself. =)

homm commented 10 years ago

Feel the difference: "I can't even upload file to majority of servers" and "server does not understand file name".

There are two cases:

  1. server understand unicode. It will use filename* property. No problems.
  2. server does not understand unicode. It will not be able to receive files OR It will not be able to recognize name. Wait, it already does not understand unicode, it will not be able to recognize name in any case.
homm commented 10 years ago

By the way, before 2.0 requests use some encoding, and as I know everything works well.

Lukasa commented 10 years ago

Resist the temptation to say "Understands unicode". That phrase is meaningless when it comes to networked connections. The server needs to understand the specific text encoding we've chosen, and there is no consistency here. We will get this wrong, and what happens in this situation is totally unclear. Some servers will decode the text and get lucky, and read the filename as something totally obtuse and unclear, and that would be terrible.

I would rather fail clearly and allow the user to make the choice than to guess and get it wrong. This is not PHP.

sigmavirus24 commented 10 years ago

Compromise is using both attributes

How would using both attributes work? If you read the grammar I linked to, it updates the meaning to say that you can only send one, i.e., either filename= or filename*=. If you send both, you're sending the wrong thing.

By the way, before 2.0 requests use some encoding, and as I know everything works well.

It worked purely by chance that those characters could be encoded to bytes. We were just sending whatever the bytestring was (if you look at the string it is what you would get if you took your original string and called encode('<encoding>'). You get the literal bytes and that's what we sent. You could potentially still encode it yourself and then pass that as the filename. You really should only be giving us bytes anyway.

Seeing as this is not a bug, and there is no good reason for us to change how we handle what is given us by the user, I'm closing this.

homm commented 10 years ago

How would using both attributes work? If you read the grammar I linked to, it updates the meaning to say that you can only send one. If you send both, you're sending the wrong thing.

This is totally wrong.

This specification suggests that a
parameter using the extended syntax takes precedence.  This would
allow producers to use both formats without breaking recipients that
do not understand the extended syntax yet.
sigmavirus24 commented 10 years ago

Interesting. I had forgotten that we could send both. Regardless, we can't provide an ASCII representation for you because we can't guess at how to generate ASCII from what you give us. You would still have to generate the value yourself because there's no sane way to provide that functionality via an API.

homm commented 10 years ago

When I try to encode filename by myself as requests 1.2.3 does, I've got a error:

>>> requests.post('http://ubuntu:8000/', files={'file': (u'файл.png'.encode('utf-8'), '123')})
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 10: ordinal not in range(128)

I can't use just ascii representation for filename, because there is no way to send both filename with ascii and encoded filename*.

Lukasa commented 10 years ago

There's so much going on in that line of code I don't know where the breakage is. Can you split that up into multiple lines so I can see which action causes that traceback?

homm commented 10 years ago
>>> filename = u'файл.png'.encode('utf-8')
>>> requests.post('http://ubuntu:8000/', files={
  'file': (filename, '123')
})
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-5-42737369472a> in <module>()
----> 1 import requests; print requests.post('http://ubuntu:8000/', files={'file': (u'файл.png'.encode('utf-8'), '123')})

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/api.pyc in post(url, data, **kwargs)
     86     """
     87 
---> 88     return request('post', url, data=data, **kwargs)
     89 
     90 

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/api.pyc in request(method, url, **kwargs)
     42 
     43     session = sessions.Session()
---> 44     return session.request(method=method, url=url, **kwargs)
     45 
     46 

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert)
    420             hooks = hooks,
    421         )
--> 422         prep = self.prepare_request(req)
    423 
    424         proxies = proxies or {}

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/sessions.pyc in prepare_request(self, request)
    358             auth=merge_setting(auth, self.auth),
    359             cookies=merged_cookies,
--> 360             hooks=merge_hooks(request.hooks, self.hooks),
    361         )
    362         return p

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/models.pyc in prepare(self, method, url, headers, files, data, params, auth, cookies, hooks)
    295         self.prepare_headers(headers)
    296         self.prepare_cookies(cookies)
--> 297         self.prepare_body(data, files)
    298         self.prepare_auth(auth, url)
    299         # Note that prepare_auth must be last to enable authentication schemes

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/models.pyc in prepare_body(self, data, files)
    430             # Multi-part file uploads.
    431             if files:
--> 432                 (body, content_type) = self._encode_files(files, data)
    433             else:
    434                 if data:

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/models.pyc in _encode_files(files, data)
    147             rf = RequestField(name=k, data=fp.read(),
    148                               filename=fn, headers=fh)
--> 149             rf.make_multipart(content_type=ft)
    150             new_fields.append(rf)
    151 

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.pyc in make_multipart(self, content_disposition, content_type, content_location)
    173         """
    174         self.headers['Content-Disposition'] = content_disposition or 'form-data'
--> 175         self.headers['Content-Disposition'] += '; '.join(['', self._render_parts((('name', self._name), ('filename', self._filename)))])
    176         self.headers['Content-Type'] = content_type
    177         self.headers['Content-Location'] = content_location

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.pyc in _render_parts(self, header_parts)
    136         for name, value in iterable:
    137             if value:
--> 138                 parts.append(self._render_part(name, value))
    139 
    140         return '; '.join(parts)

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.pyc in _render_part(self, name, value)
    116             The value of the parameter, provided as a unicode string.
    117         """
--> 118         return format_header_param(name, value)
    119 
    120     def _render_parts(self, header_parts):

/home/homm/env/env_ucare/local/lib/python2.7/site-packages/requests/packages/urllib3/fields.pyc in format_header_param(name, value)
     41         result = '%s="%s"' % (name, value)
     42         try:
---> 43             result.encode('ascii')
     44         except UnicodeEncodeError:
     45             pass

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 10: ordinal not in range(128)
Lukasa commented 10 years ago

Hmm, that really looks like that should catch UnicodeDecodeErrors as well, to allow for exactly this case. Or, even more accurately, should check the type of the string, and not bother encoding bytestrings.

Lukasa commented 10 years ago

Recommend we open a bug report on urllib3 if @shazow agrees.

homm commented 10 years ago

I have checked Tornado and Werkzeug, no one of them don't know about filename*. Do you know any web server who reads rfc in same way as you?

homm commented 10 years ago

Rails as well.

Lukasa commented 10 years ago

I know that all the major browser user agents do, see here.

sigmavirus24 commented 10 years ago

If you're doubtful whether that RFC applies to it, the same grammar is used in Section 4 of RFC 2231 which is more directly related to MIME headers and encoding non-ASCII values in them.

sigmavirus24 commented 10 years ago

There's also a related discussion on the Django ticketing system that shows the team knows this is something they have to do but hasn't acted on (in over a year).

sigmavirus24 commented 10 years ago

And for the record, it isn't just a rails issue, it is a rack problem (which does not have a corresponding issue, yet).

homm commented 10 years ago

Until issues there are in majority of web servers, the issue is in requests.

Lukasa commented 10 years ago

@homm What is your proposal for requests? The best option I've seen so far is to send both filename and filename* in that order. If that happens, what should we put in filename?

sigmavirus24 commented 10 years ago

@homm you're inherently wrong. Rack claims to support RFC 2231, this is clearly a failure to properly do so given how clear 2231 (and 5987) are in their grammar. You claim earlier versions of requests do not cause the problem, the solution to your problem then is to clearly use old versions since you seem to think broken behaviour is correct.

Lukasa commented 10 years ago

@sigmavirus24 I am open to sending both filename and filename*, this should be supported, but I don't know how we'd populate filename.

homm commented 10 years ago

@Lukasa 1) whatever is enough for file recognition for not compliant server and does not brake others (because "specification suggests that a parameter using the extended syntax takes precedence"). 2) it can be value without any non-ascii chars and "unknown" as fallback. 3) should be a way to send encoded filename if I exactly know how server handles encoded values.

homm commented 10 years ago

@sigmavirus24 I can't use old requests version because I need SNI on python 2 :(

Lukasa commented 10 years ago

The problem is that 1) and 2) are wildly ambiguous. Suppose you provide the string u"使用的是". What should our fallback name be? How can we determine it? What do we do when a user wants to provide that fallback name themselves?

sigmavirus24 commented 10 years ago

@Lukasa ostensibly through the toolbelt since that's the only place where users have complete control over the fields but even so, this would probably need to be worked on in urllib3 because I'm fairly certain it doesn't currently provide a way to do this.

Further RFC 2231 is the authority on this since 5987 (as I've re-read it) is for HTTP Headers (not MIME/multipart headers) and 2231 does not allow for multiple parameters (on quick glance over it)


@homm

2) it can be value without any non-ascii chars and "unknown" as fallback.

This is not an accurate representation in the slightest. At best it will introduce utter confusion.

3) should be a way to send encoded filename if I exactly know how server handles encoded values.

You can never know exactly how any server handles these values unless you've written it from scratch yourself.

Lukasa commented 10 years ago

Let's be clear on our thinking here @homm. What we need is something that satisfies criteria 1) and 2) in an unambiguous way while providing a pleasant API. I contend that no such solution exists, because criteria 1) and 2) are ambiguous and resolving that ambiguity dirties up the API.

Requests does not need to be able to generate every valid HTTP request under the sun, only a subset of them. We've decided how we're approaching this, and we're backed up by the specs. We've also provided you with a workaround. That workaround reveals a bug in urllib3 that we're going to pursue fixing.

Unless you can design a pleasant, non-ambiguous API, we're not going to be too worried about this.

homm commented 10 years ago

You can never know exactly how any server handles these values unless you've written it from scratch yourself.

I know how exactly servers work. Almost all of them exactly does not understand parameters placed in files argument of requests.post function.