r8-forks / webapp-improved

Automatically exported from code.google.com/p/webapp-improved

webapp2 should not set a default charset='utf-8' when creating Request objects #23

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
The charset-defaulting code in webapp2.Request.__init__ (see the patch below) 
produces erroneous results: it forces a '; charset="utf-8"' onto every HTTP 
request content type that doesn't include a charset.  As a result, 
Request.headers['content-type'] and 
Request.environment.headers['content-type'] no longer accurately reflect the 
HTTP request received by the server.  It also breaks POST requests with a 
0 byte body, which gain a completely invalid content-type of 
'; charset="utf-8"'; curiously, the behaviour differs for a non-zero body 
length.
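
A minimal sketch of the symptom, assuming the default charset ends up appended to whatever content type arrived. The regex and the helper name `force_default_charset` are hypothetical, chosen for illustration; they are not webapp2's or WebOb's actual code:

```python
import re

# Hypothetical pattern for illustration; not necessarily webapp2's own _charset_re.
CHARSET_RE = re.compile(r';\s*charset=([^;]*)', re.I)


def force_default_charset(content_type, default='utf-8'):
    """Mimic the reported behaviour: append a charset when none is present."""
    if CHARSET_RE.search(content_type):
        # The sender already declared a charset, so the header is left alone.
        return content_type
    # No charset declared: the default is tacked on, even onto an empty value.
    return '%s; charset="%s"' % (content_type, default)


for received in ('application/json', 'text/plain; charset=iso-8859-1', ''):
    print(repr(received), '->', repr(force_default_charset(received)))
```

The last case corresponds to the empty-body POST described above: an absent content type turns into the bare parameter '; charset="utf-8"'.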

The bottom line for me is that webapp2 should not do anything that breaks 
the fidelity of the received headers; how to handle a missing charset is for 
the users of webapp2 to decide, and a default of utf-8 is incorrect.  Note in 
particular http://www.ietf.org/rfc/rfc2616.txt section 3.4.1 and 
http://www.ietf.org/rfc/rfc2854.txt section 6:

----
3.4.1 Missing Charset

   Some HTTP/1.0 software has interpreted a Content-Type header without
   charset parameter incorrectly to mean "recipient should guess."
   Senders wishing to defeat this behavior MAY include a charset
   parameter even when the charset is ISO-8859-1 and SHOULD do so when
   it is known that it will not confuse the recipient.

   Unfortunately, some older HTTP/1.0 clients did not deal properly with
   an explicit charset parameter. HTTP/1.1 recipients MUST respect the
   charset label provided by the sender; and those user agents that have
   a provision to "guess" a charset MUST use the charset from the
   content-type field if they support that charset, rather than the
   recipient's preference, when initially displaying a document. See
   section 3.7.1.
----
6. Charset default rules

   The use of an explicit charset parameter is strongly recommended.
   While [MIME] specifies "The default character set, which must be
   assumed in the absence of a charset parameter, is US-ASCII."  [HTTP]
   Section 3.7.1, defines that "media subtypes of the 'text' type are
   defined to have a default charset value of 'ISO-8859-1'".  Section
   19.3 of [HTTP] gives additional guidelines.  Using an explicit
   charset parameter will help avoid confusion.

   Using an explicit charset parameter also takes into account that the
   overwhelming majority of deployed browsers are set to use something
   else than 'ISO-8859-1' as the default; the actual default is either a
   corporate character encoding or character encodings widely deployed
   in a certain national or regional community. For further
   considerations, please also see Section 5.2 of [HTML40].
----
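
To illustrate what "letting the user decide" might look like, here is a sketch of an application-side helper that leaves the header untouched and only computes a charset for decoding. The name `effective_charset` and its `fallback_for_text` parameter are hypothetical; the ISO-8859-1 fallback for text/* follows the RFC 2616 section 3.7.1 wording quoted above, and an application could just as well choose something else:

```python
def effective_charset(content_type, fallback_for_text='iso-8859-1'):
    """Return the declared charset, else a fallback for text/*, else None.

    The received header value is only read, never modified.
    """
    media_type, _, params = content_type.partition(';')
    for param in params.split(';'):
        name, _, value = param.partition('=')
        if name.strip().lower() == 'charset':
            # An explicit charset from the sender always wins.
            return value.strip().strip('"').lower()
    if media_type.strip().lower().startswith('text/'):
        # Assumption: the caller accepts the HTTP/1.1 default for text types.
        return fallback_for_text
    # Non-text types without a charset: nothing is assumed.
    return None


print(effective_charset('text/html'))
print(effective_charset('text/html; charset="UTF-8"'))
print(effective_charset('application/octet-stream'))
```

Because the fallback is a parameter, the choice stays with the application rather than being baked into the framework.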

The following patch fixes the problem for me, but isn't extensively tested:

diff -r a2bdc641668c webapp2.py
--- a/webapp2.py    Fri Aug 26 14:32:36 2011 -0300
+++ b/webapp2.py    Thu Oct 13 15:27:04 2011 +0100
@@ -120,10 +120,7 @@
             match = _charset_re.search(environ.get('CONTENT_TYPE', ''))
             if match:
                 charset = match.group(1).lower().strip().strip('"').strip()
-            else:
-                charset = 'utf-8'
-
-            kwargs['charset'] = charset
+                kwargs['charset'] = charset

         kwargs.setdefault('unicode_errors', 'ignore')
         kwargs.setdefault('decode_param_names', True)
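
For readers who want to try the patched logic in isolation, the following is a standalone sketch of it. The function name `build_request_kwargs`, the regex, and the enclosing `'charset' not in kwargs` check are assumptions made for illustration; only the lines visible in the diff above are taken from webapp2:

```python
import re

# Hypothetical stand-in for webapp2's _charset_re.
CHARSET_RE = re.compile(r';\s*charset=([^;]*)', re.I)


def build_request_kwargs(environ, **kwargs):
    """Set 'charset' only when the request's Content-Type declares one."""
    if 'charset' not in kwargs:  # assumed guard; not visible in the diff
        match = CHARSET_RE.search(environ.get('CONTENT_TYPE', ''))
        if match:
            charset = match.group(1).lower().strip().strip('"').strip()
            kwargs['charset'] = charset
        # No else branch: without a declared charset, nothing is forced.
    kwargs.setdefault('unicode_errors', 'ignore')
    kwargs.setdefault('decode_param_names', True)
    return kwargs


print(build_request_kwargs({'CONTENT_TYPE': 'text/plain; charset="UTF-8"'}))
print(build_request_kwargs({'CONTENT_TYPE': 'application/json'}))
print(build_request_kwargs({}))
```

With this shape, a request that arrives without a charset produces no 'charset' key at all, so nothing downstream has a reason to rewrite the header.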

Original issue reported on code.google.com by paul.j.m...@googlemail.com on 13 Oct 2011 at 2:36

GoogleCodeExporter commented 9 years ago
This issue was closed by revision d36c461b86ba.

Original comment by rodrigo.moraes on 31 Jan 2012 at 6:18