python / cpython

The Python programming language
https://www.python.org
Other
63.33k stars 30.32k forks source link

cgi module cannot handle POST with multipart/form-data in 3.x #49203

Closed 5701fa76-724e-4c46-b8f8-fdb0d85bd9b6 closed 13 years ago

5701fa76-724e-4c46-b8f8-fdb0d85bd9b6 commented 15 years ago
BPO 4953
Nosy @warsaw, @birkenfeld, @amauryfa, @vstinner, @merwok, @bitdancer, @florentx, @PierreQuentel
Files
  • full_source_and_error.zip: full source and error show
  • opsuper.pl: Perl code for process post files from form
  • unittest.zip: cgi.FieldStorage unittest for Multipart form-data and associated files.
  • http.zip: Module cgi_new.py and tests
  • cgitest-python3.py
  • cgi_tests.zip: cgi_test.py and associated files
  • adder.html: uses get, works
  • adderpost.html: just hanged get to post, makes adder.cgi hang
  • localCGIServer.py: my wrapper around http.server to handle localhost
  • adder.cgi: action script used by adder.html. adderpost.html, hangs with adderpost.html
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['type-bug', 'library', 'expert-unicode'] title = 'cgi module cannot handle POST with multipart/form-data in 3.x' updated_at = user = 'https://bugs.python.org/oopos' ``` bugs.python.org fields: ```python activity = actor = 'vstinner' assignee = 'none' closed = True closed_date = closer = 'vstinner' components = ['Library (Lib)', 'Unicode'] creation = creator = 'oopos' dependencies = [] files = ['12753', '12768', '14235', '20217', '20270', '20385', '20393', '20394', '20395', '20396'] hgrepos = [] issue_num = 4953 keywords = ['patch'] message_count = 130.0 messages = ['79892', '79893', '79894', '79896', '79898', '79901', '79939', '79981', '80000', '88960', '88962', '89112', '91444', '91449', '91711', '91741', '95288', '95292', '95330', '107959', '108221', '121864', '125035', '125065', '125086', '125088', '125100', '125105', '125106', '125108', '125114', '125152', '125153', '125158', '125159', '125178', '125181', '125201', '125217', '125235', '125237', '125241', '125402', '125403', '125410', '125419', '125426', '125428', '125474', '125481', '125483', '125501', '125524', '125533', '125543', '125546', '125556', '125557', '125558', '125563', '125570', '125583', '125629', '125633', '125637', '125698', '125839', '125840', '125885', '125886', '125892', '125893', '125900', '125901', '125921', '125926', '125928', '125930', '125931', '125935', '125952', '125992', '125993', '125994', '126035', '126060', '126062', '126065', '126066', '126075', '126117', '126124', '126140', '126145', '126152', '126160', '126161', '126162', '126164', '126165', '126167', '126173', '126175', '126199', '126205', '126207', '126210', '126212', '126214', '126215', '126219', '126222', '126232', '126233', '126242', '126249', '126251', '126253', '126256', '126257', '126262', '126264', '126266', '126267', '126268', '126269', '126289', '126293', '126301', '126307'] nosy_count = 17.0 nosy_names = ['barry', 'georg.brandl', 'amaury.forgeotdarc', 'ggenellina', 'vstinner', 'andyharrington', 'eric.araujo', 'grahamd', 'v+python', 'r.david.murray', 'oopos', 'tcourbon', 'tobias', 'flox', 'pebbe', 'quentel', 'erob'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = 'needs patch' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue4953' versions = ['Python 3.2'] ```

    5701fa76-724e-4c46-b8f8-fdb0d85bd9b6 commented 15 years ago

    Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32 --------------------------------------------- Hi,I user Python!

    UnicodeDecodeError Python 3.0: G:\cgi\python\python.exe Thu Jan 15 16:46:34 2009

    A problem occurred in a Python script. Here is the sequence of function calls leading up to the error, in the order they occurred. G:\webserver\xampp\cgi-bin\testupload.py in () 107 108 # get form 109 opsform = cgi.FieldStorage() 110 111 print ("\<br>","form-data:","\<br>",opsform,"\<br>") opsform undefined, cgi = \<module 'cgi' from 'G:\cgi\python\lib\cgi.py'>, cgi.FieldStorage = \<class 'cgi.FieldStorage'> G:\cgi\python\lib\cgi.py in __init__(self=FieldStorage(None, None, []), fp=None, headers={'content-length': '671631', 'content-type': 'multipart/form-data; boundary=---------------------------9699301019407'}, outerboundary='', environ=\<os._Environ object at 0x00C90C50>, keep_blank_values=0, strict_parsing=0) 477 self.read_urlencoded() 478 elif ctype[:10] == 'multipart/': 479 self.read_multi(environ, keep_blank_values, strict_parsing) 480 else: 481 self.read_single() self = FieldStorage(None, None, []), self.read_multi = \<bound method FieldStorage.read_multi of FieldStorage(None, None, [])>, environ = \<os._Environ object at 0x00C90C50>, keep_blank_values = 0, strict_parsing = 0 G:\cgi\python\lib\cgi.py in read_multi(self=FieldStorage(None, None, []), environ=\<os._Environ object at 0x00C90C50>, keep_blank_values=0, strict_parsing=0) 597 # Create bogus content-type header for proper multipart parsing 598 parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib)) 599 parser.feed(self.fp.read()) 600 full_msg = parser.close() 601 # Get subparts parser = \<email.feedparser.FeedParser object at 0x00DD5650>, parser.feed = \<bound method FeedParser.feed of \<email.feedparser.FeedParser object at 0x00DD5650>>, self = FieldStorage(None, None, []), self.fp = \<io.TextIOWrapper object at 0x00BE3FB0>, self.fp.read = \<bound method TextIOWrapper.read of \<io.TextIOWrapper object at 0x00BE3FB0>> G:\cgi\python\lib\io.py in read(self=\<io.TextIOWrapper object at 0x00BE3FB0>, n=-1) 1722 # Read everything. 1723 result = (self._get_decoded_chars() + 1724 decoder.decode(self.buffer.read(), final=True)) 1725 self._set_decoded_chars('') 1726 self._snapshot = None decoder = \<encodings.gbk.IncrementalDecoder object at 0x00DB7AB0>, decoder.decode = \<built-in method decode of IncrementalDecoder object at 0x00DB7AB0>, self = \<io.TextIOWrapper object at 0x00BE3FB0>, self.buffer = \<io.BufferedReader object at 0x00BE3F90>, self.buffer.read = \<bound method BufferedReader.read of \<io.BufferedReader object at 0x00BE3F90>>, final undefined

    UnicodeDecodeError: 'gbk' codec can't decode bytes in position 157-158: illegal multibyte sequence args = ('gbk', b'-----------------------------9699301019407\r\n...-----------------------------9699301019407--\r\n', 157, 159, 'illegal multibyte sequence') encoding = 'gbk' end = 159 object = b'-----------------------------9699301019407\r\n...-----------------------------9699301019407--\r\n' reason = 'illegal multibyte sequence' start = 157 with_traceback = \<built-in method with_traceback of UnicodeDecodeError object at 0x00DB7BF0>

    l:\tmp_dir\tmpxyeojf.html contains the description of this error.

    ------------------------------------------- Hi, I am newbie for python under the windows. I find that the cgi module always work wrong for the binary files to upload. I find that it cannot auto to discern the files' mode and alway use the default mode : 'TEXT'. So I want to change the sys.stdin 's mode to BINARY to support the binary files. I got this way: import msvcrt,os msvcrt.setmode(0,os.O_BINARY) # for stdin msvcrt.setmode(1,os.O_BINARY) # for stdout but it isn't work,too. I know use C progam language can use this function: freopen("somefilename","mode","stdin or stdout") to redirect the file flow. Can every one help me ?

    Best Regards oopos

    I

    amauryfa commented 15 years ago

    Does it work if you change your script like this: opsform = cgi.FieldStorage(open(sys.stdin.fileno(), 'rb'))

    5701fa76-724e-4c46-b8f8-fdb0d85bd9b6 commented 15 years ago

    To Amaury Forgeot d'Arc :

    Thank you.

    That error have sloved with your way: [quote]Does it work if you change your script like this: opsform = cgi.FieldStorage(open(sys.stdin.fileno(), 'rb'))[/quote]

    Now,The new problem come out: [code] 97 """Push some new data into this object."""

    98 # Handle any previous leftovers

    99 data, self._partial = self._partial + data, ''

    100 # Crack into lines, but preserve the newlines on the end of each

    101 parts = NLCRE_crack.split(data)

    data = b'-----------------------------7d91f41a302f4 \nCo...\x0e\x0f\x0c\x10\x17\x14\x18\x18\x17\x14\x16\x16', self = \<email.feedparser.BufferedSubFile object at 0x00DD5270>, self._partial = ''

    TypeError: Can't convert 'bytes' object to str implicitly [/code]

    I find that the CGI LIB didn't use bytes flow, it always use string flow.

    More info in the attch file:

    amauryfa commented 15 years ago

    OK, another try. Please replace the previous version with these 3 lines:

      encoding = os.environ.get('HTTP_TRANSFER_ENCODING')
      stdin = open(sys.stdin.fileno(), 'r', encoding=encoding)
      opsform = cgi.FieldStorage(stdin)
    5701fa76-724e-4c46-b8f8-fdb0d85bd9b6 commented 15 years ago

    Thank you for time. Now,I try with you saied.Bu it is taken wrong as before.

    See the files:

    amauryfa commented 15 years ago

    Thanks for the test case. I reproduced it easily. There is indeed a real problem in CGI streams.

    The first thing to do is to start python with the -u option (add it to the end of the first #! line), so that stdin yields bytes instead of unicode chars, and \r\n are not translated on Windows.

    Even then, I noticed that in the multipart/form-data section, text fields are utf-8 encoded, but the file content is raw binary. (FWIW, I use Firefox and Apache on Windows) No encoding seems to be specified, neither in the content, nor in the environment (no HTTP_TRANSFER_ENCODING)

    And of course, the email.parser.FeedParser object used to parse it accepts only unicode, not bytes. Help needed.

    1fd7a44c-f7f2-43ed-9c9f-bafa512b8598 commented 15 years ago

    An attempt to more accurately describe the issue, to attract more knowledgeable people, I hope...

    5701fa76-724e-4c46-b8f8-fdb0d85bd9b6 commented 15 years ago

    Hehe. I only want to use python to slove the upload files instead of PHP.but it is hard to me or I have no much time to leran it. Now,I learn Perl quickly and use it upload files, it works ok.

    Thank you . I will take more time to learn Python language well. Best Regards!

    This is my code: (Perl Language)

    1fd7a44c-f7f2-43ed-9c9f-bafa512b8598 commented 15 years ago

    You should stick to Python 2.6 (or even 2.5) for web programming - 3.0 is not mature enough. I thought this was a feasibility study on porting an existing application to Python 3.0 -- not your first steps in the language.

    c706bbe2-5492-45da-85ba-e1394ee5d891 commented 15 years ago

    I'm working on a web framework for Python 3. Naturally this is a blocker for me. I was kinda expecting this to be addressed in 3.1 but now that rc1 is out and I don't see anything about it, I'm wondering about the status of this bug. Can we get a status update?

    bitdancer commented 15 years ago

    Can you provide a test case that clearly demonstrates the problem (preferably a unit test, but anything easily reproducible will do)? I'm not sure what to do with the code attached to the case.

    c706bbe2-5492-45da-85ba-e1394ee5d891 commented 15 years ago

    I've attached unittest.zip. Simply unzip it to a directory and run it. I've included a Python2.x version of the unittest for the sake of clarity. The 2.x version was developed on 2.4. The 3.x version was developed on 3.0.1 and 3.1rc1 (with identical results).

    It seems that there are several issues with cgi.FieldStorage and multi-part form data.

    3ffd31c1-d038-4cdd-9f2b-db35d2fe89c0 commented 15 years ago

    Actually, I think this whole issue is more complex. For example, consider a (fictious) CGI script where users can upload an image and a description and the script sends a success/error message in return.

    In this case, one has to:

    Although FieldStorage only cares about reading, it still has to cope with intermixed textual and binary data. So the only practical way in my opinion is to use raw binary data and have FieldStorage decode strings on demand, since only the programmer knows whether a field should contain text or binary data. FieldStorage should offer two methods for this purpose: one for reading binary data and another for reading and decoding strings on-the-fly (possibly using a default encoding passed to its constructor).

    c706bbe2-5492-45da-85ba-e1394ee5d891 commented 15 years ago

    I think you hit the nail on the head. Now we just need (someone) to code it.

    c706bbe2-5492-45da-85ba-e1394ee5d891 commented 15 years ago

    I thought I'd take a crack at this today. I soon figured out the real issue. It is the email.parser module that handles the decoding of Multipart/form-data things...and it is also still quote broken w.r.t. handling Bytes. So this issue is dependent on http://bugs.python.org/issue4661 before it can be fixed.

    warsaw commented 15 years ago

    Please, please, please contact the email-sig and help pitch in. For many reasons I simply haven't had the cycles to work on this and I don't see that happening any time soon. There are folks willing to work on the package in the email-sig and I will add my $0.02 with design suggestions, but we really need manpower and motivation to get the email package into shape.

    a740bc66-8e71-4a24-b234-6b9a1bf8c85c commented 14 years ago

    *bump*

    Hi there Pythoners !

    As Timothy Farrell I'm currently working (or rather, toying since it just for fun) on a Python3 web framework. I just started but when it come to file upload I experience issues which I believe are connected to that issue.

    Just want if their was any progress since last messages or I should jump in and try writing my own multipart parser (as I doubt I have the required skill to contribute to the email parser).

    c706bbe2-5492-45da-85ba-e1394ee5d891 commented 14 years ago

    Perhaps this update should go in the linked email bug. The email team has a goal of fixing the email module in time for the 3.2 release. I also, feel as though I lack the skill to fix the email module, but it goes beyond that since they're potentially having to change some interfaces. See this document: http://wiki.python.org/moin/Email%20SIG/DesignThoughts

    Once the email module is fixed, the cgi module will be trivial to fix. I'm confident enough to handle it. I don't think anyone can give you a date so you'll have to make the custom solution decision based on your timeframe and patience.

    a740bc66-8e71-4a24-b234-6b9a1bf8c85c commented 14 years ago

    It seems that there wasn't work on that issue (which look complicated by the way). I'll wait, there is so much other aspects of a web framework to play with :) Thank anyway for the pointer.

    gvanrossum commented 14 years ago

    Could this be related to bpo-8077?

    c706bbe2-5492-45da-85ba-e1394ee5d891 commented 14 years ago

    Yes, they are related but not quite the same. The solution to this problem will likely include a fix to the problem described in bpo-8077.

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    Regarding http://bugs.python.org/issue4953#msg91444 POST with multipart/form-data encoding can use UTF-8, other stuff is restricted to ASCII!

    From http://www.w3.org/TR/html401/interact/forms.html: Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.

    Hence cgi formdata can safely decode text fields using UTF-8 decoding (experimentally, that is the encoding used by Firefox to support the entire ISO10646 character set).

    006d7b03-f9f3-44da-bb8d-8d7a3e15fee9 commented 13 years ago

    Hi,

    I have started working on the port of a simplified version of Karrigell (a web framework) to Python3. I experienced the same problem as the other posters : in the current version, file upload doesn't work. So I've been working on the cgi module for a few days and now have a version which correctly manages file uploads in the tests I made

    The problem in the current version (3.2b2) is that all data is read from sys.stdin, which reads strings, not bytes. This obviously can't work properly to upload binary files. In the proposed version, for multipart/form-data type, all data is read as bytes from sys.stdin.buffer ; in the CGI script, the Python interpreter must be launched with the -u option, as suggested by Amaury, otherwise sys.stdin.buffer.read() only returns the beginning of the data stream

    The headers inside the multipart/form-data are decoded to a string using sys.stdin.encoding and passed to a FeedParser (which requires strings) ; then the data is read from sys.stdin.buffer (bytes) until a boundary is found

    If the field is a file, the file object in self.file stores bytes, and the attribute "value" is a byte string. If it is not a file, the value is decoded to a string, always using sys.stdin.encoding, as for all other fields for other types of forms

    Other cosmetic changes :

    Attached file : zip with cgi_new.py and tests in a folder called "http" Tested with Python 3.2b2 (r32b2:87398, Dec 19 2010, 22:51:00) [MSC v.1500 32 bit (Intel)] on win32 ; files and CGI scripts served by Apache 2.2

    bitdancer commented 13 years ago

    Thank you very much for working on this! I'll try to take a look at the patch soon. A couple quick comments based on your posting: first, the email module now has a BytesFeedparser that will accept a byte stream, which I hope might simplify your patch. Second, it would be very helpful if you could upload your patch as an 'svn diff' against the current py3k trunk (see python.org/dev for details on how to do that). That will make review and application of the patch much much simpler. (This would be true even if more of the code in cgi.py has changed than not.) If you don't want to set up an svn checkout, then a context diff against the copy of cgi.py you started with would be second best. Please post any files individually as .patch or .diff or .txt files...these are preferred in the tracker over .zip files because they can be viewed without downloading.

    006d7b03-f9f3-44da-bb8d-8d7a3e15fee9 commented 13 years ago

    I attach the svn diff file against the present version (generated by Tortoise SVN), hope it's what you expect

    006d7b03-f9f3-44da-bb8d-8d7a3e15fee9 commented 13 years ago

    Please ignore previous post. I worked on the version of cgi.py included in version 3.2b2, and I just realized there were changes commited to the svn repository since this version. I will post the diff file later, but you can always test the files in the zip file

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    Pierre, thanks for your work on this. I hope a fix can make it in to 3.2.

    However, while starting Python with -u can help a but, that should not, in my opinion, be requirement to use CGI. Rather, the stdin should be set into binary mode by the CGI processing... it would be helpful if the CGI module either did it automatically, verified it has been done, or at least provided a helper function that could do it, and that appropriate documentation be provided, if it is not automatic. I've seen code like:

    try: # Windows needs stdio set for binary mode.
        import msvcrt
        msvcrt.setmode (0, os.O_BINARY) # stdin  = 0
        msvcrt.setmode (1, os.O_BINARY) # stdout = 1
        msvcrt.setmode (2, os.O_BINARY) # stderr = 2
    except ImportError:
        pass

    and

            if hasattr( sys.stdin, 'buffer'):
                sys.stdin = sys.stdin.buffer

    which together, seem to do the job. For output, I use a little class that accepts either binary or text, encoding the latter:

        class IOMix():
            def __init__( self, fh, encoding="UTF-8"):
                if hasattr( fh, 'buffer'):
                    self._bio = fh.buffer
                    fh.flush()
                    self._last = 'b'
                    import io
                    self._txt = io.TextIOWrapper( self.bio, encoding, None, '\r\n')
                    self._encoding = encoding
                else:
                    raise ValueError("not a buffered stream")
            def write( self, param ):
                if isinstance( param, str ):
                    self._last = 't'
                    self._txt.write( param )
                else:
                    if self._last == 't':
                        self._txt.flush()
                    self._last = 'b'
                    self._bio.write( param )
            def flush( self ):
                self._txt.flush()
            def close( self ):
                self.flush()
                self._txt.close()
                self._bio.close()
    
            sys.stdout = IOMix( sys.stdout, encoding )
            sys.stderr = IOMix( sys.stderr, encoding )

    IOMix may need a few more methods for general use, "print" comes to mind, for example.

    b2aad11c-ff63-4031-9173-99e70afe4445 commented 13 years ago

    Why not simply:

    fp = sys.stdin.detach()
    006d7b03-f9f3-44da-bb8d-8d7a3e15fee9 commented 13 years ago

    Here is the correct diff file

    I also introduced a test to exit from the loop in read_multi() if the total number of bytes read reaches "content-length". It was necessary for my framework, which uses cgi.FieldStorage to read from the attribute rfile defined in socketserver. Without this patch, the program hangs after receiving the number of bytes specified in content length. I work on a Windows XP PC so it might be related to the bug bpo-427345 handled by server.CGIHTTPRequestHandler.run_cgi()

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    Regarding the use of detach(), I don't know if it works. Maybe it would. I know my code works, because I have it working. But if there are simpler solutions that are shown to work, that would be great.

    b2aad11c-ff63-4031-9173-99e70afe4445 commented 13 years ago

    Using platform-dependant code seems iffy to me. The detach function on sys.stdin, sys,stdout and sys.stderr is there specifically to switch these streams from text mode to binary mode. See: http://docs.python.org/py3k/library/sys.html#sys.stdin

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    Peter, it seems that detach is relatively new (3.1) likely the code samples and suggestions that I had found to cure the problem predate that. While I haven't yet tried detach, your code doesn't seem to modify stdin, so are you suggesting, really...

       sys.stdin = sys.stdin.detach()

    or maybe

       if hasattr( sys.stdin, 'detach'):
            sys.stdin = sys.stdin.detach()

    On the other hand, if detach, coded as above, is equivalent to

       if hasattr( sys.stdin, 'buffer'):
            sys.stdin = sys.stdin.buffer

    then I wonder why it was added. So maybe I'm missing something in reading the documentation you pointed at, and also that at http://docs.python.org/py3k/library/io.html#io.TextIOBase.detach both of which seem to be well-documented if you already have an clear understanding of the layers in the IO subsystem, but perhaps not so well-documented if you don't yet (and I don't).

    But then you referred to the platform-dependent stuff... I don't see anything in the documentation for detach() that implies that it also makes the adjustments needed on Windows to the C-runtime, which is what the platform-dependent stuff I suggested does... if it does, great, but a bit more documentation would help in understanding that. And if it does, maybe that is the difference between the two code fragments in this comment? I would have to experiment to find out, and am not in a position to do that this moment.

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    Rereading the doc link I pointed at, I guess detach() is part of the new API since 3.1, so doesn't need to be checked for in 3.1+ code... but instead, may need to be coded as:

    try:
        sys.stdin = sys.stdin.detach()
    except UnsupportedOperation:
        pass
    6818fb22-4d1d-4949-964e-63393d581aa3 commented 13 years ago

    On 02/01/11 10:50 PM, Glenn Linderman wrote:

    Glenn Linderman \v+python@g.nevcal.com\ added the comment:

    Rereading the doc link I pointed at, I guess detach() is part of the new API since 3.1, so doesn't need to be checked for in 3.1+ code... but instead, may need to be coded as:

    try:
        sys.stdin = sys.stdin.detach()
    except UnsupportedOperation:
        pass

    ----------


    Python tracker \report@bugs.python.org\ \http://bugs.python.org/issue4953\


    Hi!

    using "detach" would be great but I'm missing that method here in 2.7! :-)

    erob@localhost:~$ python2.7
    Python 2.7.1 (r271:86832, Jan  2 2011, 10:38:30)
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> sys.stdin.detach
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'sys' is not defined
    >>> import sys
    >>> sys.stdin.detach
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'file' object has no attribute 'detach'
    6818fb22-4d1d-4949-964e-63393d581aa3 commented 13 years ago

    i'm thinking this issue is also well connected to:

    http://bugs.python.org/issue1573931

    so a backport of whatever solution comes to 3.2 would be a great addition to Python 2.6 as the very minimum, in order to satisfy minimal backward compatibility!

    Thanks,

    On 02/01/11 10:50 PM, Glenn Linderman wrote:

    Glenn Linderman \v+python@g.nevcal.com\ added the comment:

    Rereading the doc link I pointed at, I guess detach() is part of the new API since 3.1, so doesn't need to be checked for in 3.1+ code... but instead, may need to be coded as:

    try:
        sys.stdin = sys.stdin.detach()
    except UnsupportedOperation:
        pass

    ----------


    Python tracker \report@bugs.python.org\ \http://bugs.python.org/issue4953\


    bitdancer commented 13 years ago

    Etienne: since this is about solving a 3.x specific problem, it will not get backported. bpo-1573931 looks unrelated to me at a quick glance. FYI, you will find that you *do* have detach in 2.7 if you open a file using the io subsystem (import io). Of course, that isn't used for the std files in 2.7.

    Glen: the new IO subsystem is a complete C layer on top of only the most basic of the C runtime stuff. It does handle cross platform issues. Given that, and given that the input to CGI *should* be bytes, I think letting an error raise if the stream is text and detatch isn't available is fine, though we might find we want to catch it to improve the error message with extra context.

    Pierre: yes, that diff is what I was looking for. I hope to have time to look it over later today.

    6818fb22-4d1d-4949-964e-63393d581aa3 commented 13 years ago

    On 03/01/11 09:45 AM, R. David Murray wrote:

    R. David Murray \rdmurray@bitdance.com\ added the comment:

    Etienne: since this is about solving a 3.x specific problem, it will not get backported. bpo-1573931 looks unrelated to me at a quick glance. FYI, you will find that you *do* have detach in 2.7 if you open a file using the io subsystem (import io). Of course, that isn't used for the std files in 2.7.

    Glen: the new IO subsystem is a complete C layer on top of only the most basic of the C runtime stuff. It does handle cross platform issues. Given that, and given that the input to CGI *should* be bytes, I think letting an error raise if the stream is text and detatch isn't available is fine, though we might find we want to catch it to improve the error message with extra context.

    Pierre: yes, that diff is what I was looking for. I hope to have time to look it over later today.

    ----------


    Python tracker \report@bugs.python.org\ \http://bugs.python.org/issue4953\


    Thanks for theses precisions, David.

    So will cgi.FieldStorage still be usable in 3.x using 2.5 semantics ? implementing the size argument in the FieldStorage class would surely be a good fix for WSGI middlewares.

    Either ways (using the new io subsystem) or monkey-patching cgi.FieldStorage so it accepts the size argument could probably helps to resolve memory-usage issues with things like file uploads!

    Regards

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    So then David, is your suggestion to use

    sys.stdin = sys.stdin.detach()

    and you claim that the Windows-specific hacks are not needed in 3.x land? The are, in 2.x land, I have proven empirically, but haven't been able to test CGI forms very well in 3.x because of this bug. I will test 3.x download without the Windows-specific hack, and report how it goes. My testing started with 2.x and has proceeded to 3.x, and it is not always obvious what hacks are no longer needed in 3.x. Thanks for the info.

    bitdancer commented 13 years ago

    Yes, that is my suggestion. Keep in mind that I haven't looked at the patch or run any tests yet :)

    If windows-specific hacks are needed to get the binary stream in 3.x, then IMO that's a bug in IO. As far as I know at the moment there's no such bug :)

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    David, Starting from a working (but hacked to work) version of http.server and using 3.2a1 (I should upgrade to the Beta, but I doubt it makes a difference at the moment), I modified

            # if hasattr( sys.stdin, 'buffer'):
            #     sys.stdin = sys.stdin.buffer
            sys.stdin = sys.stdin.detach()

    and it all kept working.

    Then I took out the

    try: # Windows needs stdio set for binary mode.
        import msvcrt
        msvcrt.setmode (0, os.O_BINARY) # stdin  = 0
        msvcrt.setmode (1, os.O_BINARY) # stdout = 1
        msvcrt.setmode (2, os.O_BINARY) # stderr = 2
    except ImportError:
        pass

    and it quit working. Seems that \r\r\n\r\r\n is not recognized by Firefox as the "end of the headers" delimiter.

    Whether this is a bug in IO or not, I can't say for sure. It does seem, though, that

    1) If Python is fully replacing the IO layers, which in 3.x it seems to claim to, then it should fully replace them, building on a binary byte stream, not a "binary byte stream with replacement of \n by \r\n". The Windows hack above replaces, for stdin, stdout, and stderr, a "binary byte stream with replacement of \n by \r\n" with a binary byte stream. Seems like Python should do that, on Windows, so that it has a chance of actually knowing/controlling what gets generated. Perhaps it does, if started with "-u", but starting with "-u" should not be a requirement for a properly functioning program. Alternately, the IO streams could understand, and toggle the os.O_BINARY flag, but that seems like it would require more platform-specific code than simply opening all Windows files (and adjusting preopened Windows files) during initialization.

    2) The weird CGI processing that exists in the released version of http.server seems to cover up this problem, partly because it isn't very functional, claims "alternate semantics" (read: non-standard semantics), and invokes Python with -u when it does do so. It is so non-standard that it isn't clear what should or should not be happening. But the CGI scripts I am running, that pass or fail as above, also run on Windows 2.6, and particularly, Unix 2.6, in an Apache environment. So I have been trying to minimize the differences to startup code, rather than add platform-specific tweaks throughout the CGI scripts.

    That said, it clearly could be my environment, but I've debugged enough different versions of things to think that the Windows hack above is required on both 2.x and 3.x to ensure proper bytestreams.... and others must think so too, because I found the code by searching on Google, not because I learned enough Python internals to figure it out on my own. The question I'm attempting to address here, is only that 3.x still needs the same hack that 2.x needs, on Windows, to create bytestreams.

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    (and I should mention that all the "hacked to work" issues in my copy of http.server have been reported as bugs, on 2010-11-21. The ones of most interest related to this binary bytestream stuff are bpo-10479 and bpo-10480)

    006d7b03-f9f3-44da-bb8d-8d7a3e15fee9 commented 13 years ago

    Other version of the diff file. Nothing changed but I'm afraid I had left duplicate definitions of some methods in the FieldStorage class I follow the discussion on this thread, but would like to know if the patch has been tested and works

    bitdancer commented 13 years ago

    A day late, but I've looked at the patch.

    Now, I'm not all that knowledgeable about CGI, so other people will probably want to chime in here....

    First, I'm uploading a new version of the patch as an svn diff (can be applied to a checkout using 'patch -p0 \<patchfile' from the top level directory of the checkout). This includes Pierre's patch unchanged, and includes changes to test_cgi so that Pierre's patch is tested. Some of the tests fail.

    A couple of the failures have to do with file bodies being returned as binary when previously they were returned as strings. This raises the issue of backward compatibility: if cgi/fieldstorage using applications exist for 3.1, changing this will break them. There may not be a good solution to that problem. But it also may not be possible to fix this in 3.2 at this point (which I seem to have already decided earlier, but I can't now remember why...).

    From looking over the cgi code it is not clear to me whether Pierre's approach is simpler or more complex than the alternative approach of starting with binary input and decoding as appropriate. From a consistency perspective I would prefer the latter, but I don't know if I'll have time to try it out before rc1.

    I also wonder if it would be possible to rewrite FieldStorage to take even better advantage of FeedParser, but if so that would *certainly* not happen before rc1.

    bitdancer commented 13 years ago

    Here is a modified version of the unittest file from unittest.zip that can be run against Pierre's code (it feeds FieldStorage a text stream with a buffer). Running the tests require the data files from the zip.

    They do not pass, in a very different way from the test_cgi failures.

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    R. David said:

    From looking over the cgi code it is not clear to me whether Pierre's approach is simpler or more complex than the alternative approach of starting with binary input and decoding as appropriate. From a consistency perspective I would prefer the latter, but I don't know if I'll have time to try it out before rc1.

    I say: I agree with R. David that an approach using the binary input seems more appropriate, as the HTTP byte stream is defined as binary. Do the 3.2 beta email docs now include documentation for the binary input interfaces required to code that solution? Or could you provide appropriate guidance and review, should someone endeavor to attempt such a solution?

    The remaining concerns below are only concerns; they may be totally irrelevant, and I'm too ignorant of how the code works to realize their irrelevance. Hopefully someone that understands the code can comment and explain.

    I believe that the proper solution is to make cgi work if sys.stdin has already been converted to be a binary stream, or if it hasn't, to dive down to the underlying binary stream, using detach(). Then the data should be processed as binary, and decoded once, when the proper decoding parameters are known. The default encoding seems to be different on different platforms, but the binary stream is standardized. It looks like new code was added to attempt to preprocess the MIME data into chunks to be fed to the email parser, and while I can believe code could be written to do such correctly (but I can't speak for whether this patch code is correct or not), it seems redundant/inefficient and error-prone to do it once outside the email parser, and again inside it.

    I also doubt that self.fp.encoding is consistent from platform to platform). But the HTTP bytestream is binary, and self-describing or declared by HTTP or HTML standards for the parts that are not self-describing. The default platform encoding used for the preopened sys.stdin is not particularly relevant and may introduce mojibake type bugs, decoding errors in the presence of some inputs, and/or platform inconsistencies, and it seems that that is generally where self.fp.encoding, used in various places in this patch, comes from.

    Regarding the binary vs. text issue; when using both binary and text interfaces on output streams, there is the need to do flushing between text and binary writes to preserve the proper sequencing of data in the output. For input, is it possible that mixing text and binary input could result in the binary input missing data that has already been preloaded into the text buffer? Although, for CGI programs, no one should have done any text inputs before calling the CGI functions, so perhaps this is also not a concern... and there probably isn't any buffering on socket streams (the usual CGI use case) but I see the use of both binary and text input functions in this patch, so this may be another issue that someone could explain why such a mix is or isn't a problem.

    bitdancer commented 13 years ago

    Yeah, the documentation for the email stuff is in the dev docs. There's a short summary in the changes section of the email intro with links to the classes and methods that are affected. But basically you call BinaryFeedParser and feed it a binary data, and everything else works just like it did before, including the fact that get_payload() with no arguments returns a string. If there is non-ASCII data in that string and no charset was specified the binary data will get trashed though. To get the binary data out you call it with decode=True.

    I believe you are right that the io module does not support intermixing calls to the main object and its buffer attribute; that's why detach was introduced, I believe. Antoine is nosy on this issue now; he can correct me if I'm wrong.

    So unfortunately I think we do need to come at this starting from binary at the beginning and *decoding* as needed (I believe http uses latin-1 when no charset is specified, but I need to double check that). That still leaves the problem of what if anything to do about existing programs that expect every value in a FieldStorage to be a string. Introduce a new method or parameter for getting the binary version of the value, possibly with some flag indicating that parsing detected non-ascii data?

    merwok commented 13 years ago

    I believe you are right that the io module does not support intermixing calls to the main object and its buffer attribute

    I’ve learned in a recent discussion on web-sig that you can mix them, provided that you call stream.flush() before using stream.buffer.

    pitrou commented 13 years ago

    I believe you are right that the io module does not support intermixing calls to the main object and its buffer attribute; that's why detach was introduced, I believe.

    Writing is ok as long as you call flush() on the text layer when necessary. Reading is not since there's no official way to flush the input buffer on the text layer (assuming some input has been consumed, that is). detach() doesn't do anything special AFAIR.

    (this is all funny in the light of the web-sig discussion where people explain that CGI is such a natural model)

    006d7b03-f9f3-44da-bb8d-8d7a3e15fee9 commented 13 years ago

    I agree that the only consistent solution is to impose that the attribute self.fp must read bytes in all cases, all required conversions should occur inside FieldStorage, using "some" encoding (not sure how to define it...)

    If no argument fp is passed to __init__(), the instance uses the binary version of sys.stdin. In my patch I use sys.stdin.buffer, but it also works if I set it to sys.stdin.detach()

    In all cases the interpreter must be launched with the -u option. As stated in the documentation, the effect of this option is to "force the binary layer of the stdin, stdout and stderr streams (which is available as their buffer attribute) to be unbuffered. The text I/O layer will still be line-buffered.". On my PC (Windows XP) this is required to be able to read all the data stream ; otherwise, only the beginning is read. I tried Glenn's suggestion with mscvrt, with no effect

    I am working on the cgi.py module so that all tests (test_cgi and cgi_test) pass with binary streams. It's almost finished ; I had to adapt the tests, and sometimes fix bugs in them

    Problems in test_cgi.py :

    Problems in cgi_test.py

    I will send the results (diff for new version of cgi + tests) hopefully tomorrow

    b5a9ce10-d67f-478f-ab78-b08d099eb753 commented 13 years ago

    R. David said: (I believe http uses latin-1 when no charset is specified, but I need to double check that)

    See http://bugs.python.org/issue4953#msg121864 ASCII and UTF-8 are what HTTP defines. Some implementations may, in fact, use latin-1 instead of ASCII in some places. Not sure if we want Python CGI to do that or not.

    Thanks for getting the email APIs in the docs... shouldn't have to bug you as much that way :)

    Antoine said: (this is all funny in the light of the web-sig discussion where people explain that CGI is such a natural model)

    Thanks for clarifying the stdin buffering vs. binary issue... it is as I suspected. Maybe you can also explain the circumstances in which "my" Windows code is needed, and whether Python's "-u" does it automatically, but I still believe that "-u" shouldn't be necessary for a properly functioning program, not even a CGI program... it seems like a hack to allow some programs to work without other changes, so might be a useful feature, but hopefully not a required part of invoking a CGI program.

    The CGI interface is "self describing", when you follow the standards, and use the proper decoding for the proper pieces. In that way, it is similar to email. It is certainly not as simple as using UTF-8 everywhere, but compatibility with things invented before UTF-8 even existed somewhat prevents the simplest solution, and then not everything is text, either. At least it is documented, and permits full UNICODE data to be passed around where needed, and permits binary to be passed around where that is needed, when the specs are adhered to.