hugapi / hug

Embrace the APIs of the future. Hug aims to make developing APIs as simple as possible, but no simpler.
MIT License
6.86k stars · 387 forks

How to stream upload a file using multipart/form-data #474

Open milancurcic opened 7 years ago

milancurcic commented 7 years ago

I am trying to get hug to receive a multipart/form-data POST request and stream the body in chunks straight to disk. I was able to successfully stream-upload a large binary file using an application/octet-stream POST. Here is my hug endpoint:

from gunicorn.http.body import Body

@hug.post('/upload', versions=1, requires=cors_support)
def upload_file(body, request, response):
    """Receives a stream of bytes and writes them to a file
    with the name provided in the request header."""
    if isinstance(body, dict):
        # If Content-Type is multipart/form-data, body is a
        # dictionary. This loads the whole payload into memory
        # before writing it to disk.
        filename = body['filename']
        filebody = body['file']
        with open(filename, 'wb') as f:
            f.write(filebody)
    elif isinstance(body, Body):
        # If Content-Type is application/octet-stream, body is a
        # gunicorn.http.body.Body: a file-like object that can be
        # read and written in chunks.
        filename = request.headers['FILENAME']
        chunksize = 4096
        with open(filename, 'wb') as f:
            while True:
                chunk = body.read(chunksize)
                if not chunk:
                    break
                f.write(chunk)

And here is my curl snippet:

url=http://localhost:8000/v1/upload
filename=largefile.dat

curl -v -H "filename: $filename" \
        -H "Content-Type: application/octet-stream" \
        --data-binary @$filename -X POST $url

The above works and I'm able to stream upload the file like this because in the upload_file function, body is a gunicorn.http.body.Body instance which I am able to stream straight to disk in chunks.
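As an aside, the manual read loop can also be expressed with shutil.copyfileobj, which performs the same fixed-size chunked copy. A minimal sketch, using an in-memory BytesIO as a hypothetical stand-in for the request body (any file-like object works, including gunicorn.http.body.Body):

```python
import io
import os
import shutil
import tempfile

# Hypothetical stand-in for the request body; any file-like
# object works here, including gunicorn.http.body.Body.
body = io.BytesIO(b"x" * 10000)

with tempfile.NamedTemporaryFile(delete=False) as f:
    # copyfileobj reads and writes in fixed-size chunks (4 KiB here),
    # so the whole payload is never held in memory at once.
    shutil.copyfileobj(body, f, length=4096)
    path = f.name

print(os.path.getsize(path))  # 10000
```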

However I need to be able to upload files from browser, which sends a multipart/form-data POST request. To emulate this with curl, I do:

url=http://localhost:8000/v1/upload
filename=largefile.dat

curl -v -H "filename: $filename" \
        -H "Content-Type: multipart/form-data" \
        -F "filename=$filename" \
        -F "file=@$filename;type=application/octet-stream" \
    -X POST $url

This time, in hug, the body is a dictionary, and body['file'] is a bytes instance. However, I don't know how to stream this to disk without loading the whole thing into memory first.

Is there a way I could obtain the body as a file object that I could stream straight to disk?

Any help much appreciated and thank you for the fantastic work on Hug!

milancurcic commented 7 years ago

I did some digging and the issue seems to stem from hug.input_formats.multipart, which invokes cgi.parse_multipart:

form = parse_multipart((body.stream if hasattr(body, 'stream') else body), header_params)

Here, body is a gunicorn.http.body.Body instance in my case, which is a file-like object.

cgi.parse_multipart reads the whole bytestream into memory before returning, which results in the behavior that I described in my original post. The docstring for cgi.parse_multipart indeed suggests that this is not suitable for large files, and that cgi.FieldStorage should be used instead:

    Parse multipart input.

    Arguments:
    fp   : input file
    pdict: dictionary containing other parameters of content-type header

    Returns a dictionary just like parse_qs(): keys are the field names, each
    value is a list of values for that field.  This is easy to use but not
    much good if you are expecting megabytes to be uploaded -- in that case,
    use the FieldStorage class instead which is much more flexible.  Note
    that content-type is the raw, unparsed contents of the content-type
    header.

    XXX This does not parse nested multipart parts -- use FieldStorage for
    that.

    XXX This should really be subsumed by FieldStorage altogether -- no
    point in having two implementations of the same parsing algorithm.
    Also, FieldStorage protects itself better against certain DoS attacks
    by limiting the size of the data read in one chunk.  The API here
    does not support that kind of protection.  This also affects parse()
    since it can call parse_multipart().
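For context, here is a self-contained sketch of the wire format parse_multipart has to consume, built with only the stdlib email parser; like parse_multipart, this approach buffers the whole payload in memory, which is exactly the problem for large files. The boundary and field names here are made up for illustration:

```python
from email.parser import BytesParser
from email.policy import default

# A minimal multipart/form-data payload, mirroring what curl -F sends:
# each part has its own headers, separated by a boundary line.
raw = (
    b"Content-Type: multipart/form-data; boundary=XBOUNDARY\r\n\r\n"
    b"--XBOUNDARY\r\n"
    b'Content-Disposition: form-data; name="filename"\r\n\r\n'
    b"largefile.dat\r\n"
    b"--XBOUNDARY\r\n"
    b'Content-Disposition: form-data; name="file"; filename="largefile.dat"\r\n'
    b"Content-Type: application/octet-stream\r\n\r\n"
    b"hello\r\n"
    b"--XBOUNDARY--\r\n"
)

msg = BytesParser(policy=default).parsebytes(raw)
# Collect each part's form-field name and its decoded payload.
fields = {
    part.get_param("name", header="content-disposition"): part.get_payload(decode=True)
    for part in msg.iter_parts()
}
print(fields)  # {'filename': b'largefile.dat', 'file': b'hello'}
```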

I went ahead and tried replacing the call to parse_multipart with a FieldStorage instance, but I was unable to get any data through it:

    #form = parse_multipart((body.stream if hasattr(body, 'stream') else body), header_params)
    form = FieldStorage(fp=body,outerboundary=header_params['boundary'])
    print(form)

Output from print(form):

FieldStorage(None, None, [])

@timothycrosley If you have any suggestion on how I could move forward with implementing this with FieldStorage I would be happy to work on this feature as I need it.

milancurcic commented 7 years ago

Update: Using this kind of invocation:

form = FieldStorage(fp=body, outerboundary=header_params['boundary'],
                    environ={'REQUEST_METHOD': 'POST',
                             'CONTENT_TYPE': 'MULTIPART/FORM-DATA'})

I was able to get an instance where form.fp is a file-like object I can stream; however, the multipart body is not parsed, i.e. the other form items as well as the boundary are still part of the stream.

My gut-feeling is that cgi.FieldStorage can be used to get the buffered file object without other form items, but I am not invoking it correctly. Any input much appreciated.

milancurcic commented 7 years ago

@timothycrosley I had success working this out by incorporating the multipart parser into hug.input_formats.multipart:

from multipart import MultipartParser

@content_type('multipart/form-data')
def multipart(body, **header_params):
    """Converts multipart form data into native Python objects"""
    if header_params and 'boundary' in header_params:
        if isinstance(header_params['boundary'], str):
            header_params['boundary'] = header_params['boundary'].encode()
    parser = MultipartParser(stream=body, boundary=header_params['boundary'],
                             disk_limit=17179869184)  # 16 GiB
    # File parts become (filename, file object) tuples; plain form
    # fields are decoded to str.
    form = {p.name: (p.filename, p.file) if p.filename else p.file.read().decode()
            for p in parser.parts()}
    return form

multipart.MultipartParser spools form items to disk when their size exceeds mem_limit (default 2**20 bytes).
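The stdlib has an analogue of this spooling behaviour in tempfile.SpooledTemporaryFile: data stays in memory until max_size is exceeded, then rolls over to a real temporary file on disk. A small sketch (note that _rolled is a private CPython attribute, inspected here only for illustration):

```python
import tempfile

# Spool up to 1 KiB in memory, then roll over to disk.
buf = tempfile.SpooledTemporaryFile(max_size=1024)

buf.write(b"a" * 512)
print(buf._rolled)  # False: still an in-memory buffer

buf.write(b"a" * 1024)
print(buf._rolled)  # True: 1536 bytes exceeded max_size, spooled to disk

buf.seek(0)
print(len(buf.read()))  # 1536
buf.close()
```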

The form returned by multipart is still a dict, where each value is a (filename, <io.BytesIO>) tuple if the part is a file (e.g. a file upload), and a string otherwise. For example, for a request:

    curl -v -H "Content-Type: multipart/form-data" \
            -F "foo=hellohello" \
            -F "bar=0123456789" \
            -F "file=@example_file" \
            -X POST ${url}/v1/upload

Hug example code:

@hug.post('/upload', versions=1)
def upload_file(body, request, response):
    """Receives a stream of bytes and writes them to a file."""
    print(body)
    filename = body['file'][0]
    filebody = body['file'][1]
    chunksize = 4096
    with open(filename, 'wb') as f:
        while True:
            chunk = filebody.read(chunksize)
            if not chunk:
                break
            f.write(chunk)

The resulting form is:

{
  'file': ('example_file', <_io.BytesIO object at 0x7fc32549dee8>), 
  'bar': '0123456789', 
  'foo': 'hellohello'
}

siddhantgoel commented 5 years ago

@milancurcic I currently maintain streaming-form-data, a multipart/form-data parser that parses the uploaded content in chunks and lets you stream the underlying file content (not the form body but the actual file uploaded) straight to disk without loading the entire thing into memory. It's written in Cython, so it's also quite fast.

Some others running into a similar issue with Flask have found it useful, so I thought I might suggest using it for hug as well. Hopefully it helps with the issue here too.

Panlq commented 3 years ago


The cgi module cannot handle POST with multipart/form-data in 3.x (https://bugs.python.org/issue4953). The line ending on Linux is "\n" while on Windows it is "\r\n", so cgi.FieldStorage has a problem on Windows.

The multipart parser handles this:

# multipart.py, line 232
def _lineiter(self):
    # ...
    for line in lines:
        if line.endswith(b"\r\n"):
            yield line[:-2], b"\r\n"
        elif line.endswith(b"\n"):
            yield line[:-1], b"\n"
        elif line.endswith(b"\r"):
            yield line[:-1], b"\r"
        else:
            yield line, b""
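That splitting logic can be exercised standalone; here lineiter is a hypothetical free-function version of multipart's _lineiter method, showing how CRLF, LF, and bare CR lines all yield the content and its line ending separately:

```python
def lineiter(data):
    # Split raw bytes into (content, line_ending) pairs, mirroring
    # multipart's _lineiter, so CRLF, LF, and bare CR all work.
    for line in data.splitlines(keepends=True):
        if line.endswith(b"\r\n"):
            yield line[:-2], b"\r\n"
        elif line.endswith(b"\n"):
            yield line[:-1], b"\n"
        elif line.endswith(b"\r"):
            yield line[:-1], b"\r"
        else:
            yield line, b""

print(list(lineiter(b"a\r\nb\nc")))
# [(b'a', b'\r\n'), (b'b', b'\n'), (b'c', b'')]
```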


ulmentflam commented

This method works in Linux environments.