Open milancurcic opened 7 years ago
I did some digging and the issue seems to stem from hug.input_format.multipart, which invokes cgi.parse_multipart:
form = parse_multipart((body.stream if hasattr(body, 'stream') else body), header_params)
Here, body is a gunicorn.http.body.Body instance in my case, which is a file-like object.
cgi.parse_multipart reads the whole bytestream into memory before returning, which results in the behavior that I described in my original post. The docstring for cgi.parse_multipart indeed suggests that it is not suitable for large files, and that cgi.FieldStorage should be used instead:
Parse multipart input.
Arguments:
fp : input file
pdict: dictionary containing other parameters of content-type header
Returns a dictionary just like parse_qs(): keys are the field names, each
value is a list of values for that field. This is easy to use but not
much good if you are expecting megabytes to be uploaded -- in that case,
use the FieldStorage class instead which is much more flexible. Note
that content-type is the raw, unparsed contents of the content-type
header.
XXX This does not parse nested multipart parts -- use FieldStorage for
that.
XXX This should really be subsumed by FieldStorage altogether -- no
point in having two implementations of the same parsing algorithm.
Also, FieldStorage protects itself better against certain DoS attacks
by limiting the size of the data read in one chunk. The API here
does not support that kind of protection. This also affects parse()
since it can call parse_multipart().
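To make the contrast in the docstring concrete, here is a minimal sketch of reading a file-like object in fixed-size chunks, so that memory use is bounded by the chunk size rather than by the upload size. The helper name iter_chunks and the 4096-byte default are illustrative choices and are not part of hug, gunicorn, or cgi; io.BytesIO stands in for the request body.

```python
import io
from functools import partial


def iter_chunks(stream, chunk_size=4096):
    """Return an iterator over successive chunks of a file-like object."""
    # iter(callable, sentinel) keeps calling stream.read(chunk_size)
    # and stops as soon as it returns the empty bytes object.
    return iter(partial(stream.read, chunk_size), b"")


# Stand-in for a request body; any object with a read(size) method works.
body = io.BytesIO(b"x" * 10000)
sizes = [len(chunk) for chunk in iter_chunks(body)]
print(sizes)  # [4096, 4096, 1808]
```

Only one chunk is held in memory at a time, which is the property parse_multipart lacks.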
I tried replacing the call to parse_multipart with a FieldStorage instance, but I was unable to get any data through it:
#form = parse_multipart((body.stream if hasattr(body, 'stream') else body), header_params)
form = FieldStorage(fp=body,outerboundary=header_params['boundary'])
print(form)
Output from print(form):
FieldStorage(None, None, [])
@timothycrosley If you have any suggestions on how I could move forward with implementing this with FieldStorage, I would be happy to work on this feature, as I need it.
Update: Using this kind of invocation:
form = FieldStorage(fp=body, outerboundary=header_params['boundary'],
                    environ={'REQUEST_METHOD': 'POST', 'CONTENT_TYPE': 'MULTIPART/FORM-DATA'})
I was able to get an instance where form.fp is a file-like object I can stream. However, the multipart body is not parsed, i.e. the other form items as well as the boundary are part of the stream.
My gut feeling is that cgi.FieldStorage can be used to get the buffered file object without the other form items, but I am not invoking it correctly. Any input is much appreciated.
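To illustrate what an unparsed multipart stream contains, the following sketch builds a small multipart/form-data body by hand and splits it into fields with the standard-library email parser. It is an illustration of the wire format only: it loads the whole body into memory, so it does not solve the streaming problem. The boundary value, the field contents, and the helper name parse_in_memory are made up for the example.

```python
from email import message_from_bytes

BOUNDARY = b"XbOuNdArY"  # hypothetical boundary; normally chosen by the client

# A raw multipart body: boundaries and per-part headers are interleaved
# with the field values, which is what an unparsed stream looks like.
raw_body = b"\r\n".join([
    b"--" + BOUNDARY,
    b'Content-Disposition: form-data; name="foo"',
    b"",
    b"hellohello",
    b"--" + BOUNDARY,
    b'Content-Disposition: form-data; name="file"; filename="example_file"',
    b"Content-Type: application/octet-stream",
    b"",
    b"file contents",
    b"--" + BOUNDARY + b"--",
    b"",
])


def parse_in_memory(body, boundary):
    """Split a multipart body into a dict (illustration only, not streaming)."""
    header = b"Content-Type: multipart/form-data; boundary=" + boundary + b"\r\n\r\n"
    message = message_from_bytes(header + body)
    form = {}
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        data = part.get_payload(decode=True)
        filename = part.get_filename()
        # File fields become (filename, bytes); plain fields become strings.
        form[name] = (filename, data) if filename else data.decode()
    return form


form = parse_in_memory(raw_body, BOUNDARY)
print(form)
```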
@timothycrosley I had success working this out by incorporating the multipart parser into hug.input_format.multipart:
@content_type('multipart/form-data')
def multipart(body, **header_params):
    """Converts multipart form data into native Python objects"""
    from multipart import MultipartParser
    if header_params and 'boundary' in header_params:
        if type(header_params['boundary']) is str:
            header_params['boundary'] = header_params['boundary'].encode()
    parser = MultipartParser(stream=body, boundary=header_params['boundary'], disk_limit=17179869184)
    form = dict(zip([p.name for p in parser.parts()],
                    [(p.filename, p.file) if p.filename else p.file.read().decode() for p in parser.parts()]))
    #form = parse_multipart((body.stream if hasattr(body, 'stream') else body), header_params)
    #for key, value in form.items():
    #    if type(value) is list and len(value) is 1:
    #        form[key] = value[0]
    return form
What multipart.MultipartParser does is write form items to disk if their size exceeds mem_limit (default 2**20).
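The spill-to-disk idea can be sketched with the standard library's tempfile.SpooledTemporaryFile, which keeps data in memory until a size threshold is crossed and then moves it to a temporary file. This is an analogous technique shown for illustration, not the multipart package's actual implementation; the helper name buffer_part and its mem_limit parameter are assumptions made for the example.

```python
import tempfile


def buffer_part(chunks, mem_limit=2**20):
    """Collect chunks in memory, spilling to a temporary file past mem_limit bytes."""
    spool = tempfile.SpooledTemporaryFile(max_size=mem_limit)
    for chunk in chunks:
        spool.write(chunk)
    spool.seek(0)  # rewind so the caller can read from the start
    return spool


small = buffer_part([b"abc", b"def"])                   # stays in memory
large = buffer_part([b"x" * 1024] * 4, mem_limit=1024)  # exceeds the limit, goes to disk
print(small.read())       # b'abcdef'
print(len(large.read()))  # 4096
```

Either way the caller gets back the same file-like interface, so downstream code does not need to know where the data lives.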
The form returned by multipart is still a dict, where each value is a (filename, <io.BytesIO>) tuple if the key is a file (e.g. a file upload), and a string otherwise. For example, for a request:
curl -v -H "Content-Type: multipart/form-data" \
-F "foo=hellohello" \
-F "bar=0123456789" \
-F "file=@example_file" \
-X POST ${url}/v1/upload
Hug example code:
import hug

@hug.post('/upload', versions=1)
def upload_file(body, request, response):
    """Receives a stream of bytes and writes them to a file."""
    print(body)
    # body['file'] is a (filename, file object) tuple
    filename = body['file'][0]
    filebody = body['file'][1]
    with open(filename, 'wb') as f:
        chunksize = 4096
        # Copy in fixed-size chunks so the whole upload is never held in memory
        while True:
            chunk = filebody.read(chunksize)
            if not chunk:
                break
            f.write(chunk)
    return
The resulting form is:
{
'file': ('example_file', <_io.BytesIO object at 0x7fc32549dee8>),
'bar': '0123456789',
'foo': 'hellohello'
}
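A form of this shape can be consumed generically. The sketch below walks such a dict, skips plain string fields, and copies each file object to disk in chunks with shutil.copyfileobj. Because the filename comes from the client, it keeps only the final path component before opening the file. The helper name save_uploads and the destination directory are assumptions for the example, not part of hug or the multipart package.

```python
import io
import os
import shutil
import tempfile


def save_uploads(form, dest_dir, chunk_size=4096):
    """Write every (filename, fileobj) entry of form into dest_dir; return saved paths."""
    saved = {}
    for key, value in form.items():
        if not isinstance(value, tuple):
            continue  # plain string field, nothing to write
        filename, fileobj = value
        # Keep only the final path component of the client-supplied name.
        safe_name = os.path.basename(filename.replace("\\", "/"))
        if safe_name in ("", ".", ".."):
            continue  # no usable filename
        path = os.path.join(dest_dir, safe_name)
        with open(path, "wb") as out:
            shutil.copyfileobj(fileobj, out, chunk_size)  # chunked copy
        saved[key] = path
    return saved


form = {
    "file": ("../example_file", io.BytesIO(b"payload")),
    "bar": "0123456789",
    "foo": "hellohello",
}
dest = tempfile.mkdtemp()
saved = save_uploads(form, dest)
print(saved)
```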
@milancurcic I currently maintain streaming-form-data, a multipart/form-data parser that parses the uploaded content in chunks and lets you stream the underlying file content (not the form body but the actual uploaded file) straight to disk without loading the entire thing into memory. It's written in Cython, so it's also quite fast.
Some others running into a similar issue with Flask have found it useful, so I thought I might suggest using it for hug as well. Hopefully it helps with the issue here too.
The cgi module cannot handle POST requests with multipart/form-data in Python 3.x: https://bugs.python.org/issue4953. The line ending on Linux is "\n" while on Windows it is "\r\n", so cgi.FieldStorage has a problem on Windows.
The multipart parser handles this:
# multipart.py, line 232
def _lineiter(self):
    #####
    for line in lines:
        if line.endswith(b"\r\n"):
            yield line[:-2], b"\r\n"
        elif line.endswith(b"\n"):
            yield line[:-1], b"\n"
        elif line.endswith(b"\r"):
            yield line[:-1], b"\r"
        else:
            yield line, b""
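The excerpt above is elided, so here is a self-contained sketch of the same line-ending logic that can be run on its own. The function name split_line_ending is hypothetical; it mirrors the branches in the excerpt rather than reproducing the multipart package's code.

```python
def split_line_ending(line):
    """Return (content, terminator) for one raw line of bytes."""
    # Check the two-byte CRLF first so it is not mistaken for a bare LF.
    for terminator in (b"\r\n", b"\n", b"\r"):
        if line.endswith(terminator):
            return line[: -len(terminator)], terminator
    return line, b""


samples = [b"unix\n", b"windows\r\n", b"old-mac\r", b"no-eol"]
for raw in samples:
    print(split_line_ending(raw))
```

Returning the terminator alongside the content lets a parser reassemble the payload byte for byte, whichever convention the client used.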
ulmentflam commented: This method works in Linux environments.
I am trying to get hug to receive a multipart/form-data POST request and stream the body in chunks straight to disk. I was able to stream-upload a large binary file using an application/octet-stream POST request. Here is my hug method:
And here is my curl snippet:
The above works, and I'm able to stream-upload the file this way because in the upload_file function, body is a gunicorn.http.body.Body instance, which I am able to stream straight to disk in chunks.
However, I need to be able to upload files from a browser, which sends a multipart/form-data POST request. To emulate this with curl, I do:
This time, in hug, body is a dictionary, and body['file'] is a Bytes instance. However, I don't know how to stream this to disk without loading the whole thing into memory first.
Is there a way I could obtain the body as a file object that I could stream straight to disk?
Any help much appreciated, and thank you for the fantastic work on Hug!