FR: multipart form parser

ITwrx commented 7 months ago

I opened a forum post here, as I thought you may not want Guildenstern to have any opinion on multipart parsing, but multipart parsing in Nim seems to be largely up to the HTTP servers/frameworks from what I'm seeing. What are your thoughts on this? Any interest in implementing one, or accepting a PR for it?

thanks

olliNiinivaara commented 7 months ago

See my forum post. In the meantime, please confirm that you are following streamingposttest.nim in your code. If that does not cut it, we might appreciate some small working example from you to get a flying start.

ITwrx commented 7 months ago

hi, i responded in the forum. Here's what i'm doing just to try and get it working.

import guildenstern/[dispatcher, streamingserver]
from os import sleep

proc handleUpload() =
  var body: string
  var name: string
  var trials = 0
  for (state , chunk) in receiveInChunks():
    case state:
      of TryAgain:
        suspend(100)
        trials += 1
        if trials > 100: (closeSocket() ; break)
      of Fail: reply(Http400)
      of Progress: body &= chunk
      of Complete:
        echo body        
        reply(Http204)
  sleep(100)
  shutdown()

proc onRequest() =
  let html = """<!doctype html><title>StreamCtx</title><body>
  <form action="/upload" method="post" enctype="multipart/form-data" accept-charset="utf-8">
  <label for="msg">Msg:</label>
  <input type="text" id="msg" name="msg"><br>
  <label for="file">File:</label>
  <input type="file" id="file" name="file"><br>
  <input type="submit">"""

  if startsUri("/upload"): handleUpload()
  else: reply(Http200, html)

let server = newStreamingServer(onRequest)
server.start(5050)
joinThread(server.thread)

I was confused by the fact that the streaming server example and streamingposttest were different. If streamingposttest uses the receiveInChunks() iterator, and that is the preferred way, then why isn't that used in the streamingserver example or why aren't the streamingserver example and the streamingposttest implemented in the same way? I try to get as much info as possible from these examples and tests, even if it seems like i'm not reading them. :) I'm also still not clear on what you meant when you said [1]:

yes, upload from client to server works with StreamingServer.

So does that mean "upload only" and i need to add multipart parsing lib/code, or i'm supposed to be able to grab the inputs and file from the mime doc/string body already, and i just don't know how?

From what i can tell, neither example fully shows how to "upload a file". By which i mean, post a file from a form, parse it and save it in the recommended way. I would expect that would mean saving chunks of the file to temp files on disk until the upload is done, then saving them to a real file or similar, assuming the file was over some configurable size threshold, otherwise saving in one go. When i saw "chunks" i was thinking that's what the example would show. Maybe there will be no "recommended way" to save the file and that is fine, but i'm still not clear on the parsing part.

https://forum.nim-lang.org/t/10433#69528

Thanks for any clarification you can provide.

olliNiinivaara commented 7 months ago

Now I am starting to get it...

The misunderstanding is due to the fact that "a chunk" here means simply any packet of bytes that the OS can offer. receiveMultipart() is a very low-level procedure for asking for one such chunk, and receiveInChunks() is just a slightly higher-level iterator for getting those chunks as a stream. The stream of byte chunks might then represent multipart/form-data (or it might not), but the relation between the chunks and the form parts is nontrivial: even the boundary marker between the parts might end up in different chunks.

I have tried to include some super elementary support for the multipart/form-data format, just for cases when there is only one part (like one HTTP request or one file upload). Your use case is already beyond the current capabilities. But I hear you. This has to be supported somehow. A solution presumably requires keeping a window of recently received data, and thus this is a bit challenging to design. I will find a slot for working on this in the next couple of weeks.
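
A minimal sketch of the "window of recently received data" idea, using only the Nim standard library. `BoundaryScanner`, `feed` and the `Event` tuple are hypothetical names for illustration, not Guildenstern API, and the delimiter is simplified to `--` plus the boundary token (real multipart bodies also involve the preceding CRLF and a closing `--`).

```nim
import std/strutils

type
  EventKind = enum
    evData, evBoundary
  Event = tuple[kind: EventKind, data: string]
  BoundaryScanner = object   # hypothetical helper, not part of Guildenstern
    delimiter: string        # "--" & boundary token (simplified)
    window: string           # bytes held back because they might start a delimiter

proc initBoundaryScanner(boundary: string): BoundaryScanner =
  BoundaryScanner(delimiter: "--" & boundary, window: "")

proc feed(s: var BoundaryScanner, chunk: string): seq[Event] =
  ## Adds one OS-level chunk and returns everything that can be decided so far.
  s.window.add chunk
  while true:
    let pos = s.window.find(s.delimiter)
    if pos < 0: break
    if pos > 0:
      result.add((kind: evData, data: s.window.substr(0, pos - 1)))
    result.add((kind: evBoundary, data: ""))
    s.window = s.window.substr(pos + s.delimiter.len)
  # A delimiter may be split across chunks, so hold back the last
  # delimiter.len - 1 bytes and only release what is certainly payload.
  let keep = min(s.window.len, s.delimiter.len - 1)
  let safe = s.window.len - keep
  if safe > 0:
    result.add((kind: evData, data: s.window.substr(0, safe - 1)))
    s.window = s.window.substr(safe)
```

Whatever is still in `window` when the request ends would have to be flushed as data by the caller; that step is left out here.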

ITwrx commented 7 months ago

The misunderstanding is due to the fact that "a chunk" here means simply any packet of bytes that the OS can offer. receiveMultipart() is a very low-level procedure for asking for one such chunk, and receiveInChunks() is just a slightly higher-level iterator for getting those chunks as a stream. The stream of byte chunks might then represent multipart/form-data (or it might not), but the relation between the chunks and the form parts is nontrivial:

ahh, ok.

even the boundary marker between the parts might end up in different chunks.

wow. :smile:

I have tried to include some super elementary support for the multipart/form-data format, just for cases when there is only one part (like one HTTP request or one file upload). Your use case is already beyond the current capabilities. But I hear you. This has to be supported somehow. A solution presumably requires keeping a window of recently received data, and thus this is a bit challenging to design.

Oh, ok. Yeah, i already have a form with two video files and various other inputs (text, radios, etc), just fyi.

I will find a slot for working on this in the next couple of weeks.

That's great news! Feel free to provide any donation method info. You could email it directly to me using the contact form at itwrx.org if you'd prefer. Otherwise, please accept my enthusiastic moral support. :smile:

Thanks again.

olliNiinivaara commented 7 months ago

This gist includes a proof of concept that a streaming multipart server is doable, and a familiar test for it:

https://gist.github.com/olliNiinivaara/bd672a5e3d2c5c17691ea87846b82edf

The thing here is the new PartStart state, that gets called (with an empty chunk) whenever the part changes. Please, try this out with some more complicated test cases and give feedback whether the overall design works for you.

For now, I have to leave developing a parser for the multipart headers to others. It is a more tedious than challenging task, and solutions to this should already exist. PRs welcome.

I shall be refactoring this towards something more publishable, as time permits.

ITwrx commented 7 months ago

This gist includes a proof of concept that a streaming multipart server is doable, and a familiar test for it:

https://gist.github.com/olliNiinivaara/bd672a5e3d2c5c17691ea87846b82edf

The thing here is the new PartStart state, that gets called (with an empty chunk) whenever the part changes. Please, try this out with some more complicated test cases and give feedback whether the overall design works for you.

I'll give it a shot soon. thanks.

For now, I have to leave developing a parser for the multipart headers to others. It is a more tedious than challenging task, and solutions to this should already exist. PRs welcome.

Do you have any idea (from a high level) how you might approach this if writing one from scratch for Guildenstern?

I shall be refactoring this towards something more publishable, as time permits.

sure, thanks for the sneak peek. :smile:

olliNiinivaara commented 7 months ago

let's see...

Parsing the part can be divided into three separate tasks: getting the header, splitting the header fields, and parsing individual header values.

getting the header

when PartStart is received, (previous body is processed and) you are starting to get the next header
keep adding chunks to a header string, until empty line (CRLFCRLF) is received
keep everything before empty line in header string, and move everything after empty line to new body
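
The three steps above could be sketched roughly like this. `PartReader`, `startPart` and `feed` are made-up names for illustration (not Guildenstern API); the caller is assumed to call `startPart` on PartStart and `feed` on every Progress chunk.

```nim
import std/strutils

type
  PartReader = object    # hypothetical helper, not part of Guildenstern
    header: string       # header bytes collected so far for the current part
    headerDone: bool     # true once the empty line (CRLFCRLF) has been seen

proc startPart(r: var PartReader) =
  ## Call when PartStart is received: a new header begins.
  r.header.setLen(0)
  r.headerDone = false

proc feed(r: var PartReader, chunk: string): string =
  ## Returns the body bytes contained in `chunk` (possibly none).
  if r.headerDone:
    return chunk                     # header already complete: all of it is body
  r.header.add chunk
  let pos = r.header.find("\r\n\r\n")
  if pos < 0:
    return ""                        # empty line not seen yet, keep collecting
  result = r.header.substr(pos + 4)  # everything after the empty line is body
  r.header.setLen(pos)               # everything before it stays in the header
  r.headerDone = true
```

Searching the accumulated header rather than the single chunk means a CRLFCRLF that is split across two chunks is still found.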

splitting the header data into a list of header field - header value -pairs

use parseHeaders as inspiration (it can later be refactored to a generic version suitable for both these cases): https://github.com/olliNiinivaara/GuildenStern/blob/51df0af6a52f7d29a40156b1b5f1ac1085f48add/guildenstern/httprequest.nim#L147
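
As a rough illustration of this step (not the linked parseHeaders implementation), a part header could be split into name/value pairs with the standard library alone. `splitHeaderFields` is a hypothetical name.

```nim
import std/strutils

proc splitHeaderFields(header: string): seq[tuple[name, value: string]] =
  ## Splits a part header into (field name, field value) pairs.
  ## Field names are lowercased because HTTP header names are case-insensitive.
  for line in header.splitLines():
    let colon = line.find(':')
    if colon <= 0: continue          # skip empty or malformed lines
    result.add((name: line[0 ..< colon].strip().toLowerAscii(),
                value: line.substr(colon + 1).strip()))
```

For the file part shown later in this thread, this would give a `content-disposition` entry and a `content-type` entry.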

parsing individual header values

Guildenstern should eventually support getting out the directives of the Content-Disposition header; parsing other header types (like Content-Type) we can maybe leave for the end user to struggle with.
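
A naive sketch of pulling the directives out of a Content-Disposition value. `parseContentDisposition` is a hypothetical name, and the simple split on `;` is an assumption that breaks if a quoted filename itself contains a semicolon.

```nim
import std/[strutils, tables]

proc parseContentDisposition(value: string):
    tuple[kind: string, directives: Table[string, string]] =
  ## "form-data; name=\"file1\"; filename=\"a.mp4\"" becomes
  ## kind "form-data" plus the directives name and filename.
  let pieces = value.split(';')
  result.kind = pieces[0].strip().toLowerAscii()
  for i in 1 ..< pieces.len:
    let eq = pieces[i].find('=')
    if eq < 0: continue                      # directive without a value: ignored
    let key = pieces[i][0 ..< eq].strip().toLowerAscii()
    var val = pieces[i].substr(eq + 1).strip()
    if val.len >= 2 and val[0] == '"' and val[^1] == '"':
      val = val[1 .. ^2]                     # drop the surrounding quotes
    result.directives[key] = val
```

A file input that was left empty in the form arrives with `filename=""`, so an empty `filename` directive is a normal case rather than an error.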

ITwrx commented 6 months ago

I should have brushed up on these specs before starting any of this, but in my defense, i didn't really remember/know what i was looking for... :smile: When you say "chunks" in the receiveInChunks() proc you mean chunks from the OS network stack, while i meant these chunks.

It also seems that chunked transfer has to be decided/initiated from the client side.

It looks like the current Guildenstern multipart server example expects a regular multipart with a Content-Length header, and not chunked transfer encoding where the whole size is not known; only each chunk's size. I would like to use chunked transfer encoding for larger files like videos.
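
For reference, this is roughly what chunked transfer encoding looks like on the wire: each chunk is a hexadecimal size line, CRLF, that many bytes of data, CRLF, and a zero-size chunk ends the body. The sketch below decodes a body that is already fully in memory, so it only illustrates the framing and is not a streaming implementation; `decodeChunked` is a hypothetical name and trailers are ignored.

```nim
import std/[strutils, parseutils]

proc decodeChunked(raw: string): string =
  ## Decodes a complete chunked-transfer-encoded body held in memory.
  var pos = 0
  while pos < raw.len:
    let lineEnd = raw.find("\r\n", pos)
    if lineEnd < 0:
      raise newException(ValueError, "missing chunk-size line")
    var size = 0
    # parseHex stops at the first non-hex character, so any chunk
    # extension after ';' on the size line is simply skipped.
    if parseHex(raw, size, pos) == 0:
      raise newException(ValueError, "invalid chunk size")
    if size == 0: break                      # last-chunk marker
    let start = lineEnd + 2
    if start + size > raw.len:
      raise newException(ValueError, "truncated chunk")
    result.add raw.substr(start, start + size - 1)
    pos = start + size + 2                   # skip the CRLF after the data
```

Note that the per-chunk sizes are chosen by the sender, which is a different thing from the OS-level chunks that receiveInChunks() yields.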

when you say:

## Receives a http request in chunks, yielding the state of operation and a possibly received new chunk on every iteration.
## With this, you can receive POST data without worries about main memory usage.

I guess you mean it will handle the multipart request in its own thread with its own heap, so the other server threads are not affected, but overall app ram is still being used for the traffic data chunks, and increases until the whole file is uploaded? That ram usage was what i was trying to avoid with other http servers. It would probably be fine for some image uploads, but for videos, not so much. Am i mistaken, or would using chunked transfer encoding from the client side, and a temp file on disk on the server side, allow you to only use ram equivalent to each chunk, without it accumulating for the whole file? Shouldn't that just be handled by Guildenstern automatically if chunked transfer encoding is being used from the client side? IOW, shouldn't Guildenstern handle both regular multipart with Content-Length, and chunked transfer encoding? Or am i still missing something?

I also wonder where and how server-side file size and file type validation would be handled when chunked transfer encoding is used... even if validation is a user framework-level feature, i still wonder how/if it needs to be considered as part of the design, and possibly facilitated in some way.

Thanks

ITwrx commented 6 months ago

and i guess you're planning on staying with http1_1 for the foreseeable future? http2 evidently does this streaming/chunk stuff differently, as you probably already know.

olliNiinivaara commented 6 months ago

The original httpserver uses a buffer with maximum size, and only offers the request to you when it is received as a whole. Very convenient, but you need to set the maximum size in advance, and too large buffers will eat all your main memory.

But any streaming server will serve a request to you in parts, without first buffering the whole request. This way you can process in principle requests of unlimited length, but you need a way to process the parts online (for example, append them to a file).

Now, Multipartserver is designed to serve parts of the request to you as whole, so it is a mix between very low level streaming server with no buffering, and a very high level server that tries to buffer everything.

Multipart encoding is used a lot, so it makes sense to support it. The chunked transfer encoding that you are pointing to is rarer. As you mention, it may even be obsolete now that transfer between a browser and your reverse proxy is HTTP/3.

But you can serve your large video files just as well with multipart encoding. Use that.

The only case where chunked transfer encoding might make sense is sending dynamic streams (no files at all). But as far as I know, browsers do not support such uploads; you need to know the Content-Length. Once the multipart server works, it should be easy to make a similar version for chunked transfer encoding. But before that, you really should check whether it would be useful to you in the end.

There also seems to be an upcoming feature that will allow browsers to support dynamic uploads: https://caniuse.com/mdn-api_request_request_request_body_readablestream

Also, the HTTP standard does not strictly require using the Content-Length at all. Maybe a streaming server that just does not use it would be enough for you. This requires that you are coding your own client (and not connecting with a browser), so that you can implement your very own content encoding standard.

But anyway, let's first make the multipart server nice to work with, that is, a server that supports sending many data parts, including but not limited to massive-sized files.

ITwrx commented 6 months ago

Now, Multipartserver is designed to serve parts of the request to you as whole, so it is a mix between very low level streaming server with no buffering, and a very high level server that tries to buffer everything. ... But you can serve your large video files just as well with multipart encoding. Use that.

Multipart is also simpler on the client side, but how is the ram handled in this situation, assuming a large file upload, for instance?

Chunked transfer encoding (or similar newer method) also allows the server operator to keep reasonable upload size/mem limits while still allowing large file uploads. I'm all for just using multipart if i'm not allowing one user to DoS the whole server.

olliNiinivaara commented 6 months ago

Fear of DoS is unfounded. Before the streaming starts, you can inspect the Content-Length and decline too large files. The streaming ends when that amount is reached; a malicious client cannot exceed that.
You can set the maximum buffer size for the individual chunks (see lines 111-114 in gist; due to historical reasons the buffer size is called server.maxrequestlength). If you have a very small amount of ram (server running on an embedded system or something), you set this buffer to a tiny size. When a chunk is received, you process it (append to file or whatever) and the buffer is then reused. This way you can process file sizes that far exceed the amount of ram. And you can still close the socket any time during the streaming, if some resource limit is in danger.
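
A small sketch of this pattern with plain Nim file I/O: refuse the upload up front based on the declared Content-Length, then append each chunk to a file so that only one chunk is in memory at a time. `UploadSink`, `openSink`, `append` and `tempUploadPath` are hypothetical names, and how the Content-Length value and the chunks are obtained from Guildenstern is deliberately left out.

```nim
import std/os

type
  UploadSink = object   # hypothetical helper, not part of Guildenstern
    file: File
    declared: int       # value of the request's Content-Length header
    received: int       # bytes written so far

proc tempUploadPath(name: string): string =
  ## Location for the partially received file (assumed: the system temp dir).
  getTempDir() / name

proc openSink(path: string, declared, maxBytes: int): UploadSink =
  ## Declines too large uploads before a single body byte is processed.
  if declared > maxBytes:
    raise newException(ValueError, "upload exceeds the configured limit")
  result = UploadSink(file: open(path, fmWrite), declared: declared)

proc append(s: var UploadSink, chunk: string): bool =
  ## Writes one chunk to disk. Returns false if the client tries to send
  ## more than it declared, in which case the caller can close the socket.
  if s.received + chunk.len > s.declared:
    return false
  s.file.write(chunk)
  s.received += chunk.len
  true

proc finish(s: var UploadSink) =
  s.file.close()
```

Memory use then depends on the chunk buffer size, not on the size of the uploaded file.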

ITwrx commented 6 months ago

When a chunk is received, you process it (append to file or whatever) and the buffer is then reused. This way you can process file sizes that far exceed the amount of ram. And you can still close the socket any time during the streaming, if some resource limit is in danger.

Great, thanks for the info. This was the part i wasn't sure about.

ITwrx commented 6 months ago

I don't understand the name and body in the gist. If i do this:

(starts at line 20 of test.nim, from your gist.)

      of Complete:
        #echo "name: ", name
        #echo "body: ", body   
        if name.len > 0:
          let nameLinesSeq = splitlines(name)
          for nameLineSeq in nameLinesSeq:
            if nameLineSeq.len > 0:
              echo "nameLineSeq: " & nameLineSeq
        if body.len > 0:
          let bodyLinesSeq = splitlines(body)
          for bodyLineSeq in bodyLinesSeq:
            if bodyLineSeq.len > 0:
              echo "bodyLineSeq: " & bodyLineSeq

I get:

nameLineSeq: Content-Disposition: form-data; name="file1"; filename=""
nameLineSeq: Content-Type: application/octet-stream
nameLineSeq: Content-Disposition: form-data; name="msg"
nameLineSeq: the msg is me
bodyLineSeq: Content-Disposition: form-data; name="subject"
bodyLineSeq: the subject is me
bodyLineSeq: Content-Disposition: form-data; name="file2"; filename=""
bodyLineSeq: Content-Type: application/octet-stream

(my form now has two of each field type for testing)

It's the same data, but just for different fields. I don't understand what the delineation is between name and body in the gist. why isn't it just var part: string and then add the chunks to part?

olliNiinivaara commented 6 months ago

Come on now, the gist tries to faithfully replicate your own example. Don't ask me, ask yourself why isn't it just var part: string. You have yourself decided to call the parts "name" and "body" on the back-end, which you call "msg" and "file" on the front-end. You cannot add fields to the front-end and expect that the back-end continues to make sense. The gist is just a PoC based on your example, nothing more.

ITwrx commented 6 months ago

Sorry for the noise. i didn't realize i had changed that (name, body) from your streamingtest.nim and streamingposttest.nim at some point. I'll try to remember to double check for my changes in the future.

ITwrx commented 5 months ago

I've been working on this, and i have a seemingly working basic/testing implementation for parsing the multipart/form-data and saving the uploaded files, but my saved files are always corrupted. I found they are slightly larger than the source upload file: approx. 10 bytes larger per 1 MB. Smaller text files and pdfs will open and display fine (even though they are a few bytes larger than the source file), but videos get corrupted and the larger they are, the worse it gets. Smaller ones will play with artifacts, but larger videos won't even play. I thought maybe it was how i was saving the chunks to tempfile or moving to permanent file, but after testing different things with the saving, and finding no difference in behavior, i just echo'd out a raw chunk (as in):

for (state , chunk) in receiveParts(): echo chunk (simplified version)

and copied it to txt file and ran diff on it and the source file, and there are (seemingly random) duplicated characters/bytes in the chunks (unmodified by my code), that do not exist in the source file, and therefore in my reconstituted, saved files.
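
For binary files such as videos, where a text diff is of little use, a byte-level comparison can locate the first corrupted offset. A minimal sketch, with `firstDifference` as a hypothetical helper; it takes both contents as strings (for example from `readFile`), so it is only practical for files that fit in memory.

```nim
proc firstDifference(a, b: string): int =
  ## Returns the index of the first byte where `a` and `b` differ,
  ## or -1 if they are identical. If one is a prefix of the other,
  ## the length of the shorter one is returned.
  let n = min(a.len, b.len)
  for i in 0 ..< n:
    if a[i] != b[i]:
      return i
  result = if a.len != b.len: n else: -1

# Usage (paths are placeholders):
# echo firstDifference(readFile("source.mp4"), readFile("uploaded.mp4"))
```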

I don't know if there's something that could be done in the multipart gist, or it's upstream in the std lib, or what.

Thanks

ITwrx commented 5 months ago

and this is what i came up with/am using for parsing the multipart/form-data so far.

import guildenstern/[dispatcher, multipartserver]
import std/[strutils, tempfiles, files, paths]

proc handleUpload() =  
  type
    Part = object
      fieldName, textFieldValue, contentType, fileName, fileBody, tempFilePath: string
      tempFileHandle: File
  var part: Part    
  var parts: seq[Part]   
  var acceptedFileTypes = @["image/png", "image/jpg", "image/jpeg", "application/pdf", "video/webm", "video/mp4", "text/plain"]
  var trials = 0
  if not startReceiveMultipart():
    reply(Http500)
    return
  for (state , chunk) in receiveParts():
    var fieldName, textFieldValue, contentType, fileName, fileBody, prefix, varChunk: string
    var chunkLines, stringChunkLineSeq = @[""]    
    var createTempFileResult: tuple[cfile: File, path: string]
    case state:
      of TryAgain:
        trials += 1
        if trials > 100: (closeSocket() ; break)
      of Fail: reply(Http400)
      of PartStart: continue
      of Progress:
        chunkLines = chunk.splitLines
        for chunkLine in chunkLines:
          if chunkLine.len > 0:
            stringChunkLineSeq.add(chunkLine)
        stringChunkLineSeq.delete(0)
        #check if this is first chunk of a part or not.
        if "Content-Disposition" in stringChunkLineSeq[0]:
          #could be text input, or file input with no file.
          if stringChunkLineSeq.len() >= 2: 
            if "filename=" in stringChunkLineSeq[0]:
              fieldName = stringChunkLineSeq[0].split(';')[1].split("name=")[1][1..^2]
              if fieldName.len() > 0:
                part.fieldName = fieldName
              else:
                part.fieldName = ""                
              fileName = stringChunkLineSeq[0].split(';')[2].split("filename=")[1][1..^2]
              if fileName.len() > 0:
                part.fileName = fileName
              else:
                part.fileName = ""
              contentType = stringChunkLineSeq[1][14..^1]
              if contentType.len() > 0:
                part.contentType = contentType
              else:
                part.contentType = ""
              part.textFieldValue = ""
            else:
              fieldName = stringChunkLineSeq[0].split("name=")[1][1..^2]
              part.fieldName = fieldName
              textFieldValue = stringChunkLineSeq[1]
              part.textFieldValue = textFieldValue
              part.fileName = ""
              part.contentType = ""
          if stringChunkLineSeq.len() == 3:
            fileBody = stringChunkLineSeq[2]
            part.fileBody = fileBody

          elif stringChunkLineSeq.len() > 3:
            varChunk = chunk
            prefix = "\r\n" & stringChunkLineSeq[0] & "\r\n" & stringChunkLineSeq[1] & "\r\n\r\n"
            varChunk.removePrefix(prefix)
            part.fileBody = varChunk
          else:
            part.fileBody = ""

          if part.fileBody.len() > 0:
            if part.contentType in acceptedFileTypes:
              try:
                createTempFileResult = createTempFile("upload_", "_muhapp.tmp")
                part.tempFilePath = createTempFileResult.path
                part.tempFileHandle = createTempFileResult.cfile
                try:
                  part.tempFileHandle.write(part.fileBody)
                except IOError:
                  echo "could not save fileBody to tmp file"
              except OSError:
                echo "could not create temp file" 
            else:
              echo "contentType not accepted"
          else:
            part.tempFilePath = ""

          parts.add(part)
        #not first chunk. more chunks of a fileBody.
        else:
          varChunk = chunk          
          try:
            part.tempFileHandle.write(varChunk)
          except IOError:
            echo "could not add chunk to tmp file"
      of Complete:
        for part in parts:
          if part.tempFilePath.len() > 0:
            try:              
              close part.tempFileHandle              
              let uploadDir = Path "/var/www/muhapp/uploads/" & part.filename
              let source = Path part.tempFilePath
              try:
                files.moveFile(source, uploadDir)
              #had to use just "Exception" here to not cause "raises" error. Don't remember what the deal is with this...
              except Exception as e:
                echo e.msg
                #echo "Could not move uploaded file to uploads directory"
            except IOError:
              echo "could not close tmp file"            

        reply(Http204)
  #shutdown()

proc onRequest() =
  let html = """<!doctype html><title>StreamCtx</title><body>
  <form action="/upload" method="post" enctype="multipart/form-data" accept-charset="utf-8">
  <label for="file1">File:</label>
  <input type="file" id="file1" name="file1"><br>
  <label for="subject">Subject:</label>
  <input type="text" id="subject" name="subject"><br>
  <label for="msg">Msg:</label>
  <input type="text" id="msg" name="msg"><br>
  <label for="file2">File:</label>
  <input type="file" id="file2" name="file2"><br>
  <input type="submit">"""

  if startsUri("/upload"): handleUpload()
  else: reply(Http200, html)

let server = newMultipartServer(onRequest)
server.start(5050)
joinThread(server.thread)

olliNiinivaara commented 5 months ago

I take your replies as a corroboration that the basic design is working and achieves what you need. I will now start to fix errors and refactor the code. There is a version7 branch that will receive the work-in-progress. Thanks for your spadework and bug reports. In this case there is no "upstream"; all the bugs can be found in the gist (or your code). Fortunately that is not much code, so all I need is some time and some determination...

ITwrx commented 5 months ago

I take your replies as a corroboration that the basic design is working and achieves what you need.

yes, it seems to be working well, besides the duplicated characters/bytes. Ram usage appeared to stay low during large file uploads in my limited testing. I do notice that sometimes when refreshing the form page of the gist (or maybe my version of it), the page never loads (or takes a looong time?). Maybe it's just after previous uploads or after having been running for a while? I have to stop it and revisit directly by hitting enter on the address bar instead of the refresh button. I don't know what that's about. Just thought you should know in case there's some server/gist issue and/or you wanted to try that during your dev. I have not investigated at all, so sorry if this is not valid info in some way and/or feel free to ignore until an actual report is submitted down the line. Just a "heads up".

I will now start to fix errors and refactor the code. There is a version7 branch that will receive the work-in-progress.

sweet!

Thanks for your spadework and bug reports.

Sure! I'm glad to help. I'd never done this before, so it wasn't obvious to me how to approach/test. :)

In this case there is no "upstream"; all the bugs can be found in the gist (or your code).

That's good to hear. I was worried there was something going on with nim, and/or its interaction with the network stack.

Fortunately that is not much code, so all I need is some time and some determination...

Well, i'm rooting for you! If there's anything i can do, please let me know.

olliNiinivaara commented 5 months ago

OK, switch to branch version7, use multiparttest.nim from the examples folder as guidance, and try things out. I especially would like to know whether duplicated characters are still an issue. If not, the multipart thingy should now be quite stable. But I still have to rewrite documentation and improve some other features as well, before version 7 becomes publishable.

ITwrx commented 5 months ago

wow! That was fast. I thought i was going to get a longer mini-vacation. :smile: I'll try it out soon and let you know. Thanks!

olliNiinivaara commented 5 months ago

Please, checkout back to master. My finger is now on the "make new release" button, waiting for your approval.

ITwrx commented 5 months ago

I haven't checked out and tested the new master yet, but i tested the v7 multipart yesterday, and it seems to be working perfectly. My uploads were exactly the same as the source files, byte for byte. It looks like you baked in some/all of the functionality i was trying to accomplish in my implementation and in the process made Guildenstern's multipart feature more user-friendly too. Very nice! It doesn't appear to be using ANY extra ram for large uploads(GBs), so i was still looking into that just trying to see it use the ram for the chunks. :smile:

Your redesigning of some elements of Guildenstern for v7 to take all the new changes into account looks like a nice improvement, as well. Feel free to release v7 when you're ready. Thanks though! I will probably just port my application to it, and continue development, as my way of v7 testing. Thanks for your work! The Nim web ecosystem finally has proper production-grade file upload and a solid base for more "real world" apps and more elaborate, or sophisticated, web frameworks. Very cool!

olliNiinivaara commented 5 months ago

Time to share our efforts with the community, then.
