mscdex / busboy

A streaming parser for HTML form data for node.js
MIT License
2.84k stars 213 forks source link

Spaces in Content-Type boundary parameters results in Error: Malformed content type #321

Closed roblperry closed 1 year ago

roblperry commented 1 year ago

Sending a boundary containing whitespace (e.g. Content-Type,multipart/form-data;boundary=HyperLog -1663179571660) will cause Busboy to error.

Error: Malformed content type
    at getInstance (/layers/google.nodejs.yarn/yarn_modules/node_modules/busboy/lib/index.js:9:11)
    at module.exports (/layers/google.nodejs.yarn/yarn_modules/node_modules/busboy/lib/index.js:56:10)

I manually confirmed this by calling parseContentType directly.

const samples = [
  'multipart/form-data;boundary=TestCase-1663179571660',
  'multipart/form-data;boundary=TestCase-1663179571660 ',
  'multipart/form-data;boundary=testcase-1663179571660',
  'multipart/form-data;boundary=testcase-1663179571660 ',
  'multipart/form-data;boundary=TestCase 1663179571660',
  'multipart/form-data;boundary=testcase 1663179571660'
]

for (let sample of samples) {
  const obj = parseContentType(sample);
  console.log(`${sample} -> ${JSON.stringify(obj)}`)
}

resulting in

multipart/form-data;boundary=TestCase-1663179571660 -> {"type":"multipart","subtype":"form-data","params":{"boundary":"TestCase-1663179571660"}}
multipart/form-data;boundary=TestCase-1663179571660  -> {"type":"multipart","subtype":"form-data","params":{"boundary":"TestCase-1663179571660"}}
multipart/form-data;boundary=testcase-1663179571660 -> {"type":"multipart","subtype":"form-data","params":{"boundary":"testcase-1663179571660"}}
multipart/form-data;boundary=testcase-1663179571660  -> {"type":"multipart","subtype":"form-data","params":{"boundary":"testcase-1663179571660"}}
multipart/form-data;boundary=TestCase 1663179571660 -> undefined
multipart/form-data;boundary=testcase 1663179571660 -> undefined

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type says:

boundary For multipart entities the boundary directive is required. The directive consists of 1 to 70 characters from a set of characters (and not ending with white space) known to be very robust through email gateways. It is used to encapsulate the boundaries of the multiple parts of the message. Often, the header boundary is prepended with two dashes and the final boundary has two dashes appended at the end.

https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html says:

The only mandatory parameter for the multipart Content-Type is the boundary parameter, which consists of 1 to 70 characters from a set of characters known to be very robust through email gateways, and NOT ending with white space. (If a boundary appears to end with white space, the white space must be presumed to have been added by a gateway, and should be deleted.) It is formally specified by the following BNF:

boundary := 0*69<bchars> bcharsnospace 

bchars := bcharsnospace / " " 

bcharsnospace :=    DIGIT / ALPHA / "'" / "(" / ")" / "+"  / "_"  / "," / "-" / "." / "/" / ":" / "=" / "?" 

Thus I feel that supporting spaces in boundaries would be an improvement.

mscdex commented 1 year ago

If your boundary contains whitespace, it needs to be properly quoted (like any other parameter), otherwise the parser has no idea whether the next token is part of the previous parameter or a new parameter.

roblperry commented 1 year ago

https://github.com/roblperry/busboy/commit/956eb8dd7e005994c56e2b741253ea4d4063a195 seems to work for me, but I am not that terribly familiar with the standard. So, I recognize that it would be easy for me to be missing important side effects of my changes.

For that matter, I now see that I missed this part: "If a boundary appears to end with white space, the white space must be presumed to have been added by a gateway, and should be deleted." But, I should be able to solve that if this general direction is agreeable.

mscdex commented 1 year ago

Whitespace after an unquoted parameter value should be discarded. Whitespace at the beginning or in the middle of an unquoted parameter value would be a syntax error.

The easiest solution to avoid having to worry about any of these problems is to just always use quoted values.

roblperry commented 1 year ago

I agree that the client library in question would be improved by modifying it to avoid placing a space in the boundary. I can't imagine the benefit of having said space. I also don't want to take up too much of your time. However, I would very much like to point the maintainers of the client library at a standard backing my case, otherwise I am just going to be stuck stating that not all server side implementations support spaces inside the boundary, which isn't a terrible argument. I'd just rather have something stronger.

Clearly, I've looked. Am I looking at the wrong spec? Do you have anything for me on that front?

Either way, thank you for busboy. I rather enjoyed crawling around in the performance optimized parser. I don't think I've played with that style of code since the 90s.

mscdex commented 1 year ago

It's an HTTP header, so you would be referencing the ABNF from RFC 7231 (and RFC 7230 for some reused, general types), starting with the Content-Type token.

roblperry commented 1 year ago

Excellent. Thank you, sir.

For anyone else who happens across this RFC 7231 lead me to:

parameter = token "=" ( token / quoted-string )

where

token = <token, see [RFC7230], Section 3.2.6>

which aligns with what mscdex has been trying to tell me.

     token          = 1*tchar

     tchar          = "!" / "#" / "$" / "%" / "&" / "'" / "*"
                    / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~"
                    / DIGIT / ALPHA