Closed Tronic closed 1 year ago
Thanks for sharing this. After reviewing our implementation, I can definitely see some places we're not parsing according to RFC 9110, such as allowing empty values, quoted keys, and space around the equals.
I did deprecate then remove multiple=True
, so that's no longer an issue, although there's still some cruft to be cleaned up.
I like your idea of replacing the \\"
internal quotes, our regex isn't quite correct because it tries to account for those. If you remember, can you explain how you arrived at firefox_quote_escape
, rather than doing a simple str.replace('\\"', "%22")
?
We also support parameter continuation and value charset syntax from RFC 2231, so I need a little more room to work than the comprehension provides.
The Firefox quote regex apparently is for the case where the quoted value ends with a backslash, e.g. path="c:\"
, where \"
is not an escape sequence for a quote. I did some testing on different browsers back then and this problem probably came up, although it is hard to remember exactly, this much later.
The test provided shows cases where these problems occur with field name containing backslashes, in particular in the last case with Firefox.
I think your quote detection also fails, since \\\\
is used to escape a literal backslash, matching "\\\\"
would incorrectly identify the last two characters as an internal quote, rather than a slash then end of string. There's also an issue with not being able to have the literal string %22
in a value.
In that example you gave, a browser that uses slash escapes should be sending "c:\\\\"
.
Played around with Firefox, Chrome, and curl, and they all substitute %22
. Firefox doesn't use slash encoding. Declaring <!doctype html>
or not doesn't change the behavior. It's also possible to use the literal string %22
, or construct it like %22
, and they all get sent as %22
with no way to distinguish those literal strings from a replaced quote character.
Looks like they're both up to date with the WHATWG HTML Standard https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#multipart-form-data which says to percent encode \n
, \r
and "
in the value.
Sounds like Firefox may have fixed their handling since 2019; it was really problematic. Are you getting different content-disposition from Firefox now vs. the one mentioned with the test? I'll have another look at it, to fix any problems found for the next Sanic framework release coming out this month.
Thanks for investigating this :)
I got the same behavior with Firefox, Chrome, and curl.
Neither browser submits Unicode characters (at least emoji), they substitute HTML character references like "😺.txt"
for πΊ.txt
. But curl does submit Unicode as-is. sigh
I think it's safe to just use the WHATWG HTML Standard for parsing and assume "
will be quoted %22
, and take everything else as-is without further decoding.
Oddly enough, I got unicode characters in my tests (no HTML entity escapes), both for filename and for input value (I suppose for input name as well but I didn't test that), both in Chrome and Firefox. Possibly that depends on document charset (using UTF-8 content-type in my tests).
Sanic now unescapes %22 and %0D%0A (to \n only) but nothing else, as per whatwg, and no longer has special handling for backslashes or "
(Firefox regex removed). If a value contains one of these sequences verbatim, it will be incorrectly unescaped, but I suppose that accidentally hitting %22 in field values would be quite uncommon.
Good call on the encoding, if <meta charset=UTF-8>
is included then browsers send Unicode. Weird that <!doctype html>
on its own doesn't do that, since the HTML Standard says that's the only accepted encoding. I expanded the docs about this header a lot based on this discussion.
My implementation is 2-3 times faster. I have a feeling it's due to the much simpler regex, I now do a "basic" parsing for key=value
, then two much simpler checks for keys with continuation or charset markers.
I didn't remove handling for slash escapes, since RFC 9110 says HTTP headers use those, while multipart form data is described by HTML Standard to use percent. But I figured out a much simpler way to handle slash escapes: "(?:\\\\|\\"|.)*?"
matches surrounding quotes but consumes internal slash escapes.
Check out #2614 for my implementation. You can ignore the continuation and charset handling, I don't think modern clients send that.
I wrote a new version of cgi.parse_header / werkzeug.parse_options_header with better handling for various special cases that unfortunately occur. The new implementation is also twice faster than the two others.
This does not support werkzeug.parse_options_header's multiple=True, which I didn't see being used anywhere anyway (is that for accept-header or what?). This is written for Sanic, but if anyone cares to port it to Werkzeug, the code is there (sorry -- don't have time to do it myself). Adding support for multiple=True is a few lines of code (with proper handling of quoted commas), if deemed useful.
CPython cgi.parse_header tests as well as two others that I added based on testing real browsers all pass with this: