parse_options_header quote handling

Tronic commented 5 years ago

I wrote a new version of cgi.parse_header / werkzeug.parse_options_header with better handling for various special cases that unfortunately occur. The new implementation is also twice faster than the two others.

import re
import typing

Options = typing.Dict[str, str]  # key=value fields in various headers

token, quoted = r"([\w!#$%&'*+\-.^_`|~]+)", r'"([^"]*)"'
parameter = re.compile(fr";\s*{token}=(?:{token}|{quoted})", re.ASCII)
firefox_quote_escape = re.compile(r'\\"(?!; |\s*$)')

# RFC's quoted-pair escapes are mostly ignored by browsers. Chrome, Firefox and
# curl all have different escaping, that we try to handle as well as possible,
# even though no client espaces in a way that would allow perfect handling.

def parse_content_header(value: str) -> typing.Tuple[str, Options]:
    """Parse content-type and content-disposition header values.

    E.g. 'form-data; name=upload; filename=\"file.txt\"' to
    ('form-data', {'name': 'upload', 'filename': 'file.txt'})

    Mostly identical to cgi.parse_header and werkzeug.parse_options_header
    but runs faster and handles special characters better. Unescapes quotes.
    """
    value = firefox_quote_escape.sub("%22", value)
    pos = value.find(";")
    if pos == -1:
        options = {}
    else:
        options = {
            m.group(1).lower(): m.group(2) or m.group(3).replace("%22", '"')
            for m in parameter.finditer(value[pos:])
        }
        value = value[:pos]
    return value.strip().lower(), options

This does not support werkzeug.parse_options_header's multiple=True, which I didn't see being used anywhere anyway (is that for accept-header or what?). This is written for Sanic, but if anyone cares to port it to Werkzeug, the code is there (sorry -- don't have time to do it myself). Adding support for multiple=True is a few lines of code (with proper handling of quoted commas), if deemed useful.

CPython cgi.parse_header tests as well as two others that I added based on testing real browsers all pass with this:

import pytest

from sanic import headers

@pytest.mark.parametrize(
    "input, expected",
    [
        ("text/plain", ("text/plain", {})),
        ("text/vnd.just.made.this.up ; ", ("text/vnd.just.made.this.up", {})),
        ("text/plain;charset=us-ascii", ("text/plain", {"charset": "us-ascii"})),
        ('text/plain ; charset="us-ascii"', ("text/plain", {"charset": "us-ascii"})),
        (
            'text/plain ; charset="us-ascii"; another=opt',
            ("text/plain", {"charset": "us-ascii", "another": "opt"})
        ),
        (
            'attachment; filename="silly.txt"',
            ("attachment", {"filename": "silly.txt"})
        ),
        (
            'attachment; filename="strange;name"',
            ("attachment", {"filename": "strange;name"})
        ),
        (
            'attachment; filename="strange;name";size=123;',
            ("attachment", {"filename": "strange;name", "size": "123"})
        ),
        (
            'form-data; name="files"; filename="fo\\"o;bar\\"',
            ('form-data', {'name': 'files', 'filename': 'fo"o;bar\\'})
            # cgi.parse_header:
            # ('form-data', {'name': 'files', 'filename': 'fo"o;bar\\'})
            # werkzeug.parse_options_header:
            # ('form-data', {'name': 'files', 'filename': '"fo\\"o', 'bar\\"': None})
        ),
        # <input type=file name="foo&quot;;bar\"> with Unicode filename!
        (
            # Chrome:
            # Content-Disposition: form-data; name="foo%22;bar\"; filename="😀"
            'form-data; name="foo%22;bar\\"; filename="😀"',
            ('form-data', {'name': 'foo";bar\\', 'filename': '😀'})
            # cgi: ('form-data', {'name': 'foo%22;bar"; filename="😀'})
            # werkzeug: ('form-data', {'name': 'foo%22;bar"; filename='})
        ),
        (
            # Firefox:
            # Content-Disposition: form-data; name="foo\";bar\"; filename="😀"
            'form-data; name="foo\\";bar\\"; filename="😀"',
            ('form-data', {'name': 'foo";bar\\', 'filename': '😀'})
            # cgi: ('form-data', {'name': 'foo";bar"; filename="😀'})
            # werkzeug: ('form-data', {'name': 'foo";bar"; filename='})
        ),
    ]
)
def test_parse_headers(input, expected):
    assert headers.parse_content_header(input) == expected

davidism commented 1 year ago

Thanks for sharing this. After reviewing our implementation, I can definitely see some places we're not parsing according to RFC 9110, such as allowing empty values, quoted keys, and space around the equals.

I did deprecate then remove multiple=True, so that's no longer an issue, although there's still some cruft to be cleaned up.

I like your idea of replacing the \\" internal quotes, our regex isn't quite correct because it tries to account for those. If you remember, can you explain how you arrived at firefox_quote_escape, rather than doing a simple str.replace('\\"', "%22")?

We also support parameter continuation and value charset syntax from RFC 2231, so I need a little more room to work than the comprehension provides.

Tronic commented 1 year ago

The Firefox quote regex apparently is for the case where the quoted value ends with a backslash, e.g. path="c:\", where \" is not an escape sequence for a quote. I did some testing on different browsers back then and this problem probably came up, although it is hard to remember exactly, this much later.

The test provided shows cases where these problems occur with field name containing backslashes, in particular in the last case with Firefox.

davidism commented 1 year ago

I think your quote detection also fails, since \\\\ is used to escape a literal backslash, matching "\\\\" would incorrectly identify the last two characters as an internal quote, rather than a slash then end of string. There's also an issue with not being able to have the literal string %22 in a value.

In that example you gave, a browser that uses slash escapes should be sending "c:\\\\".

davidism commented 1 year ago

Played around with Firefox, Chrome, and curl, and they all substitute %22. Firefox doesn't use slash encoding. Declaring <!doctype html> or not doesn't change the behavior. It's also possible to use the literal string %22, or construct it like %22, and they all get sent as %22 with no way to distinguish those literal strings from a replaced quote character.

davidism commented 1 year ago

Looks like they're both up to date with the WHATWG HTML Standard https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#multipart-form-data which says to percent encode \n, \r and " in the value.

Tronic commented 1 year ago

Sounds like Firefox may have fixed their handling since 2019; it was really problematic. Are you getting different content-disposition from Firefox now vs. the one mentioned with the test? I'll have another look at it, to fix any problems found for the next Sanic framework release coming out this month.

Thanks for investigating this :)

davidism commented 1 year ago

I got the same behavior with Firefox, Chrome, and curl.

davidism commented 1 year ago

Neither browser submits Unicode characters (at least emoji), they substitute HTML character references like "😺.txt" for 😺.txt. But curl does submit Unicode as-is. sigh

I think it's safe to just use the WHATWG HTML Standard for parsing and assume " will be quoted %22, and take everything else as-is without further decoding.

Tronic commented 1 year ago

Oddly enough, I got unicode characters in my tests (no HTML entity escapes), both for filename and for input value (I suppose for input name as well but I didn't test that), both in Chrome and Firefox. Possibly that depends on document charset (using UTF-8 content-type in my tests).

Sanic now unescapes %22 and %0D%0A (to \n only) but nothing else, as per whatwg, and no longer has special handling for backslashes or " (Firefox regex removed). If a value contains one of these sequences verbatim, it will be incorrectly unescaped, but I suppose that accidentally hitting %22 in field values would be quite uncommon.

davidism commented 1 year ago

Good call on the encoding, if <meta charset=UTF-8> is included then browsers send Unicode. Weird that <!doctype html> on its own doesn't do that, since the HTML Standard says that's the only accepted encoding. I expanded the docs about this header a lot based on this discussion.

My implementation is 2-3 times faster. I have a feeling it's due to the much simpler regex, I now do a "basic" parsing for key=value, then two much simpler checks for keys with continuation or charset markers.

I didn't remove handling for slash escapes, since RFC 9110 says HTTP headers use those, while multipart form data is described by HTML Standard to use percent. But I figured out a much simpler way to handle slash escapes: "(?:\\\\|\\"|.)*?" matches surrounding quotes but consumes internal slash escapes.

davidism commented 1 year ago

Check out #2614 for my implementation. You can ignore the continuation and charset handling, I don't think modern clients send that.

pallets / werkzeug

parse_options_header quote handling #1628