pallets / werkzeug

The comprehensive WSGI web application library.
https://werkzeug.palletsprojects.com
BSD 3-Clause "New" or "Revised" License
6.66k stars 1.73k forks source link

If the "Accept" header contains a media-type with a quoted parameter including a space, the resulting MIMEAccept instance is incorrect #1623

Closed exhuma closed 1 year ago

exhuma commented 5 years ago

According to RFC 2045 Section 5.1, mime-type parameters can be quoted. As additional reference, the quoted-string is defined in RFC 7230 Section 3.2.6 (which is the same as in RFC 2616 Section 2.2)

This means that the following media-type is a valid media-type string:

fictitious/media-type; myparam="some value"

Currently, werkzeug sees this as fictitious/media-type; myparam="some, which includes the first double-quote, but drops the rest. Additionally, the parsed MIMEAccept instance also contains an entry for ('value"', 1). In other words, the above media-type results in the following two entries in the MIMEAccept instance:

Appending q=0.5 results in this:

I would expect it to contain just one entry as ('fictitious/media-type; myparam="some value"', 1)

We currently get one such media-type in an accept header and cannot use the normal werkzeug "accept" mechanism which makes this cumbersome. I just discovered this so I still need to write a workaround. I am considering appending the solution as a PR to this ticket. I already have a compliant parser lying around, which is based on a very simple state-based parser.

exhuma commented 5 years ago

For reference, this is the parser I currently use in a different project (it does not yet handle escaped double-quotes though as I'd need a lookbehind):

def argparser(data: str) -> Generator[Tuple[str, str], None, None]:
    """
    Given a string like ``foo=bar; bar=baz; bla="blubb"``, produce a stream of
    key/value pairs.

    This parser is used to parse RFC-2045 (Section 5.1) compliant mimetype
    parameters.

    Example::

        >>> args = 'foo=bar; bar=baz; bla="blubb"'
        >>> for k, v in argparser(args):
        ...     print(k, v)
        foo bar
        bar baz
        bla blubb
    """
    state = 'keyname'
    current_key_name = []
    current_value = []  # type: List[str]
    charstream = iter(data)
    for char in charstream:
        if state == 'keyname':
            if char in ' \t\r\n;':
                continue  # ignore whitespace and other junk
            elif char == '=':
                state = 'value'
                continue
            else:
                current_key_name.append(char)
        elif state == "value":
            if char == ';':
                state = 'keyname'
                key = ''.join(current_key_name)
                value = ''.join(current_value)
                del current_value[::]
                del current_key_name[::]
                yield key, value
                continue
            elif char == '"':
                state = 'quoted_value'
                continue
            current_value.append(char)
        elif state == 'quoted_value':
            if char == '"':
                state = 'keyname'
                key = ''.join(current_key_name)
                value = ''.join(current_value)
                del current_value[::]
                del current_key_name[::]
                yield key, value
                continue
            current_value.append(char)
        else:
            raise ValueError('Unexpected parser state!')

    # return the remaining key/value pair
    if current_key_name and current_value:
        key = ''.join(current_key_name)
        value = ''.join(current_value)
        del current_value[::]
        del current_key_name[::]
        yield key, value
exhuma commented 5 years ago

If there are no objections to this code I will start a PR with this implementation.

Note that I have not done any performance tests in comparison with the existing regex implementation! This is the main point where I am reluctant to open a PR prematurely.

davidism commented 5 years ago

Thanks, but if possible I'd rather fix the regex.

exhuma commented 5 years ago

ok. I'll give it a shot

exhuma commented 5 years ago

Simply removing the white-space exclusion from one of the patterns was enough. This at least covers the existing unit-tests. I added a quoted value with whitespace and it passes.

I can monkey-patch this at work and see if it makes our test-suite pass for the project I am currently working on.

exhuma commented 5 years ago

Okay, the tests don't run yet. There's also a media-type parameter which contains a comma in its value. I'll keep working on it.

exhuma commented 5 years ago

I'm having trouble of getting this regex to work. My biggest two problems are that on the one hand I need to know when I am "inside" double quotes while also splitting the list of media-types by comma.

I have for now resorted to a "manual" parsing which is a bit different to the one I wrote above (that one was an old experiment).

I'm afraid my regex-foo is not strong enough to implement this as regex. I am using my parser as workaround for now which I use in stead of request.accept_mimetypes.

If anyone runs into this same problem, my parser is located in a gist

Note that this has some edge-cases still, especially I did not properly test extra junk white-space. Also some operations might be superfluous. But for now that gist works for me and covers all our internal test-cases. Also the functions quote_arg_value and unquote_arg_value are not symmetric which is I don't particularly like. But I will leave it like this for now.

davidism commented 1 year ago

While refactoring parse_options_header in #2614, I realized that for some reason we're using a different parser for Accept, even though it should use the same one. https://httpwg.org/specs/rfc9110.html#field.accept shows that it uses value; key=value; key=value ..., value2; .... This should use the list parser, then use the options parser on each item.

davidism commented 1 year ago

Looks like the reason for the weird parser was because previous specs allowed for things like text/html;charset=utf-8;q=1;extra=some, where charset was part of the media type, but extra was not. RFC 9110 drops support for extra parameters, and assumes q is last.