Closed exhuma closed 1 year ago
For reference, this is the parser I currently use in a different project (it does not yet handle escaped double-quotes though as I'd need a lookbehind):
def argparser(data: str) -> Generator[Tuple[str, str], None, None]:
"""
Given a string like ``foo=bar; bar=baz; bla="blubb"``, produce a stream of
key/value pairs.
This parser is used to parse RFC-2045 (Section 5.1) compliant mimetype
parameters.
Example::
>>> args = 'foo=bar; bar=baz; bla="blubb"'
>>> for k, v in argparser(args):
... print(k, v)
foo bar
bar baz
bla blubb
"""
state = 'keyname'
current_key_name = []
current_value = [] # type: List[str]
charstream = iter(data)
for char in charstream:
if state == 'keyname':
if char in ' \t\r\n;':
continue # ignore whitespace and other junk
elif char == '=':
state = 'value'
continue
else:
current_key_name.append(char)
elif state == "value":
if char == ';':
state = 'keyname'
key = ''.join(current_key_name)
value = ''.join(current_value)
del current_value[::]
del current_key_name[::]
yield key, value
continue
elif char == '"':
state = 'quoted_value'
continue
current_value.append(char)
elif state == 'quoted_value':
if char == '"':
state = 'keyname'
key = ''.join(current_key_name)
value = ''.join(current_value)
del current_value[::]
del current_key_name[::]
yield key, value
continue
current_value.append(char)
else:
raise ValueError('Unexpected parser state!')
# return the remaining key/value pair
if current_key_name and current_value:
key = ''.join(current_key_name)
value = ''.join(current_value)
del current_value[::]
del current_key_name[::]
yield key, value
If there are no objections to this code I will start a PR with this implementation.
Note that I have not done any performance tests in comparison with the existing regex implementation! This is the main point where I am reluctant to open a PR prematurely.
Thanks, but if possible I'd rather fix the regex.
ok. I'll give it a shot
Simply removing the white-space exclusion from one of the patterns was enough. This at least covers the existing unit-tests. I added a quoted value with whitespace and it passes.
I can monkey-patch this at work and see if it makes our test-suite pass for the project I am currently working on.
Okay, the tests don't run yet. There's also a media-type parameter which contains a comma in its value. I'll keep working on it.
I'm having trouble of getting this regex to work. My biggest two problems are that on the one hand I need to know when I am "inside" double quotes while also splitting the list of media-types by comma.
I have for now resorted to a "manual" parsing which is a bit different to the one I wrote above (that one was an old experiment).
I'm afraid my regex-foo is not strong enough to implement this as regex. I am using my parser as workaround for now which I use in stead of request.accept_mimetypes
.
If anyone runs into this same problem, my parser is located in a gist
Note that this has some edge-cases still, especially I did not properly test extra junk white-space. Also some operations might be superfluous. But for now that gist works for me and covers all our internal test-cases. Also the functions quote_arg_value
and unquote_arg_value
are not symmetric which is I don't particularly like. But I will leave it like this for now.
While refactoring parse_options_header
in #2614, I realized that for some reason we're using a different parser for Accept
, even though it should use the same one. https://httpwg.org/specs/rfc9110.html#field.accept shows that it uses value; key=value; key=value ..., value2; ...
. This should use the list parser, then use the options parser on each item.
Looks like the reason for the weird parser was because previous specs allowed for things like text/html;charset=utf-8;q=1;extra=some
, where charset
was part of the media type, but extra
was not. RFC 9110 drops support for extra parameters, and assumes q is last.
According to RFC 2045 Section 5.1, mime-type parameters can be quoted. As additional reference, the
quoted-string
is defined in RFC 7230 Section 3.2.6 (which is the same as in RFC 2616 Section 2.2)This means that the following media-type is a valid media-type string:
Currently,
werkzeug
sees this asfictitious/media-type; myparam="some
, which includes the first double-quote, but drops the rest. Additionally, the parsedMIMEAccept
instance also contains an entry for('value"', 1)
. In other words, the above media-type results in the following two entries in theMIMEAccept
instance:('fictitious/media-type; myparam="some', 1)
('value"', 1)
Appending
q=0.5
results in this:('fictitious/media-type; myparam="some', 1)
('value"', 0.5)
I would expect it to contain just one entry as
('fictitious/media-type; myparam="some value"', 1)
We currently get one such media-type in an accept header and cannot use the normal werkzeug "accept" mechanism which makes this cumbersome. I just discovered this so I still need to write a workaround. I am considering appending the solution as a PR to this ticket. I already have a compliant parser lying around, which is based on a very simple state-based parser.