httpwg / http-extensions

HTTP Extensions in progress
https://httpwg.org/http-extensions/

Structured Headers: Lists as dictionary values. #816

Closed mikewest closed 5 years ago

mikewest commented 5 years ago

I kinda expect that this is a dupe, as I know there's been conversation on this topic (https://github.com/httpwg/http-extensions/issues/476 is close, for example). But I didn't find anything specifically on point, so I'm erring on the side of another issue.

The discussion of substructures in https://tools.ietf.org/html/draft-ietf-httpbis-header-structure-10#appendix-B.2 is somewhat unsatisfying, as it seems there are a number of cases in which a little support for additional structure would be helpful. In particular, allowing the values of dictionary members to be lists would open up a syntax for a few of the policy headers I'm somewhat responsible for that seems difficult to replicate otherwise.

Even the example in that section:

Example-Thing: name="Widget", cost=89.2, descriptions="foo bar"
Example-Description: foo; url="https://example.net"; context=123,
                     bar; url="https://example.org"; context=456

implicitly treats the descriptions member as a space-separated list which requires parsing logic above and beyond what structured headers themselves offer. It seems like a very common pattern, and one which would be hard to layer in after the fact in a v2 if we decide that syntax really would be nice.

I am 100% sure that y'all considered a syntax like:

Example-Dictionary-Header: ..., memberName=[thing1, thing2, thing3]

(which, of course, could also be used for lists of lists). I'm not sure why you rejected it. The discussion in #476 seemed to be about the narrow value of allowing parsing without context. I happen to believe that that's actually valuable, but I also think there's both semantic and aesthetic value in explicitly demarcating lists as a primitive that can exist basically anywhere sh-item exists in the current spec.

I'm happy to send a PR and add tests to https://github.com/httpwg/structured-header-tests/ if you'd accept them, conceptually. :)

mikewest commented 5 years ago

Examples. Yes. I should have added examples. Off the top of my head, in no particular order:

If more examples would be helpful, I'll spend a little more time digging.

mnot commented 5 years ago

I think it should be possible to do the same thing as we did in List of Lists with inner-list; it might even be best to refactor both List and Dictionary to allow inner-list | item as their payload. Would that work for you?

mikewest commented 5 years ago

I can accept anything that provides the technical capability, so reusing inner-list is certainly an option, and I do think it would work from the grammar's perspective. If that's acceptable to y'all, I'm happy to send a patch.

That said, it seems like it might be marginally confusing, as it creates only a one-character distinction between dictionaries and parameterized lists (label;param1;param2;param3 vs key=list1;list2;list3). It somewhat doubles-down on the notion that a header is unparseable a priori, and that you need to bring in knowledge of the expected types.

I would prefer a serialization that's comprehensible without that extra information (e.g. key=[list1,list2,list3]), but that's a distinct discussion. :)

mikewest commented 5 years ago

(Speaking of parameterized lists, a new syntax would allow us to use lists as a parameterized list's parameter's value, while reusing inner-list would be ambiguous.)

phluid61 commented 5 years ago

Doesn't this bring us back to a discussion from way back (which I'm pretty sure I haven't just imagined) that involved assuming all items were lists-of-items? Without a new sigil, do we assume all dictionary member values are lists? Or that all single-element lists are scalar items?

I'm starting to wonder if it's time to reconsider abandoning the familiar , and ; infix list notation for HTTP/1 serialisation, and maybe look at something more C-inspired? (With just enough poison pills to make sure it doesn't work in a JSON parser, of course)

mnot commented 5 years ago

No, it leaves us where we are -- you need to understand the "master" type of a header field to know how to get it into the correct data structure.

phluid61 commented 5 years ago

Dictionary-with-items vs dictionary-with-lists as top-level types?

mikewest commented 5 years ago

I'm poking at the parsing and serialization algorithms, and discovering that I really don't like the magical nature of the result. "If the dictionary's this_key value is a list, ... Otherwise fail parsing." is going to make it very hard to write a generic parser, which seemed to me to be part of the point of moving to structured headers in the first place. This, of course, isn't a problem introduced by this change, but a somewhat fundamental assumption in the design. Since y'all have closed out https://github.com/httpwg/http-extensions/issues/476#issuecomment-376775762, is there any point in bringing it up again? If so, would you prefer it there or on the list?

I'll finish this patch just so y'all can see what it looks like, but I do wonder about the strategy.

mikewest commented 5 years ago

assuming all items were lists-of-items

Parameterized lists' syntax seems to make this difficult, as key;value1=value2;value3 might be reasonably interpreted as key(value1=value2, value3) or key(value1=[value2, value3])

phluid61 commented 5 years ago

assuming all items were lists-of-items

Parameterized lists' syntax seems to make this difficult, as key;value1=value2;value3 might be reasonably interpreted as key(value1=value2, value3) or key(value1=[value2, value3])

I'm confused.

Let me rephrase, in point form:

Or are you saying that I have to pass in even more metadata to say, "if you see a dictionary item key 'foo', parse its value as a list"? Because that's horrible.

phluid61 commented 5 years ago

(I regret my use of the word "item")

mikewest commented 5 years ago

Or are you saying that I have to pass in even more metadata to say, "if you see a dictionary item key 'foo', parse its value as a list"? Because that's horrible.

That's what the PR I put up says, and I think it's in-line with the general strategy this draft takes. As I suggested above, I think it will be hard to write a generic parser if that's the result we end up with.

That said, the approach you suggest would be simpler: if we just assume that every dictionary member's value is a list, then x=y parses unambiguously as {"x": ["y"]}, and x=y;z as {"x": ["y", "z"]}. That would make it impossible to have a scalar dictionary value, but perhaps that's a reasonable tradeoff, as it gives consumers a clear contract?

I worry about backtracking in the case where a given member's value is expected to be a list, but we turn it into a scalar if it happens to have a single item. That seems confusing for consumers.

Making that type information explicit in the serialization would be a way of avoiding that confusion. I'd be in favor of that.
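
For illustration, the "every dictionary member's value is a list" reading described above can be sketched in Python. This is only a minimal sketch under stated assumptions, not the draft's parsing algorithm: the function name `parse_dict_all_lists` is hypothetical, and values are treated as bare tokens (no quoted strings, numbers, or binary content).

```python
from __future__ import annotations


def parse_dict_all_lists(value: str) -> dict[str, list[str]]:
    """Parse 'k=v;v, k=v' so that every member value is a list of tokens.

    Hypothetical helper for illustration; tokens only, no quoting rules.
    """
    result: dict[str, list[str]] = {}
    for member in value.split(","):
        key, sep, raw = member.strip().partition("=")
        if not sep or not key or not raw:
            raise ValueError(f"malformed dictionary member: {member!r}")
        # ';' separates list items; a single item still yields a list.
        items = [item.strip() for item in raw.split(";")]
        if any(not item for item in items):
            raise ValueError(f"empty list item in member: {member!r}")
        result[key] = items
    return result


print(parse_dict_all_lists("x=y"))    # {'x': ['y']}
print(parse_dict_all_lists("x=y;z"))  # {'x': ['y', 'z']}
```

The consumer never has to guess between a scalar and a list, at the cost of always unwrapping a one-element list for scalar-like members.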

phluid61 commented 5 years ago

Or are you saying that I have to pass in even more metadata to say, "if you see a dictionary item key 'foo', parse its value as a list"? Because that's horrible.

That's what the PR I put up says, and I think it's in-line with the general strategy this draft takes.

Only for top-level types (see #476), which was why I asked if Mark was suggesting we create dictionary-of-scalars and dictionary-of-lists types. It's #443 all over again, again.

mikewest commented 5 years ago

Only for top-level types (see #476), which was why I asked if Mark was suggesting we create dictionary-of-scalars and dictionary-of-lists types. It's #443 all over again, again.

I see.

I think splitting dictionary's behavior into scalar-only or list-only is a little strange, as I fully expect headers (like some of those above) to treat dictionaries as containing both, depending on the key. I'd like it to be the case that that's possible. I agree with you that it's not (without a side-channel of metadata) unless we have distinct syntax. That sounds fine to me, FWIW (and it would sound even better to me if the top-level type was also identifiable. :) ).

I suggested a somewhat obvious label=[thing1, thing2, thing3] syntax above. That seems to be both quite legible for humans, and quite parseable for machines.

kazuho commented 5 years ago

Kind of related, I think it would be beneficial to allow having a dictionary that contains both scalar and compound values. Like a=1, c=d=1;e=2 where c is a compound value consisting of a dictionary with two keys: d and e.

I'd assume that lists are expected to provide extensibility. Consider Cache-Control. The parameters (i.e., dictionary members) are added as time goes on. In the future, we might want to introduce a parameter that takes a compound value as an argument.

However, IIRC, that's forbidden by the current specification.

mnot commented 5 years ago

I was thinking that we could just have Dictionary and List (getting rid of List of Lists), and for each of them allow members to be items or inner-lists.

Syntactically, all that we'd need to do is to specify that if there's a single-item list, it's serialised as having a trailing ; -- e.g.,

Foo-Dict-Header: a=b, c=d;e, f=g;, h=i

which maps to something like:

{
  'a': 'b',
  'c': [ 'd', 'e' ],
  'f': [ 'g' ],
  'h': 'i'
}

Of course there's other ways to denote this, but this seems the most minimal / natural way to do it. The only thing I suspect is going to be awkward is swapping out the type of an item to an array when a ; is encountered; it will require either scanning forward or changing types dynamically (but implementations will still be able to optimise as they please).
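
As a rough illustration of the trailing-`;` idea, here is a minimal Python sketch that produces the mapping shown above. It rests on assumptions: the names `parse_member_value` and `parse_dict_trailing_semicolon` are hypothetical, values are bare tokens only, and the sketch happens to tolerate a trailing `;` on multi-item lists, which the proposal does not settle.

```python
def parse_member_value(raw):
    """Return a token, or a list of tokens once any ';' is present."""
    if ";" not in raw:
        return raw.strip()
    parts = [part.strip() for part in raw.split(";")]
    if parts[-1] == "":
        # A trailing ';' marks a list, which allows a single-item list.
        # Assumption: it is also tolerated after a multi-item list.
        parts.pop()
    if not parts or any(part == "" for part in parts):
        raise ValueError(f"malformed list value: {raw!r}")
    return parts


def parse_dict_trailing_semicolon(value):
    """Parse 'a=b, c=d;e, f=g;' into a dict of tokens and token lists."""
    result = {}
    for member in value.split(","):
        key, sep, raw = member.strip().partition("=")
        if not sep or not key or not raw.strip():
            raise ValueError(f"malformed dictionary member: {member!r}")
        result[key] = parse_member_value(raw)
    return result


print(parse_dict_trailing_semicolon("a=b, c=d;e, f=g;, h=i"))
# {'a': 'b', 'c': ['d', 'e'], 'f': ['g'], 'h': 'i'}
```

Note that the type of a member is only known after scanning its whole value for `;`, which is the "scanning forward" cost mentioned above.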

Thoughts?

kazuho commented 5 years ago

@mnot My +1 goes to the approach you propose.

Because that's exactly the generalised syntax for the header field strings that we have now. In my view, Structured Headers is designed as a generalization of header field notations that exist today.

martinthomson commented 5 years ago

@mnot are you operating in knowledge of the schema for 'f', or not? It seems like this is attempting to split the difference somehow and not understanding the framework into which this fits is problematic.

The draft seems to want to say that types are explicit in the syntax. That's cool, but is that really necessary here? If you always have lists as the value of dictionary members, you don't need to answer this question (or have a trailing ';'). You only need to allow for that in the processing of certain fields where only a single (or no) value is permitted.

That is, Foo-Dict-Header: a=b, c=d;e is all you need and that maps to { 'a': ['b'], 'c': ['d', 'e'] }. If processing 'a' assumes a single value, you start by checking the length of the value and rejecting it if a.isPresent() && a.len() != 1.
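
The cardinality check described above might look like the following in Python. This is a hedged sketch: `single_value` is a hypothetical accessor name, and the input is assumed to be a dictionary whose values are already lists, as in the `{ 'a': ['b'], 'c': ['d', 'e'] }` mapping.

```python
from __future__ import annotations

from typing import Optional


def single_value(members: dict[str, list[str]], key: str) -> Optional[str]:
    """Return the only value for `key`, or None if the key is absent.

    Hypothetical accessor: rejects the member when it is present but
    does not hold exactly one value.
    """
    values = members.get(key)
    if values is None:
        return None
    if len(values) != 1:
        raise ValueError(
            f"member {key!r} must have exactly one value, got {len(values)}"
        )
    return values[0]


parsed = {"a": ["b"], "c": ["d", "e"]}
print(single_value(parsed, "a"))        # b
print(single_value(parsed, "missing"))  # None
```

Calling `single_value(parsed, "c")` raises `ValueError`, which is the rejection path for a field that permits only one value.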

mnot commented 5 years ago

@martinthomson I'm not, beyond the top-level type.

I see what you're saying about making everything a list. One of my goals is to make specs using SH as error-proof as possible, and so I was planning on saying something like "unless a specification using SH explicitly allows an array as a value, assume an array is an error condition" or similar.

That would avoid the necessity of them each and always having to say "If there is more than one x, error..." -- and inevitably missing one instance.

That said, I don't dispute that we could make the data model work that way and omit the trailing ; -- but we'd have to get the prose right for consuming specs. Some examples would probably help.

martinthomson commented 5 years ago

Maybe if the API requires an explicit opt-in to get a list, and generates an error if that opt-in isn't present and >1 options appear, then you might get what you want.

mnot commented 5 years ago

We don't specify the API for parsers at that level, but we could put something like that in implementation advice...

phluid61 commented 5 years ago

I was thinking that we could just have Dictionary and List (getting rid of List of Lists), and for each of them allow members to be items or inner-lists.

[...]

Of course there's other ways to denote this, but this seems the most minimal / natural way to do it. The only thing I suspect is going to be awkward is swapping out the type of an item to an array when a ; is encountered; it will require either scanning forward or changing types dynamically (but implementations will still be able to optimise as they please).

Thoughts?

Three:

  1. I think a trailing ; looks like a serialisation error.
  2. How long before someone files a bug somewhere that x=a;b; "doesn't work but should"? Or y,?
  3. Does this and #817 mean we should make z=; valid?

Regarding the discussion about implementation guidance and APIs and such; how is array-vs-item (or array-of-one-vs-array-of-many) any different from string-vs-integer?

mnot commented 5 years ago

@phluid61

  1. I don't disagree. It's less ugly than anything else I could think of. However, @martinthomson's approach avoids all of that.
  2. We could write the algorithms to account for that. But if they're using a serialiser, they won't hit that, and if they're not, the parsers should at least behave in a consistent way.
  3. No; #817 is about top-level lists, not these.

mnot commented 5 years ago

@martinthomson thinking a bit more, I think a proposal along the lines you're discussing would be:

  1. All dictionary parameter values are lists.
  2. List and list of lists would still be separate types (otherwise, everything would be a list of lists).

I have to say I'm not crazy about either of these; the former for reasons explained (I suspect we could come up with something, but it's going to be awkward), and the latter because then we'll have two different ways to handle inner lists.

We could also come up with a Dictionary of Lists type, but that would not fit well with the common use case, which is that some dictionary items will be singular types, others will be lists.

To me, the thing we should optimise for here is the data model -- it's going to persist longer than the serialisation, if we do this right.

So IME we're back at the start -- we need some sigil in the serialisation that indicates whether a dictionary (and list?) item is an array.

I proposed using the presence of ; (with a trailing ; if the array has only one member). That's not the only way to do it, of course, but it does seem "natural" in headers (with the exception of a one member array, which I personally think is acceptable).

Are there other suggestions? I'd like to ship a new draft by the deadline (Monday), so unless we can resolve this quickly, I'm inclined to ship ; with a note asking for feedback.

martinthomson commented 5 years ago

For me, I'd be comfortable with lists in all these places. An accessor function for a single value could generate an error if the cardinality is wrong. That's not at all difficult to implement. It also avoids problems: where x=a; is presented as two values, one being the empty string; and x=a is still the predominant form for a list of one item, which you risk invalidating as a valid client of this doc by requiring decoration.

I agree that the data model is most important here, but the question is whether the canonical model that we operate from is list of "things" or list of lists (and dict of "things" or dict of lists). List of lists is a more limited model, from which you can get to the stricter forms as needed. The question is whether you need that explicitly signaled on the wire, or whether it is sufficient for a generic processor to be ignorant of certain distinctions of type.

Part of the goal of this doc is to extend the role of the generic processor somewhat. But I don't think that this completely rules out contextual handling of data, and nor could it ever. So you have to pick where to draw the line.

mnot commented 5 years ago

See PR #824 for a roughed-in proposal.

mnot commented 5 years ago

@martinthomson yes, but as I said above, specs would still need to be written very precisely to avoid error, e.g.,

Use the single value accessor (defined in {{}}) to retrieve a value from the "foo" member. If it is not an integer, fail parsing; otherwise...

I think that's going to trip up spec authors; they'll forget it and then we'll be in the Wild West. Compare:

If the "foo" member is not an integer, fail parsing; otherwise...

Simples.

Also, with the "everything is a list" approach, it implies that all list items are arrays, which is tedious in the common case.

mikewest commented 5 years ago

@mnot: The ; (and trailing ;) proposal is fine if that's what y'all want to run with. I'm happy to see that we're going to be able to support the use case, and I can live with whatever syntax y'all can agree upon.

That said, it really is less legible (at least to humans (who matter!)) than the [...] syntax suggested several times in this thread. That really does seem to me to be the most obvious and unambiguous annotation (and I'd note that we're all using that syntax in prose to explain to each other what we mean by various other proposals). x=[y] is, IMO, significantly more clear as a list value than x=y;.

Can you help me understand why you're ignoring it? :)

mnot commented 5 years ago

@mikewest there are two reasons; one is that we want this to look as much like "traditional" HTTP headers as possible (semicolons are often used for substructure in list-based headers). The other is that we don't want this to look like JSON or another existing format; if it does, people might think that the same conventions apply, or that they can use an existing parser.

phluid61 commented 5 years ago

FWIW, I'm all on board with "not JSON", but I keep feeling the nostalgia factor is causing us grief. There is a third way: make up a whole new syntax. It could work, as long as everyone hates it equally.

mnot commented 5 years ago

@phluid61 I hear you, but I also want to ship this. We're very close, if we can get these last few issues closed.

mikewest commented 5 years ago

@mnot: Thanks! As someone who ends up minting a lot of headers, I don't think I buy either of those reasons (and, honestly, would have preferred JSON in the first place!). Trailing semicolons are wacky, and I still think that some kind of bracketing ([...], {...}, <...>, (...), «...», ...) syntax is significantly more comprehensible. Still, as long as the functionality is in place, I will use whatever spelling y'all land on.

mnot commented 5 years ago

OK, so let's walk down that path a bit and see where it takes us.

If we do this, we need brackets, and we also need delimiters between the items. Those can't be any already used delimiter, so I think that leaves us with ; and whitespace.

That gives us a few permutations to consider (first one being just semicolons, as in the PR):

Test-Header: a=b, c=d;e, f=g;, h=i
Test-Header: a=b, c=(d;e), f=(g), h=i
Test-Header: a=b, c=(d e), f=(g), h=i

Preferences? Remember that these are all single-character tokens, but IRL they'll be integers, floats, binary, quote-delimited strings, etc.

I've chosen parens here because [...] looks like JSON arrays (see above), <...> I'd like to reserve for links down the road, and «...» isn't ASCII. Thoughts about that?

phluid61 commented 5 years ago

OK, so let's walk down that path a bit and see where it takes us.

If we do this, we need brackets, and we also need delimiters between the items. Those can't be any already used delimiter, so I think that leaves us with ; and whitespace.

Why not? With paren-matching and a stack-based algorithm (which we have), it could also be COMMA.

That gives us a few permutations to consider (first one being just semicolons, as in the PR):

Test-Header: a=b, c=d;e, f=g;, h=i
Test-Header: a=b, c=(d;e), f=(g), h=i
Test-Header: a=b, c=(d e), f=(g), h=i
Test-Header: a=b, c=(d,e), f=(g), h=i

Also the non-dictionary list counterparts:

Test-Hodor: a, b;c, d;, e
Test-Hodor: a, (b;c), (d), e
Test-Hodor: a, (b c), (d), e
Test-Hodor: a, (b,c), (d), e

Preferences? Remember that these are all single-character tokens, but IRL they'll be integers, floats, binary, quote-delimited strings, etc.

I think I like parens and semicolons best. It gives us a more straight-forward streaming parser (identify by first character), a little bit of truncation detection, a deliberate and visual separator, and resistance against the weird "auto-concatenation with commas" issue that currently only (potentially) affects strings.

I still have a question about empty inner-lists. () is even easier to generate and expect to be valid than a lonely ; -- should they be allowed in the data model? If not: the parsing algorithm would have another edge-case to detect.
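
To make the "identify by first character" point concrete, here is a minimal Python sketch of the parens-and-semicolons variant for dictionaries. Everything in it is an assumption for illustration: the names `parse_value` and `parse_dict_parens` are hypothetical, values are bare tokens, and `()` is accepted as an empty inner list even though that question is still open.

```python
def parse_value(raw):
    """Return a token, or a list when the value starts with '('."""
    raw = raw.strip()
    if raw.startswith("("):
        if not raw.endswith(")"):
            # A missing ')' is a cheap signal that the value was truncated.
            raise ValueError(f"unterminated inner list: {raw!r}")
        inner = raw[1:-1].strip()
        if inner == "":
            return []  # assumption: '()' is a valid empty inner list
        items = [item.strip() for item in inner.split(";")]
        if any(not item or "(" in item or ")" in item for item in items):
            raise ValueError(f"malformed inner list: {raw!r}")
        return items
    if not raw or any(ch in raw for ch in "();"):
        raise ValueError(f"malformed item: {raw!r}")
    return raw


def parse_dict_parens(value):
    """Parse 'a=b, c=(d;e), f=(g)' into a dict of tokens and lists."""
    result = {}
    for member in value.split(","):
        key, sep, raw = member.strip().partition("=")
        if not sep or not key:
            raise ValueError(f"malformed dictionary member: {member!r}")
        result[key] = parse_value(raw)
    return result


print(parse_dict_parens("a=b, c=(d;e), f=(g), h=i"))
# {'a': 'b', 'c': ['d', 'e'], 'f': ['g'], 'h': 'i'}
```

Splitting the top level on `,` is only safe here because the inner delimiter is `;`; the `(d,e)` variant would need paren matching instead.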

I've chosen parens here because [...] looks like JSON arrays (see above), <...> I'd like to reserve for links down the road, and «...» isn't ASCII. Thoughts about that?

:+1:

mikewest commented 5 years ago

Thanks for taking another look at this!

Why not? With paren-matching and a stack-based algorithm (which we have), it could also be COMMA.

Or |. Or /. Or :. ASCII is full of options. And just imagine the possibilities extended ASCII could bring! :)

I agree with both of you, though, that parens and semicolons seem like a reasonable choice.

I still have a question about empty inner-lists.

() as an explicit signal of an empty list makes a good deal of sense to me.

Also the non-dictionary list counterparts:

And, for completeness, parameterized lists?

Test-Header: a;b=(c;d);e, f;g=(h);i=(j;k)

phluid61 commented 5 years ago

Test-Header: a;b=(c;d);e, f;g=(h);i=(j,k)
                                      ^ ?

mikewest commented 5 years ago

Good eye! That was a typo. :( I've corrected it.

mnot commented 5 years ago

If we think we want to use this for parameterised lists too, we should make the internal delimiter WSP, to avoid visual confusion.

Consider:

Test-ParamList: abc; d=e; f=(g;h;i;j); k; l=(m;n); o, pqr; s=t

vs.

Test-ParamList: abc; d=e; f=(g h i j); k; l=(m n); o, pqr; s=t

mnot commented 5 years ago

I kind of like the above, in that the original example "foo bar" just becomes (foo bar)...
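
A minimal Python sketch of how the whitespace-delimited form could be read for a parameterised list follows. It is an illustration under assumptions, not the PR's algorithm: `parse_param_value` and `parse_param_list` are hypothetical names, all values are bare tokens, and a parameter without `=` is mapped to `None`.

```python
def parse_param_value(raw):
    """Return a token, or a list for a '(a b c)' inner list."""
    raw = raw.strip()
    if raw.startswith("(") and raw.endswith(")"):
        items = raw[1:-1].split()  # whitespace is the inner delimiter
        if any("(" in item or ")" in item for item in items):
            raise ValueError(f"malformed inner list: {raw!r}")
        return items
    if not raw or "(" in raw or ")" in raw:
        raise ValueError(f"malformed parameter value: {raw!r}")
    return raw


def parse_param_list(value):
    """Parse 'abc; d=e; f=(g h), pqr; s=t' into (name, params) pairs."""
    result = []
    for member in value.split(","):
        pieces = [piece.strip() for piece in member.split(";")]
        name = pieces[0]
        if not name:
            raise ValueError(f"missing identifier in member: {member!r}")
        params = {}
        for piece in pieces[1:]:
            key, sep, raw = piece.partition("=")
            key = key.strip()
            if not key:
                raise ValueError(f"missing parameter name in: {member!r}")
            # Assumption: a valueless parameter is recorded as None.
            params[key] = parse_param_value(raw) if sep else None
        result.append((name, params))
    return result


header = "abc; d=e; f=(g h i j); k; l=(m n); o, pqr; s=t"
for name, params in parse_param_list(header):
    print(name, params)
```

Because `;` never appears inside an inner list in this form, splitting each member on `;` separates parameters without any paren tracking.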

mikewest commented 5 years ago

I think I prefer explicit separation via a visible delimiter. But that's only a preference, and I'm happy to defer to yours.

mnot commented 5 years ago

Whitespace is the most visible delimiter :)

I've updated the PR; PTAL. I haven't yet implemented or written tests for this, so it may need some adjustments, but I'd like to get this general approach agreed to first.