smithy-lang / smithy

Smithy is a protocol-agnostic interface definition language and set of tools for generating clients, servers, and documentation for any programming language.
https://smithy.io
Apache License 2.0

Relax `map` shape validation to allow for other key types than "string" #2111

Open Baccata opened 10 months ago

Baccata commented 10 months ago

(Screenshot attached: 2024-01-24 18:18)

Currently, Smithy maps can only take string shapes as keys. Considering the protocol-agnostic nature of Smithy, this is somewhat arbitrary and prevents its use in some legitimate use cases.

Intuitively, I'd guess this limitation is driven by the fact that it makes sense in the context of AWS protocols and tools. In Smithy itself, having arbitrary shapes as map keys may not play well with the current canonical "object-node" representation of maps (the representation used by traits in particular, but also by other things).

I personally think the limitation is solvable:

  1. protocol-specific validators could be created to enforce the same limitation in the context of AWS protocols
  2. the canonical object-node representation of maps could be used for many more key types than just strings (numerics in particular)
  3. for more complex shapes, a canonical representation of [{"key": ???, "value": ???}] could be used (see the sketch after this list)
  4. alternatively to 3, the constraint on map keys could be relaxed to incorporate strings/numerics/booleans/timestamps/bytes/enums/intEnums but exclude structures/unions/documents/blobs.
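
To make options 2 and 3 a bit more concrete, here is a minimal sketch; the shapes and values below are made up for illustration and are only valid if the key constraint were relaxed:

map StatusMessages {
  key: Integer      // option 2: keep the object-node form, e.g. {"404": "Not Found"}
  value: String
}

map SchedulesByDate {
  key: Timestamp    // option 3: use an entry-list node form, e.g.
  value: String     // [{"key": "1985-04-12T23:20:50.52Z", "value": "launch"}]
}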
mtdowling commented 9 months ago

I can expand on some of the reasons behind this limitation.

Maps right now support string and enum keys. I wouldn't characterize this limitation as arbitrary, but rather as a deliberate simplification in the model that in turn simplifies code generators (e.g., no need to worry about making every generated type hashable) and protocols (like you mentioned).
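
For reference, a small sketch of what that allows today (shape names made up):

enum Color {
  RED
  GREEN
  BLUE
}

// Both of these are valid today; enum is a specialization of string.
map TagsByName {
  key: String
  value: String
}

map LabelsByColor {
  key: Color
  value: String
}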

There hasn't been a lot of demand or use cases over the years (inside or outside of Amazon) to expand maps to support other key types. I'd expect that if such a problem occurred, the model could adopt a list of structs approach to avoid any protocol or programming language pitfalls. If a generator really needed to make this convert to a map in a target programming language, I suppose a generator could introduce their own custom traits to facilitate this as well.
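
As a rough sketch of that list-of-structs approach (every shape and trait name here is made up for illustration), an integer-keyed "map" could be modeled as a list of entry structures, optionally marked with a custom trait that a generator could use to emit a real map type in the target language:

// Custom trait a generator could opt into (hypothetical).
@trait(selector: "list")
structure generateAsMap {}

structure Entry {
  @required
  key: Integer

  @required
  value: Blob
}

@generateAsMap
list Entries {
  member: Entry
}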

Supporting all types as map keys (e.g., structures, unions, etc) isn't something we'd do since that brings in new requirements to code generators that every generated type can be used in their programming language's map type (this is a protocol agnostic concern and unrelated to what protocols AWS supports).

Types that aren't easily comparable or hashable like floats, doubles, documents (since it's sort of arbitrary), and probably timestamps won't work across languages for map keys either (e.g., Rust doesn't let you use floats as a map key).

So that leaves potentially adding support for non-float numbers (byte, short, integer, long, bigInteger, bigDecimal), intEnum, blob, and boolean.

Baccata commented 9 months ago

Hey Michael, thank you very much for providing all this context, this is very helpful.

> Supporting all types as map keys (e.g., structures, unions, etc) isn't something we'd do since that brings in new requirements to code generators that every generated type can be used in their programming language's map type (this is a protocol agnostic concern and unrelated to what protocols AWS supports).
>
> Types that aren't easily comparable or hashable like floats, doubles, documents (since it's sort of arbitrary), and probably timestamps won't work across languages for map keys either (e.g., Rust doesn't let you use floats as a map key).

I think this argument is sensible: being mindful of the various languages' semantics and catering to the lowest common denominator of what programming languages allow, in the context of an IDL, makes sense. The proto IDL has similar limitations in place.

This leaves us with the rest:

> So that leaves potentially adding support for non-float numbers (byte, short, integer, long, bigInteger, bigDecimal), intEnum, blob, and boolean.

For the sake of the debate, let's put aside blobs and booleans.

Numbers

> numbers: anyone building a JSON protocol would either have to forbid map keys of numbers, serialize numbers as strings, or define some kind of automatic strategy to flip from using a true JSON object to using a list of objects. Serializing them as strings isn't as clear-cut as it sounds since you'd have to deal with all the different ways numbers can be written and maybe define a canonical format (e.g., { "1": "a", "01": "b", "1e0": "c" }).

This is where the main point of contention lies, I think: these are protocol-specific arguments used to support why a protocol-agnostic IDL shouldn't necessarily support a feature. And although these are good explanations as to why you'd not want to support numbers as map keys in the context of AWS protocols, I'd argue that it's the protocol designer's prerogative to decide whether to support certain shapes or not, and what the serialisation/routing semantics should be when they are supported.

> Sure, protocols can forbid these keys, but these kinds of protocol limitations often come back to bite us over time as teams add support for new protocols.

I'm not quite sure how to respond to that, to be honest. I can certainly empathise with the maintenance burden (which this very discussion is contributing to, sorry about that 😅), but I don't think it makes the request illegitimate, in particular if it puts Smithy in a better place to capture the semantics of some binary protocols (see the next section for a motivating example).

Moreover, there is a precedent, in that the smithy.api#protocolDefinition trait has a noInlineDocumentSupport property. The presence of document in the set of "simple types" is, I think, a brilliant decision, and it would have been a shame not to include the concept of document in the IDL just because it doesn't have a "natural" translation in XML protocols.

(Full disclosure: I think the noInlineDocumentSupport property is a miss; a selector would have been much more flexible, as some protocols could decide that Blobs should not be used ...)
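
To illustrate what I mean by selector-based flexibility: model authors can already approximate this kind of restriction with a metadata validator, and a protocol definition could in principle carry something similar. A minimal sketch, assuming a made-up protocol trait example#fakeProtocol and using the built-in EmitEachSelector validator:

metadata validators = [
  {
    name: "EmitEachSelector"
    id: "FakeProtocolNoDocuments"
    message: "example#fakeProtocol does not support document shapes"
    configuration: {
      // Match any document shape reachable from a service carrying the protocol trait.
      selector: "service[trait|example#fakeProtocol] ~> document"
    }
  }
]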

> I'd also worry about the ramifications of suddenly requiring that bigInteger or bigDecimal have to be hashable now.

I don't know enough about their representation in non-JVM languages to comment on this. Probably safe to ignore for now.

Context of this issue

To elaborate on the relevance of the issue, here's some context/motivation: we have elected to support gRPC as a protocol in our internal Smithy-related offering (both at the IDL level and in our runtime libraries). This allows applications to offer different means of interaction to callers at a very reduced cost to our service engineers, who'd be able to enable gRPC in their applications with a couple of lines of code (as opposed to using separate code generators depending on the protocol). Overall, using Smithy to describe gRPC interactions and protobuf serialisation works pretty well, and really fits the "Smithy as a source of truth" paradigm that my team is pushing for in our organisation.

However, one thing that has come up recently is a desire for teams to capture "passthrough data". For instance, in the context of a JSON protocol, the requirement is essentially to provide a means to capture "unknown fields" from received JSON objects. We can easily amend our protocol with an additional @unknownJsonFields trait (name TBD) that could annotate a map, signaling to the deserialiser that any JSON fields it doesn't recognise should be stored in the map rather than discarded. Here's what it'd look like:

map OtherInfo {
  key: String
  value: Document
}

structure Person {
  @required
  firstName: String
  @required
  lastName: String

  /// This captures any inlined unknown field, as opposed to expecting an `otherInfo` JSON field
  @unknownJsonFields
  otherInfo: OtherInfo
}

This would allow for capturing the entirety of a JSON payload like {"firstName": "John", "lastName": "Doe", "age": 34}, which is useful for services that need to operate on a subset of a payload but pass the rest through.

Now this is all good and well for JSON. However, in the context of Protobuf, the equivalent of the OtherInfo type would be:

map OtherInfo {
  key: Integer
  value: Blob
}

This is currently impossible due to the limitation that map keys must be strings.

The case of retro-fitting existing protocols into Smithy

I think where Smithy shines is that it allows for the retro-fitting of existing protocols.

This is exemplified by our alloy library, which captures various bits of semantics that come from common REST/JSON API patterns that the official Smithy prelude (smithy.api) doesn't capture: things like alternative JSON encodings for unions, or ad-hoc constraints that can be described in OpenAPI (the uuid format, for instance) that we wanted to retain when moving towards Smithy.

Another example is the Build Server Protocol, which was initially described as a collection of markdown documents and now uses a Smithy specification as the source of truth for documentation and bespoke code generators alike. The BSP is essentially an "extension" of the LSP, and therefore some bespoke JSON-RPC-related protocol semantics needed to be captured.

I also know of companies that are using Smithy as a source of truth for complex data pipelines involving both Google BigQuery tables and Elasticsearch indices.

My point here is that Smithy really shines at capturing the semantics of existing ad-hoc protocols in the wild, and that the flexibility of its core semantics plays a large part in this. It's something that I think is worth leaning into, as it contributes to the language becoming more popular in the industry.

Allowing integer/long map keys would make it possible to exhaustively capture protobuf semantics (and probably the semantics of other binary protocols), which is valuable in the context of my organisation, but probably also in the context of the wider ecosystem.

mtdowling commented 9 months ago

Thanks for laying out your use case.

I think expanding maps to support integer and long keys isn't too controversial. Do you have a hard requirement for long right now or would just integer suffice?

As a baby step of progress, I was inspired by your fair critique of noInlineDocumentSupport and added the ability to constrain shape closures based on the presence of a trait. This would deprecate noInlineDocumentSupport and be used to automatically validate that a service with a protocol trait adheres to whatever limitations or requirements a protocol might have: https://github.com/smithy-lang/smithy/pull/2156.
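
Roughly, the idea is that a protocol trait itself could declare which shapes are allowed in its closure, along the lines of the following sketch (see the PR for the exact trait name and properties; example#fakeProtocol is made up):

// Sketch: a protocol trait that forbids document shapes anywhere in its closure.
@trait(selector: "service")
@protocolDefinition
@traitValidators(
  "fakeProtocol.NoDocuments": {
    selector: "~> document"
    message: "This protocol does not support document shapes"
  }
)
structure fakeProtocol {}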

Baccata commented 9 months ago

> Do you have a hard requirement for long right now or would just integer suffice?

At this point in time, I'm only interested in integers, but I think allowing longs too would let the Smithy metamodel fully supersede the proto metamodel, which is a nice thing to be able to say (i.e. anything that can be described in proto could be described in Smithy ... except for maps that use booleans as keys, but who realistically does that 🤷). I'm trusting your judgement.

> Added the ability to constrain shape closures based on the presence of a trait.

This is really nice, kudos!