Enumerations as first class shapes (as opposed to string constraint)

Baccata commented 2 years ago

It might be pretty bad timing to even consider this for 2.0 of the IDL, but a few engineers I've talked to feel it would be a lot more natural if enumerations were a first-class shape rather than a constraint on strings.

The main rationale for this ask is that the encoding of enumerations in various protocols is often not string-based, but rather ordinal-based.

One example is protobuf, where enumerations are encoded as integers. As a matter of fact, my team wants to generate proto files from Smithy definitions so that the "source of truth" is written in Smithy and we can benefit from the Smithy tooling.
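To make the protobuf point concrete, here is a minimal Python sketch of the kind of generator described above. It assumes each enum member already carries an integer; `render_proto_enum` and the `Action` members are hypothetical names used only for illustration and are not part of any Smithy tooling.

```python
from typing import Dict


def render_proto_enum(name: str, members: Dict[str, int]) -> str:
    """Render a protobuf enum block from member names and their integers."""
    lines = [f"enum {name} {{"]
    # Emit members in ascending numeric order so the output is deterministic.
    for member, number in sorted(members.items(), key=lambda item: item[1]):
        lines.append(f"  {member} = {number};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical enum used for illustration.
print(render_proto_enum("Action", {"QUIT": 1, "MOVE": 0}))
```

A string-only enum gives such a generator nothing to put on the right-hand side of each `=`, which is the gap being described here.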

Smithy forcing enumerations to be strings feels like an OpenAPI-ism, and it forces tools that do not treat enumerations as strings to essentially violate the Smithy semantics, which isn't ideal.

mtdowling commented 2 years ago

This is a great topic. I was just recently talking about this with the Smithy team :) I'll collect my thoughts here so we can discuss.

The main driver for me for making enums a shape is primarily so that each enum value can be a member and have a dedicated shape ID, have traits associated with it, be filtered out of models just like other shapes, etc. Right now enum values are almost like members with a very limited set of properties. There are of course lots of trade-offs and complexities with this approach.

For me, enums with ordinal values aren't that important. I think representing enums as numbers is worse than strings, especially for web services and debuggability, which I'll touch on later.

Reconciling members with simple shapes

First off, the member/aggregate type problem. Enums are serialized as and treated like simple types in probably every protocol, and they almost always have some kind of scalar-like value associated with them in programming languages (a number, a constant, etc.). We'd want an enum in Smithy to be considered a simple type, but it would still have members. I think we'd need to make member shapes distinct from their current aggregate shape classification, and also redefine aggregate shapes as shapes that can contain one or more values. Aggregate shapes are currently defined around whether they have members.

Targeting a shape from enum members

Enum members would also need a shape to target. The work I’m doing with unit types in #980 could mean that the members target Unit, which would be fine; it would give it the same form as every other member without needing to target a meaningful type. We’d hide this in the IDL. Interestingly, unions could technically function as an enum if every member targets the unit type, but that's a degenerate case and not explicit enough.

Representing unknown enum values in code

Next up, we have to consider service evolution in a client/server interaction. Servers are going to need to add more enum values in the future (and IME, usually when someone thinks their enum will never need to change in the future, they're wrong). So regardless of whether we supported ordinal-based enums in addition to string-based enums, they both need to decompose down to simple values like strings and integers so that a client that receives a newly introduced enum value it doesn't recognize can still use the value or even send the value to the service without failing at deserialization time. If you look at the current set of Smithy code generators, they all allow enums to be passed around as strings or as more PL-like enum types.

That's actually one of the reasons enums were only a constraint trait on string shapes. It is a string shape, but a specialized string shape with known constant values. This is similar to how many languages can treat enums as a subtype of integer, but Smithy's enum is just a string today. Implementations need to have a way to represent unknown enums and to accept raw values in place of enums.
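As a rough illustration of that requirement, the following Python sketch shows one way a generated client type could stay a string underneath while still exposing known constants. It is not taken from any real Smithy code generator; `Action`, `KNOWN_ACTIONS`, `deserialize_action`, and `serialize_action` are hypothetical names.

```python
from dataclasses import dataclass

# Values this hypothetical client knew about when it was generated.
KNOWN_ACTIONS = frozenset({"move", "QUIT"})


@dataclass(frozen=True)
class Action:
    """A string-backed enum that tolerates values it does not recognize."""

    value: str

    @property
    def is_known(self) -> bool:
        return self.value in KNOWN_ACTIONS


MOVE = Action("move")
QUIT = Action("QUIT")


def deserialize_action(raw: str) -> Action:
    # Never fails: an unrecognized string is kept as-is instead of rejected.
    return Action(raw)


def serialize_action(action: Action) -> str:
    # The raw value round-trips, so an unknown value can be sent back.
    return action.value


newer = deserialize_action("jump")  # value added by the service later
print(newer.is_known, serialize_action(newer))
```

The point of the design is that the decomposed type (a string here) is always available, so a client built against an older model keeps working when the service adds a value.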

Enum serialization

Strings were also chosen in Smithy because they have a major advantage over ordinals: they are human-readable and have meaning independent of the model. This improves debuggability, wire logs, things like CloudTrail logs, error messages, and so on. They're human-readable without a reverse mapping. The downside is that they take up more space in memory than a more efficient int-based enum.

Ordinal enums make sense when you're only dealing with types in code, but when you start sending them over the wire, strings are superior. (As an AWS API Bar Raiser, I'd be hesitant to allow an AWS service team to define an ordinal-based enum.)
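A small Python sketch of the readability argument, assuming a JSON payload and a hypothetical two-value enum. The function names and the `ACTION_BY_ORDINAL` table are made up for illustration.

```python
import json

# Reverse mapping that a reader needs before an ordinal means anything.
ACTION_BY_ORDINAL = {0: "MOVE", 1: "QUIT"}


def wire_log_string(action: str) -> str:
    """Payload as it would appear in a log with a string-based enum."""
    return json.dumps({"action": action})


def wire_log_ordinal(ordinal: int) -> str:
    """Payload as it would appear in a log with an ordinal-based enum."""
    return json.dumps({"action": ordinal})


def explain_ordinal(ordinal: int) -> str:
    # Without the model at hand, an unmapped ordinal stays opaque.
    return ACTION_BY_ORDINAL.get(ordinal, f"UNKNOWN({ordinal})")


print(wire_log_string("QUIT"))
print(wire_log_ordinal(1), "->", explain_ordinal(1))
```

The string form can be read directly from the log line, while the ordinal form only becomes meaningful after a lookup against the model.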

Default values

Enums have to have a default value to make the default trait proposal work (see #920). With enum strings right now, the default value is an empty string, which is fine because (a) an enum could define an explicit empty value, and (b) implementations need to handle unknown values anyway. With an ordinal enum, the default would have to be 0.

Add an intEnum trait?

If we really wanted to add support for ordinal enums, then I think there's a reasonable argument for adding a new trait, just like the existing enum trait, that can only target an integer shape. This gives the same properties in that it would be a specialized integer, implementations would need to support unknown values as integers, etc., but it has the same drawback in that the values don't have real member shapes.

Add a built-in enum type that can be strings or numbers

Another option could be to add an enum shape that is either a string or number depending on the protocol. That sounds reasonable at first, but I don't see how this could work in Smithy because code generators need to generate types based on shapes independent of the protocol, and if the shape has no known decomposed type, then we don't have a good way of handling unknown enum values. We really need to know if an enum contains constant strings or constant integers.

Extend enum and intEnum from string and integer

We could also add an enum type that extends string (meaning, anything that supports strings also inherently supports enums, like the httpHeader trait for example); and we could also add an intEnum type that extends from integer. Both would be simple types, decompose to specific types (string/integer), and they'd have real members.

For example, an enum string (this could also be an automatic conversion when upscaling 1.0 models to 2.0):

enum Action {
    @enumValue("move")   // <-- optional string value
    MOVE

    QUIT  // <-- string value defaults to "QUIT"
}

An intEnum:

intEnum Action {
    @enumOrdinal(0) // <-- Explicit ordinals are required
    MOVE

    @enumOrdinal(1)
    QUIT
}

We'd have to require explicit ordinals because members can be added and removed from models based on filtering.
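The filtering concern can be sketched in a few lines of Python. The example assumes a hypothetical third member, `PAUSE`, that gets filtered out of the model; `implicit_ordinals` is an illustrative helper, not Smithy behavior.

```python
from typing import Dict, List


def implicit_ordinals(members: List[str]) -> Dict[str, int]:
    """Assign ordinals by declaration position (the approach being avoided)."""
    return {name: position for position, name in enumerate(members)}


full_model = ["MOVE", "PAUSE", "QUIT"]  # PAUSE is hypothetical
filtered_model = [name for name in full_model if name != "PAUSE"]

# Positional ordinals shift once a member is filtered out.
print(implicit_ordinals(full_model)["QUIT"])
print(implicit_ordinals(filtered_model)["QUIT"])

# Explicit ordinals travel with each member, so filtering leaves them alone.
explicit = {"MOVE": 0, "PAUSE": 1, "QUIT": 2}
explicit_filtered = {n: o for n, o in explicit.items() if n != "PAUSE"}
print(explicit_filtered["QUIT"])
```

With positional assignment, `QUIT` would mean a different number in the full and filtered models, which would break the wire format between parties using different projections.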

Ok, those are my opening thoughts for now. I'm very interested to hear yours.

Baccata commented 2 years ago

Lots of interesting thoughts there.

The main driver for me for making enums a shape is primarily so that each enum value can be a member and have a dedicated shape ID, have traits associated with it, be filtered out of models just like other shapes, etc.

👍 That is absolutely a pain point experienced in one of the use cases for Smithy. Any solution targeting it would be great.

Targeting a shape from enum members

Enum members would also need a shape to target. The work I’m doing with unit types in #980 could mean that the members target Unit, which would be fine; it would give it the same form as every other member without needing to target a meaningful type. We’d hide this in the IDL. Interestingly, unions could technically function as an enum if every member targets the unit type, but that's a degenerate case and not explicit enough.

So interestingly, that's exactly how Scala 3 models enumerations. As a matter of fact, Scala 3 introduces the enum keyword to define algebraic data types (which unions are somewhat related to), and enumerations are just a specific case.

Still interestingly (but way more anecdotal), the absence of input/output translating to Unit is exactly how I've modelled things in my tooling. Having it reified in the IDL would be amazing.

I do agree that having enumerations be their own thing is the right approach: people are used to it being the case, and it saves implementors of tooling from having to do a little dance to verify whether they are dealing with actual enumerations or plain unions.

Representing unknown enum values in code

Implementations need to have a way to represent unknown enums and to accept raw values in place of enums.

I think that's a little bit of a strawman: the ability for an enumeration to receive more values is protocol-dependent, and is conceptually similar to what the default trait solves for. My take on it, since I'm building tooling that is variance-aware (i.e. computes compatibility based on the position of shapes in inputs/outputs of operations), is that an unknown enum value would result in an error. If that is true in the protocols that I'm defining, I totally understand your position on the matter.
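For contrast with the tolerant approach, here is a minimal Python sketch of the strict behaviour described here, where an unknown value is a deserialization error. It uses the standard library `enum.Enum`; `Action` and `deserialize_strict` are hypothetical names and do not represent the tooling mentioned in this comment.

```python
from enum import Enum


class Action(Enum):
    """A closed enum: only the declared values are valid."""

    MOVE = "move"
    QUIT = "QUIT"


def deserialize_strict(raw: str) -> Action:
    # Lookup by value raises ValueError for anything not declared above.
    return Action(raw)


print(deserialize_strict("move"))

try:
    deserialize_strict("jump")  # value the client was not generated with
except ValueError as error:
    print("rejected:", error)
```

Whether this rejection is acceptable depends on the protocol's compatibility rules, which is the crux of the disagreement in this part of the thread.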

Similarly: whether enums have a default value is, I think, use-case/protocol-dependent (I'm going to avoid digressing on the subject considering we've already discussed it at length).

BTW: regarding variance-based compatibility rules, I've got this write-up on the subject.

Enum serialization

Strings were also chosen in Smithy because they have a major advantage over ordinals: they are human-readable and have meaning independent of the model.

I totally agree there, in the context of protocols where human-readable serialisation formats are used. But Smithy aiming to be protocol-agnostic implies that the problem of serialisation needs to be decoupled from the problem of data modelling (at least, to an extent).

Add an intEnum trait?

I'm not in favour of this, for the reason stated in my previous paragraph. I prefer approaching the problem as follows: how can we make it so that the concept of enumeration is (to an extent) decoupled from serialisation (or at least, appears to be decoupled from it)? To use an analogy: I really like that the concept of timestamp is first class in Smithy, because it is up to the protocol to state how timestamps should be encoded. If you had split the timestamp into two separate shapes, date-time and epoch, I'd have found it weird.
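One way to picture that decoupling is the Python sketch below, in which a member carries both a name and a number and each protocol picks its own encoding. Everything here (`Member`, the two encoder functions, the one-byte encoding) is a hypothetical illustration, not a description of how Smithy or any protocol works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """An enum member described independently of any wire format."""

    name: str     # symbolic name, usable by text-oriented protocols
    ordinal: int  # numeric tag, usable by binary-oriented protocols


MOVE = Member("MOVE", 0)
QUIT = Member("QUIT", 1)


def encode_for_text_protocol(member: Member) -> str:
    # A human-readable protocol would choose the name.
    return member.name


def encode_for_binary_protocol(member: Member) -> bytes:
    # A compact protocol would choose the number (one byte in this sketch).
    return member.ordinal.to_bytes(1, "big")


print(encode_for_text_protocol(QUIT), encode_for_binary_protocol(QUIT))
```

This sketch does not answer the earlier objection about unknown values, since a client receiving an unrecognized name or number still needs a known underlying type to hold it.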

Extend enum and intEnum from string and integer

enum Action {
    @enumValue("move")   // <-- optional string value
    MOVE

    QUIT  // <-- string value defaults to "QUIT"
}

intEnum Action {
    @enumOrdinal(0) // <-- Explicit ordinals are required
    MOVE

    @enumOrdinal(1)
    QUIT
}

I agree with this idea wholeheartedly. Firstly, the syntax is great and intuitive, but it also addresses my concern, and an "implicit relationship" between enums and strings/integers when used in combination with protocol-specific annotations is useful and pragmatic. It also solves my concern by offering a decoupling between enumerations and serialisation, even if only in appearance.

mtdowling commented 2 years ago

First-class enum shapes were released yesterday in IDL 2.0 (1.23 release): https://aws.amazon.com/blogs/developer/introducing-smithy-idl-2-0/
