[Feature Request] Implement union types (rather than nominal enums)

nmsmith commented 1 year ago

Summary

I propose to consider adding TypeScript-style union types to Mojo — instead of Rust-style nominal enums. This approach seems like it would work well given that Mojo plans to extend Python, which is a dynamically-typed language. In addition to enabling type declarations for Python libraries, union types would allow Mojo users to avoid the modelling issues associated with nominal enums.

Union types are strictly more general than nominal enums, and so have the potential to make enums obsolete.

Update Jan 2024: I removed a section that was too speculative, and I posted new thoughts on how enums/unions might be modelled using dependent types. This approach would supercede most of the design details discussed in this original post.

Motivation

Advantage 1: Specifying precise return types

Most statically-typed languages that implement type-safe enums use nominal typing. For example, in Rust, to define a function that can return two kinds of error, one must first define (and name) an enumeration of those error kinds:

enum IOParseError {
    IOError(...),
    ParseError(...),
}

One can then specify the return type of the error-producing function as follows:

fn parseFile(file: &str) -> Result<Success, IOParseError>

This approach doesn't scale when different functions can return different combinations of errors. For example, maybe we have:

The function parseFile (above) that produces either an IOError or a ParseError.
A function readFile that only produces an IOError.
A function parseString that only produces a ParseError.
A function parseAndExecuteFile that produces either an IOError, a ParseError, or a RuntimeError.
...etc.

The issue here is that each combination of errors requires a different enum to be declared, and each enum needs to have unique constructor names. Ultimately, this makes expressing these return types using nominal enums a nightmare. (And this problem isn't unique to error types. There are many other scenarios in which this issue occurs.)

In contrast, modelling these functions in a dynamically-typed language (such as Python) is trivial. For example, the implementation of parseFile will simply return whatever error object is appropriate. No up-front declaration is required.

Because of this flexibility afforded by dynamically-typed languages, retroactively adding a type system to such languages has proved challenging. But in recent years, TypeScript has demonstrated that it is possible to statically-type these kinds of programs. TypeScript's trick is to offer union types, via a union operator |. The union operator can be used to directly express a disjunction of types. We can use the union operator to directly enumerate the specific errors that the aforementioned functions can return:

function parseFile():           Result<Success, IOError | ParseError> {...}
function readFile():            Result<Success, IOError> {...}
function parseString():         Result<Success, ParseError> {...}
function parseAndExecuteFile(): Result<Success, IOError | ParseError | RuntimeError> {...}

For brevity, I will omit a full description of how TypeScript's union types work. For more information, check out the relevant section of the TypeScript handbook.

Advantage 2: Static type-checking (and optimization) of Python code

Given that Python is a dynamically-typed language, Python programmers mix values of different types together in myriad ways. For example, the use of heterogeneous collections is common amongst Python programmers.

Accordingly, it would be beneficial if Mojo's static semantics could have a richer understanding of Python, so that more Python code is able to "just work" when copy-pasted into Mojo — even if the code is pasted into a statically-checked fn.

Notably, Python already supports the use of union types as part of its type annotations feature. However, Python assigns no static semantics to these annotations. In this post, I'm arguing that giving union types a rigorous static semantics will enable Mojo users to construct expressive, statically-typed code that can be compiled very efficiently.

Design

Developing a precise semantics for union types will require a large amount of work. Thankfully, TypeScript is a helpful piece of prior art. Scala 3 has union types as well. But as I explain below, I think we would want to differ in the fine details.

I'm unable to dedicate the time required to flesh out a full design proposal. However, I can skim over some potential questions and challenges that may arise:

If structs Foo and Bar have a common method (not a trait method), should a person with access to a value of type Foo | Bar be able to call it? (TypeScript allows this.)
- My belief is that no, this shouldn't be allowed. This utility that this would provide is already provided by traits/protocols. By that I mean: if you need to write a function that accepts either Foo or Bar and calls a method that they have in common, then you should put that method in a protocol P, have Foo and Bar implement that protocol, and then have your function accept an instance of P rather than Foo | Bar.
- However, let's assume your function truly needs to accept Foo | Bar, because you want to be able to pattern match on the argument and handle each struct differently. If Foo and Bar implement a common protocol P, should the methods of that protocol be callable? This time I think the answer is yes, this should be allowed. Indeed, I relied upon this behaviour earlier, when I described how union types could obviate "trait objects". In short: it should always be possible to upcast a value of type Foo | Bar to a value of type P.
- A similar rule would apply if Foo and Bar were protocols (rather than structs). A common method should only be callable on Foo | Bar if the method belongs to a protocol that both Foo and Bar inherit from. (This is assuming Mojo supports protocol inheritance.)
- Rationale:
  - If two structs (or two protocols) have field names or method names in common, that doesn't necessarily mean that those fields or methods mean the same thing. The overlap could be merely coincidental. (Indeed, the type signatures of the fields/methods could be drastically different.) In contrast, if two structs share a common protocol, then the two implementations of that protocol do mean the same thing. (Because the purpose of a protocol is to represent a particular meaning.) Thus, it makes sense that the protocol can be invoked.
Is it possible to compile union types as efficiently as nominal enums?
- I suspect that the answer is "yes", at least in cases where union types are used in the same manner as enums. For example, if one defines the union type type Color = Red | Green | Blue (where Red, Green, and Blue are pre-defined struct types with no fields), there is no reason that the compiler can't encode them via the integers 0, 1 and 2 when storing them in variables of type Color and when passing them as arguments of type Color. A difficulty arises when the same structs are used in multiple enums, e.g. type TrafficLight = Red | Yellow | Green. When these two types "interact" (e.g. a Color value is used as a TrafficLight argument), re-mapping of the tag values may need to occur, e.g. Green gets mapped from 1 to 2. Alternatively, the compiler can avoid the need for such re-mappings by monomorphizing function calls. For example, given a function with an argument of type TrafficLight, the compiler could generate a "special" version of the function where the tag of each TrafficLight color is expected to match the tag for the respective Color color (i.e. Green is encoded as 1). Call sites that provide a Color as an argument will invoke this version of the function.
If the elements of a type union are structs, rather than enum-style "variants" (known as data constructors in FP), how would pattern-matching and destructuring work?
- The best approach is probably to pattern-match directly on structs. No notion of a "constructor" is required. This is actually wonderfully consistent with Python: in Python, pattern matching is performed directly on objects.
- But how would pattern matching on structs work? Not all structs are "plain old data": many __init__ methods are not invertible. Thus, we would probably need a keyword or decorator to designate that a struct is destructurable. The @value decorator that Mojo offers could be used for this. In fact, Python's @dataclass decorator already works this way for classes. A more general approach would be to allow structs to define their own destructuring behaviour — analogous to how __getitem__ works, but for destructuring rather than indexing.
- Structs that are not destructurable can still be used in pattern-matching, as long as no attempt is made to destructure them. So instead of writing if x is Foo(a, b, c)... (a hypothetical syntax for destructuring x), one would write x is Foo, which merely tests whether x is an instance of the struct Foo. This test would be implemented in the same efficient manner as enums, as discussed earlier.
Union types are typically understood to work like sets, meaning that duplicate occurrences of a type are eliminated. For example, None | None = None. How should unification be treated?
- As a case study, one might try to define the type alias Optional[T] as follows: type Optional[T] = T | None However, this definition suffers from the issue (?) that Optional[None] is just None, and that Optional[Optional[Int]] is just Optional[Int]. This seems hard to reason about!
- The "nuclear option" to addressing this issue would be to treat type expressions whose alternatives are able to unify — i.e. "merge together" — as ill-constructed. So the above type alias would be disallowed, and users would have to write a definition like type Optional[T] = Some[T] | None instead (where Some is a pre-defined struct). It seems reasonable to impose this restriction, given that it remains strictly more expressive than traditional enums.
  - Possibly, the only thing that needs to be restricted is the unification of a type parameter with something. None | None doesn't seem so nefarious — and such unification will obviously happen during type inference.

This is probably enough detail for the moment. What does the Mojo team think of this approach?

lsh commented 1 year ago

For what it's worth, "Algebraic data types like ‘enum’ in Swift/Rust" is already on the roadmap.

lattner commented 1 year ago

Yes, this is a very important feature, also very important to Swift and many ML languages. Doing this right is super important. Thank you for filing this!

nmsmith commented 1 year ago

It's worth clarifying that my proposal is to avoid the design that Swift, Rust, and ML have implemented, owing to its lack of Modularity. 🥁

What are the next steps for this feature? Will Modular be sharing more thoughts in the future?

nmn commented 1 year ago

So looking at the original Issue and the responses a think there has been some confusion.

Support for Structural Union Types

@nmsmith is asking for Structural Union Types instead of Nominal Algebraic types such as in Swift and Rust. (Optionals in Swift are an exception)

The Good

This would be a big improvement in developer experience because you no longer need to “wrap” your types in enums when you are choosing one of multiple types.

The Bad

Usually, supporting true structural unwrapped union types means more runtime overhead as you need the ability to determine the actual type at runtime.

Nominal enums, on the other hand need minimal overhead (essentially, just a numberic “tag”) to determine which “case” it is.

Both Unions and Algebraic Data Type Enums should probably exist in the language

Mojo is an interesting case since it aims to be a superset of Python. Since Python is a dynamic language, all types in it can be determined at runtime. A structural Union type makes sense for this.

On the other hand, Mojo also provides primitives such as struct for the maximum possible performance.

Keeping this in mind, structural types and union types make sense for the “Python, but faster case”. But when you want to eek out the best possible performance from your program, you’d probably want to move from Structural Union types to Nominal Enums.

nmsmith commented 1 year ago

@nmn Yes, you've summarized the distinction well. Thanks for that.

I addressed the topic of runtime overhead in my original post, under the following bullet point:

Is it possible to compile union types as efficiently as nominal enums?

To summarize it: I believe that if you're using a structural union type in the same manner that you'd be using an enum, you can compile it in exactly the same way — i.e. efficiently, using numeric tags. (Why shouldn't that be the case? If the code is isomorphic to the code for enums, the compiler can use the same compilation strategy.)

If this is true, there would be no reason to support nominal enums in Mojo. They would be entirely redundant.

Also, it's worth demonstrating that translating any enum to a union type is trivial. Consider the following enum:

enum Foo {
    A(Int),
    B(Int, Int),
}

This can be re-written as a union type as follows. (Let's assume that our hypothetical programming language has a syntax for "tuple structs", in the style of Rust.)

struct A(Int)
struct B(Int, Int)

type Foo = A | B       // This is just a type ALIAS.

Notably, this requires the same number of lines of code as the enum declaration. The advantage is, of course, that structs A and B can thereafter be used in any number of unions:

type Bar = A | B | C
type Qux = C | A

lsh commented 1 year ago

I agree that many error types can be tough, but I would also caution against giant catch all error types. Sabrina Jewson has a great article describing the pitfalls of multierror unions.

nmn commented 1 year ago

I believe that if you're using a structural union type in the same manner that you'd be using an enum, you can compile it in exactly the same way — i.e. efficiently, using numeric tags.

This is not as simple as you suggest. But in order to not pollute this issue anymore I'm going to not go into detail here. TLDR; you'd need a unique tag for every type for what you're describing to be doable. You may argue that that is worth it, but it is overhead.

nmsmith commented 1 year ago

I'm assuming you mean each struct would need to have a globally-unique tag. In my original post I argued that you wouldn't need to do this. The tag that is associated with a particular member of a union can be chosen according to how the union is ultimately used throughout the program. If the union is used just like a normal enum — i.e. if you don't reuse those same structs in other unions — then they can be given the same tags that enums are given. And even if you do have overlapping unions, they can still be compiled like enums, as long as a struct value isn't passed between pieces of code that tag things differently. If a struct is used in that way, then the compiler would need to either re-map the tags at the point where a struct moves between unions, or monomorphize the program (the same trick used for generics) such that no remapping is necessary.

nmn commented 1 year ago

@nmsmith I know what you argued. I had written a full reply why that wouldn't be feasible but it got long so I decided against it.

You've generally already mentioned the problem I would describe. You'd need to know how the type is used during all execution. This is possible for small programs, but not when you start building libraries, partial re-compilation etc.

or monomorphize the program (the same trick used for generics) such that no remapping is necessary.

There's a bunch of complexity you're papering over here as well, but let's not get into that!

nmsmith commented 1 year ago

There are some challenges, but I think they are surmountable. It's just another optimization task — compilers are full of those. (Traditional enums are already subjected to various kinds of optimization.)

Of course, I'm not the one writing the compiler 😄.

lattner commented 1 year ago

What are the next steps for this feature? Will Modular be sharing more thoughts in the future?

We're not working on this in the immediate future, but when it comes up, we'll update it with thoughts. To clarify, I wasn't suggesting necessarily that they'd be nominal. I agree that making them be anonymous types would be nice to reduce a concept.

hardskulls commented 9 months ago

Are there some hard problems with this design? I saw something very similar in roc language. The author claims it's as efficient as in rust.

nmn commented 9 months ago

roc does have anonymous union types that work in a very interesting way.

List[A | B | C] is a more general type than List[A | B]. Thus, the cast is infallible.

This is only true for an immutable list. A mutable list would be invariant.

hardskulls commented 9 months ago

roc does have anonymous union types that work in a very interesting way.

List[A | B | C] is a more general type than List[A | B]. Thus, the cast is infallible.

This is only true for an immutable list. A mutable list would be invariant.

Oh, now I get it... Thanks!

nmsmith commented 7 months ago

I've had an epiphany recently. We might be able to model unions and trait objects really cleanly using dependent types. For this approach to work, we'd need the ability to:

Store type-values at runtime.
Pass around traits as values.

A Rust-style "trait object" could be modelled via a definition such as the following:

struct Object[Trait: AnyTrait]:
    var Type: Trait  # store the type at runtime
    var value: Type
    fn __init__[Type: Trait](inout self, value: Type):
        self.Type = Type
        self.value = value

Caveat: The above definition implies that value would be stored inline. Given that its type is not known until runtime, this would make Object[SomeTrait] a "dynamically-sized type". This would prevent instances from being stored in an array, which is not ideal. In practice, we'd probably want to store value indirectly, and access it via a pointer.

This type could be used as follows (ignoring the sizing issue):

# Note: The arguments here are being implicitly converted to `Object[Stringable]`
# using the standard rules for type conversion.
var items = Array[Object[Stringable]](Int(5), String("hello"))

for item in items:
    if item.Type is Int:
        # Now `item.value` can be treated as an `Int`
        item.value += 1

Now, a trait object is different from an enum, because the set of members of an enum is closed. To model an enum, we'd need a way to specify that a type can only take on a value from a fixed set. I'll use the syntax var Type: in iterable for this. The idea is that iterable would be a parameter expression whose value is iterable, and the compiler can obtain the set of alternatives by iterating through that set. The compiler can hopefully use this information to do exhaustiveness checking in match statements.

To define a struct that models enums, we'll need it to accept a variadic number of types. The following definition seems plausible:

struct Enum[*Types: AnyType]:
    var Type: in Types
    var value: Type
    fn __init__[Type: in Types](inout self, value: Type):
        self.Type = Type
        self.value = value

This data type can be used in the exact same manner as the Object type:

# Note: The arguments here are being implicitly converted to `Enum[Int, String]`
# using the standard rules for type conversion.
var items = Array[Enum[Int, String]](Int(5), String("hello"))

for item in items:
    if item.Type is Int:
        # Now `item.value` can be treated as an `Int`
        item.value += 1

This approach could be really promising. The beautiful thing about this approach is that it doesn't require enums/unions to be a language feature. Instead, they can be defined as a library, built on top of dependent types.

Note that the in Types syntax basically replicates the functionality of C-style enums—i.e. enums without a payload—sans the encoding as integers. So possibly the right way to think about the design is in terms of two layers:

Enums without a payload (var thing: in iterable)
Rust/Swift style enums with a payload (a.k.a. "tagged unions" or "algebraic data types")

The second construct is defined using the first. Obviously we can't call them both "enums", so appropriate names would need to be chosen. So to flesh out this design, we would first need to know how the enums-without-payload should work.

I also haven't mentioned anything about subtyping, e.g. how/whether Enum[A, B] can be used a context that expects an Enum[A, B, C]. That requires considering variance etc so it's quite a tricky topic. 🙂

Mogball commented 7 months ago

While I agree it would be really nice to implement these as purely library types, in practice I don't think we will without some compiler magic.

The first is the use of dynamic type values in type annotations. I think this is well-formed (dispatch dynamic queries into the runtime type value instead of statically), but I wouldn't put it on the critical path. Also it would have to be a boxed value, so Pointer[None] is probably fine.

The second is that the struct has to magically implement all the methods of the trait. Our metaprogramming is not good enough to express something like this. This is where compiler magic needs to come in for now.

Also, I think existentials should be spelled with Any[T] and object[T] is slightly different in that it contains an Any[T] but dynamically dispatches any methods, like Python objects.

The last thing is that with a first class IR representation, we can optimize dynamic dispatch away trivially.

nmsmith commented 7 months ago

the struct has to magically implement all the methods of the trait. Our metaprogramming is not good enough to express something like this. This is where compiler magic needs to come in for now.

The same problem occurs for Reference, right? e.g. for some type T: Fooable, it would be nice for a List[Reference[T]] to be passable to a function that expects List[some Fooable].

I can envisage a solution that works for both Reference and Any. Both of these data types are just "proxies" for another variable, so it makes sense that they can "inherit" the traits of their target variable. (Except perhaps for some of the built-in traits, e.g. Movable, where inheritability doesn't make sense.)

We can possibly integrate this idea with the notion of "dereferencing" (__refitem__). For example, any struct that can be dereferenced could inherit the traits of its referent—assuming that it doesn't implement those traits itself.

The first is the use of dynamic type values in type annotations. I think this is well-formed (dispatch dynamic queries into the runtime type value instead of statically).

For trait objects (Object/Any), yes, I'd imagine that would be the implementation strategy. But for the Enum type above, given that enums are limited to a statically-known set of alternatives, I'd argue that the type field should be encoded as an integer indicating which of the alternatives it is (for space reasons), and then:

Struct methods can be dispatched on value by offering flow-sensitive narrowing of the type field (in the style of TypeScript), and then doing static dispatch on whatever the type has been narrowed down to within a particular branch.
Trait methods can be dispatched on value in any context where the type has been narrowed to a point where all remaining alternatives implement the trait. To obtain a pointer to the actual table of trait methods, you'd need to add a bit of generated code that switches on the type field (the integer) and grabs a pointer to the correct table.

More generally, any value with type in <iterable> can be:

encoded as an integer (a position in the iterable)
narrowed via if and match
subjected to "exhaustiveness checking" in a match statement

nmsmith commented 6 months ago

(Below is just a brain dump. It's by no means a final verdict.)

I'm starting to suspect that storing types in runtime variables, e.g. var Type: AnyType, doesn't really make sense. In particular, while it might be possible to store a type in a local variable or a field, I'm not sure that it makes sense to store a type in a generic collection, because every such collection is defined in terms of a type variable with a trait bound, e.g. T: CollectionElement, which means that the collection can only store instances of types. Therefore, the only way to allow collections to store types would be to say that all types are instances of some other type, i.e. the type of an array that stores other types would be Array[Type], or Array[Struct]. But that raises all sorts of thorny questions about type universes. Maybe there's a way to do this cleanly—I'm not sure.

Storing types in vars would also raise the question of whether types should be value-semantic, or reference-semantic. One wouldn't expect writing var A = Int to create a new type A with a completely different identity.

To avoid all this complexity, we might be able to store types in runtime variables indirectly, via zero-size objects and some kind of variant/enum feature. For example:

struct Type[Value: AnyType]:
    pass
var SomeType: Variant[Type[Int], Type[String]] = ...

(I'm not sure how Variant itself would be defined, though.)

felix-andreas commented 5 months ago

Another approach to union types - which also avoids Rust's problem of overbroad errors types (e.g. any file system operation might return an io::Error of kind AddrInUse) - are Roc's tags. (For more see this talk)

It's essentially a middle ground between Typescript and Rust: The type is structural but variants are still nominal (or at least they have names): You don't have to declare a type first, but instead you can just make up tags on the fly and errors accumulate automatically (like in TypeScript), but it is still straightforward to pattern-match on tags (like in Rust).

In TypeScript the way to usually discriminate between different variants of a union is to check if certain fields are present or not (which sometimes can be cumbersome and error-prone).

bomzj commented 4 months ago

Look at Haskell, F#, Rescript, no need to reinvent or copy/paste bad habbits from Rust/Typescript.

hardskulls commented 1 month ago

Good news everyone!

I asked Roc's devs if it was possible to use Roc's tags in for a low-level language, and they confirmed that yeah, it's 100 percent possible, and their implementation is a efficient as that of Rust. Turns out that no one else is using it not because it was impossible, but because they are the first to do that sort of experiment (which turned out to be successful).

Link

nmsmith commented 1 month ago

@hardskulls as Richard points out on that thread, Roc's tags are basically the same as OCaml's polymorphic variants. So they're not necessarily anything novel.

That said, polymorphic variants are a point in the design space worth considering.

hardskulls commented 1 month ago

@hardskulls as Richard points out on that thread, Roc's tags are basically the same as OCaml's polymorphic variants. So they're not necessarily anything novel.

That said, polymorphic variants are a point in the design space worth considering.

I thought he meant it looks the same, but differs in implementation. I looked at OCaml's docs and it's mentioned that polymorphic variants do carry a little overhead. But Richard compares it to Rust in terms of performance and efficiency in his talks.

modularml / mojo