modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/
Other
22.82k stars 2.58k forks source link

[Feature Request] Implement union types (rather than nominal enums) #43

Open nmsmith opened 1 year ago

nmsmith commented 1 year ago

Summary

I propose to consider adding TypeScript-style union types to Mojo — instead of Rust-style nominal enums. This approach seems like it would work well given that Mojo plans to extend Python, which is a dynamically-typed language. In addition to enabling type declarations for Python libraries, union types would allow Mojo users to avoid the modelling issues associated with nominal enums.

Union types are strictly more general than nominal enums, and so have the potential to make enums obsolete.

Update Jan 2024: I removed a section that was too speculative, and I posted new thoughts on how enums/unions might be modelled using dependent types. This approach would supercede most of the design details discussed in this original post.

Motivation

Advantage 1: Specifying precise return types

Most statically-typed languages that implement type-safe enums use nominal typing. For example, in Rust, to define a function that can return two kinds of error, one must first define (and name) an enumeration of those error kinds:

enum IOParseError {
    IOError(...),
    ParseError(...),
}

One can then specify the return type of the error-producing function as follows:

fn parseFile(file: &str) -> Result<Success, IOParseError>

This approach doesn't scale when different functions can return different combinations of errors. For example, maybe we have:

The issue here is that each combination of errors requires a different enum to be declared, and each enum needs to have unique constructor names. Ultimately, this makes expressing these return types using nominal enums a nightmare. (And this problem isn't unique to error types. There are many other scenarios in which this issue occurs.)

In contrast, modelling these functions in a dynamically-typed language (such as Python) is trivial. For example, the implementation of parseFile will simply return whatever error object is appropriate. No up-front declaration is required.

Because of this flexibility afforded by dynamically-typed languages, retroactively adding a type system to such languages has proved challenging. But in recent years, TypeScript has demonstrated that it is possible to statically-type these kinds of programs. TypeScript's trick is to offer union types, via a union operator |. The union operator can be used to directly express a disjunction of types. We can use the union operator to directly enumerate the specific errors that the aforementioned functions can return:

function parseFile():           Result<Success, IOError | ParseError> {...}
function readFile():            Result<Success, IOError> {...}
function parseString():         Result<Success, ParseError> {...}
function parseAndExecuteFile(): Result<Success, IOError | ParseError | RuntimeError> {...}

For brevity, I will omit a full description of how TypeScript's union types work. For more information, check out the relevant section of the TypeScript handbook.

Advantage 2: Static type-checking (and optimization) of Python code

Given that Python is a dynamically-typed language, Python programmers mix values of different types together in myriad ways. For example, the use of heterogeneous collections is common amongst Python programmers.

Accordingly, it would be beneficial if Mojo's static semantics could have a richer understanding of Python, so that more Python code is able to "just work" when copy-pasted into Mojo — even if the code is pasted into a statically-checked fn.

Notably, Python already supports the use of union types as part of its type annotations feature. However, Python assigns no static semantics to these annotations. In this post, I'm arguing that giving union types a rigorous static semantics will enable Mojo users to construct expressive, statically-typed code that can be compiled very efficiently.

Design

Developing a precise semantics for union types will require a large amount of work. Thankfully, TypeScript is a helpful piece of prior art. Scala 3 has union types as well. But as I explain below, I think we would want to differ in the fine details.

I'm unable to dedicate the time required to flesh out a full design proposal. However, I can skim over some potential questions and challenges that may arise:

This is probably enough detail for the moment. What does the Mojo team think of this approach?

lsh commented 1 year ago

For what it's worth, "Algebraic data types like ‘enum’ in Swift/Rust" is already on the roadmap.

lattner commented 1 year ago

Yes, this is a very important feature, also very important to Swift and many ML languages. Doing this right is super important. Thank you for filing this!

nmsmith commented 1 year ago

It's worth clarifying that my proposal is to avoid the design that Swift, Rust, and ML have implemented, owing to its lack of Modularity. 🥁

What are the next steps for this feature? Will Modular be sharing more thoughts in the future?

nmn commented 1 year ago

So looking at the original Issue and the responses a think there has been some confusion.

Support for Structural Union Types

@nmsmith is asking for Structural Union Types instead of Nominal Algebraic types such as in Swift and Rust. (Optionals in Swift are an exception)

The Good

This would be a big improvement in developer experience because you no longer need to “wrap” your types in enums when you are choosing one of multiple types.

The Bad

Usually, supporting true structural unwrapped union types means more runtime overhead as you need the ability to determine the actual type at runtime.

Nominal enums, on the other hand need minimal overhead (essentially, just a numberic “tag”) to determine which “case” it is.

Both Unions and Algebraic Data Type Enums should probably exist in the language

Mojo is an interesting case since it aims to be a superset of Python. Since Python is a dynamic language, all types in it can be determined at runtime. A structural Union type makes sense for this.

On the other hand, Mojo also provides primitives such as struct for the maximum possible performance.

Keeping this in mind, structural types and union types make sense for the “Python, but faster case”. But when you want to eek out the best possible performance from your program, you’d probably want to move from Structural Union types to Nominal Enums.

nmsmith commented 1 year ago

@nmn Yes, you've summarized the distinction well. Thanks for that.

I addressed the topic of runtime overhead in my original post, under the following bullet point:

Is it possible to compile union types as efficiently as nominal enums?

To summarize it: I believe that if you're using a structural union type in the same manner that you'd be using an enum, you can compile it in exactly the same way — i.e. efficiently, using numeric tags. (Why shouldn't that be the case? If the code is isomorphic to the code for enums, the compiler can use the same compilation strategy.)

If this is true, there would be no reason to support nominal enums in Mojo. They would be entirely redundant.

Also, it's worth demonstrating that translating any enum to a union type is trivial. Consider the following enum:

enum Foo {
    A(Int),
    B(Int, Int),
}

This can be re-written as a union type as follows. (Let's assume that our hypothetical programming language has a syntax for "tuple structs", in the style of Rust.)

struct A(Int)
struct B(Int, Int)

type Foo = A | B       // This is just a type ALIAS.

Notably, this requires the same number of lines of code as the enum declaration. The advantage is, of course, that structs A and B can thereafter be used in any number of unions:

type Bar = A | B | C
type Qux = C | A
lsh commented 1 year ago

I agree that many error types can be tough, but I would also caution against giant catch all error types. Sabrina Jewson has a great article describing the pitfalls of multierror unions.

nmn commented 1 year ago

I believe that if you're using a structural union type in the same manner that you'd be using an enum, you can compile it in exactly the same way — i.e. efficiently, using numeric tags.

This is not as simple as you suggest. But in order to not pollute this issue anymore I'm going to not go into detail here. TLDR; you'd need a unique tag for every type for what you're describing to be doable. You may argue that that is worth it, but it is overhead.

nmsmith commented 1 year ago

I'm assuming you mean each struct would need to have a globally-unique tag. In my original post I argued that you wouldn't need to do this. The tag that is associated with a particular member of a union can be chosen according to how the union is ultimately used throughout the program. If the union is used just like a normal enum — i.e. if you don't reuse those same structs in other unions — then they can be given the same tags that enums are given. And even if you do have overlapping unions, they can still be compiled like enums, as long as a struct value isn't passed between pieces of code that tag things differently. If a struct is used in that way, then the compiler would need to either re-map the tags at the point where a struct moves between unions, or monomorphize the program (the same trick used for generics) such that no remapping is necessary.

nmn commented 1 year ago

@nmsmith I know what you argued. I had written a full reply why that wouldn't be feasible but it got long so I decided against it.

You've generally already mentioned the problem I would describe. You'd need to know how the type is used during all execution. This is possible for small programs, but not when you start building libraries, partial re-compilation etc.

or monomorphize the program (the same trick used for generics) such that no remapping is necessary.

There's a bunch of complexity you're papering over here as well, but let's not get into that!

nmsmith commented 1 year ago

There are some challenges, but I think they are surmountable. It's just another optimization task — compilers are full of those. (Traditional enums are already subjected to various kinds of optimization.)

Of course, I'm not the one writing the compiler 😄.

lattner commented 1 year ago

What are the next steps for this feature? Will Modular be sharing more thoughts in the future?

We're not working on this in the immediate future, but when it comes up, we'll update it with thoughts. To clarify, I wasn't suggesting necessarily that they'd be nominal. I agree that making them be anonymous types would be nice to reduce a concept.

hardskulls commented 9 months ago

Are there some hard problems with this design? I saw something very similar in roc language. The author claims it's as efficient as in rust.

nmn commented 9 months ago

roc does have anonymous union types that work in a very interesting way.

List[A | B | C] is a more general type than List[A | B]. Thus, the cast is infallible.

This is only true for an immutable list. A mutable list would be invariant.

hardskulls commented 9 months ago

roc does have anonymous union types that work in a very interesting way.

List[A | B | C] is a more general type than List[A | B]. Thus, the cast is infallible.

This is only true for an immutable list. A mutable list would be invariant.

Oh, now I get it... Thanks!

nmsmith commented 7 months ago

I've had an epiphany recently. We might be able to model unions and trait objects really cleanly using dependent types. For this approach to work, we'd need the ability to:

  1. Store type-values at runtime.
  2. Pass around traits as values.

A Rust-style "trait object" could be modelled via a definition such as the following:

struct Object[Trait: AnyTrait]:
    var Type: Trait  # store the type at runtime
    var value: Type
    fn __init__[Type: Trait](inout self, value: Type):
        self.Type = Type
        self.value = value

Caveat: The above definition implies that value would be stored inline. Given that its type is not known until runtime, this would make Object[SomeTrait] a "dynamically-sized type". This would prevent instances from being stored in an array, which is not ideal. In practice, we'd probably want to store value indirectly, and access it via a pointer.

This type could be used as follows (ignoring the sizing issue):

# Note: The arguments here are being implicitly converted to `Object[Stringable]`
# using the standard rules for type conversion.
var items = Array[Object[Stringable]](Int(5), String("hello"))

for item in items:
    if item.Type is Int:
        # Now `item.value` can be treated as an `Int`
        item.value += 1

Now, a trait object is different from an enum, because the set of members of an enum is closed. To model an enum, we'd need a way to specify that a type can only take on a value from a fixed set. I'll use the syntax var Type: in iterable for this. The idea is that iterable would be a parameter expression whose value is iterable, and the compiler can obtain the set of alternatives by iterating through that set. The compiler can hopefully use this information to do exhaustiveness checking in match statements.

To define a struct that models enums, we'll need it to accept a variadic number of types. The following definition seems plausible:

struct Enum[*Types: AnyType]:
    var Type: in Types
    var value: Type
    fn __init__[Type: in Types](inout self, value: Type):
        self.Type = Type
        self.value = value

This data type can be used in the exact same manner as the Object type:

# Note: The arguments here are being implicitly converted to `Enum[Int, String]`
# using the standard rules for type conversion.
var items = Array[Enum[Int, String]](Int(5), String("hello"))

for item in items:
    if item.Type is Int:
        # Now `item.value` can be treated as an `Int`
        item.value += 1

This approach could be really promising. The beautiful thing about this approach is that it doesn't require enums/unions to be a language feature. Instead, they can be defined as a library, built on top of dependent types.

Note that the in Types syntax basically replicates the functionality of C-style enums—i.e. enums without a payload—sans the encoding as integers. So possibly the right way to think about the design is in terms of two layers:

  1. Enums without a payload (var thing: in iterable)
  2. Rust/Swift style enums with a payload (a.k.a. "tagged unions" or "algebraic data types")

The second construct is defined using the first. Obviously we can't call them both "enums", so appropriate names would need to be chosen. So to flesh out this design, we would first need to know how the enums-without-payload should work.

I also haven't mentioned anything about subtyping, e.g. how/whether Enum[A, B] can be used a context that expects an Enum[A, B, C]. That requires considering variance etc so it's quite a tricky topic. 🙂

Mogball commented 7 months ago

While I agree it would be really nice to implement these as purely library types, in practice I don't think we will without some compiler magic.

The first is the use of dynamic type values in type annotations. I think this is well-formed (dispatch dynamic queries into the runtime type value instead of statically), but I wouldn't put it on the critical path. Also it would have to be a boxed value, so Pointer[None] is probably fine.

The second is that the struct has to magically implement all the methods of the trait. Our metaprogramming is not good enough to express something like this. This is where compiler magic needs to come in for now.

Also, I think existentials should be spelled with Any[T] and object[T] is slightly different in that it contains an Any[T] but dynamically dispatches any methods, like Python objects.

The last thing is that with a first class IR representation, we can optimize dynamic dispatch away trivially.

nmsmith commented 7 months ago

the struct has to magically implement all the methods of the trait. Our metaprogramming is not good enough to express something like this. This is where compiler magic needs to come in for now.

The same problem occurs for Reference, right? e.g. for some type T: Fooable, it would be nice for a List[Reference[T]] to be passable to a function that expects List[some Fooable].

I can envisage a solution that works for both Reference and Any. Both of these data types are just "proxies" for another variable, so it makes sense that they can "inherit" the traits of their target variable. (Except perhaps for some of the built-in traits, e.g. Movable, where inheritability doesn't make sense.)

We can possibly integrate this idea with the notion of "dereferencing" (__refitem__). For example, any struct that can be dereferenced could inherit the traits of its referent—assuming that it doesn't implement those traits itself.

The first is the use of dynamic type values in type annotations. I think this is well-formed (dispatch dynamic queries into the runtime type value instead of statically).

For trait objects (Object/Any), yes, I'd imagine that would be the implementation strategy. But for the Enum type above, given that enums are limited to a statically-known set of alternatives, I'd argue that the type field should be encoded as an integer indicating which of the alternatives it is (for space reasons), and then:

More generally, any value with type in <iterable> can be:

nmsmith commented 6 months ago

(Below is just a brain dump. It's by no means a final verdict.)

I'm starting to suspect that storing types in runtime variables, e.g. var Type: AnyType, doesn't really make sense. In particular, while it might be possible to store a type in a local variable or a field, I'm not sure that it makes sense to store a type in a generic collection, because every such collection is defined in terms of a type variable with a trait bound, e.g. T: CollectionElement, which means that the collection can only store instances of types. Therefore, the only way to allow collections to store types would be to say that all types are instances of some other type, i.e. the type of an array that stores other types would be Array[Type], or Array[Struct]. But that raises all sorts of thorny questions about type universes. Maybe there's a way to do this cleanly—I'm not sure.

Storing types in vars would also raise the question of whether types should be value-semantic, or reference-semantic. One wouldn't expect writing var A = Int to create a new type A with a completely different identity.

To avoid all this complexity, we might be able to store types in runtime variables indirectly, via zero-size objects and some kind of variant/enum feature. For example:

struct Type[Value: AnyType]:
    pass
var SomeType: Variant[Type[Int], Type[String]] = ...

(I'm not sure how Variant itself would be defined, though.)

felix-andreas commented 5 months ago

Another approach to union types - which also avoids Rust's problem of overbroad errors types (e.g. any file system operation might return an io::Error of kind AddrInUse) - are Roc's tags. (For more see this talk)

It's essentially a middle ground between Typescript and Rust: The type is structural but variants are still nominal (or at least they have names): You don't have to declare a type first, but instead you can just make up tags on the fly and errors accumulate automatically (like in TypeScript), but it is still straightforward to pattern-match on tags (like in Rust).

In TypeScript the way to usually discriminate between different variants of a union is to check if certain fields are present or not (which sometimes can be cumbersome and error-prone).

bomzj commented 4 months ago

Look at Haskell, F#, Rescript, no need to reinvent or copy/paste bad habbits from Rust/Typescript.

hardskulls commented 1 month ago

Good news everyone!

I asked Roc's devs if it was possible to use Roc's tags in for a low-level language, and they confirmed that yeah, it's 100 percent possible, and their implementation is a efficient as that of Rust. Turns out that no one else is using it not because it was impossible, but because they are the first to do that sort of experiment (which turned out to be successful).

Link

nmsmith commented 1 month ago

@hardskulls as Richard points out on that thread, Roc's tags are basically the same as OCaml's polymorphic variants. So they're not necessarily anything novel.

That said, polymorphic variants are a point in the design space worth considering.

hardskulls commented 1 month ago

@hardskulls as Richard points out on that thread, Roc's tags are basically the same as OCaml's polymorphic variants. So they're not necessarily anything novel.

That said, polymorphic variants are a point in the design space worth considering.

I thought he meant it looks the same, but differs in implementation. I looked at OCaml's docs and it's mentioned that polymorphic variants do carry a little overhead. But Richard compares it to Rust in terms of performance and efficiency in his talks.