rust-lang / rfcs

RFCs for changes to Rust
https://rust-lang.github.io/rfcs/
Apache License 2.0
5.94k stars 1.57k forks source link

Anonymous sum types #294

Open rust-highfive opened 10 years ago

rust-highfive commented 10 years ago

Issue by glaebhoerl Saturday Aug 03, 2013 at 23:58 GMT

For earlier discussion, see https://github.com/rust-lang/rust/issues/8277

This issue was labelled with: B-RFC in the Rust repository


Rust has an anonymous form of product types (structs), namely tuples, but not sum types (enums). One reason is that it's not obvious what syntax they could use, especially their variants. The first variant of an anonymous sum type with three variants needs to be syntactically distinct not just from the second and third variant of the same type, but also from the first variant of all other anonymous sum types with different numbers of variants.

Here's an idea I think is decent:

A type would look like this: (~str|int|int). In other words, very similar to a tuple, but with pipes instead of commas (signifying or instead of and).

A value would have the same shape (also like tuples), with a value of appropriate type in one of the "slots" and nothing in the rest:

let foo: (~str|int|int) = (!|!|666);
match foo {
    (s|!|!) => println(fmt!("string in first: %?", s)),
    (!|n|!) => println(fmt!("int in second: %?", n)),
    (!|!|m) => println(fmt!("int in third: %?", m))
} 

(Nothing is a bikeshed, other possible colors for it include whitespace, ., and -. _ means something is there we're just not giving it a name, so it's not suitable for "nothing is there". ! has nothing-connotations from the negation operator and the return type of functions that don't.)

I'm not sure whether this conflicts syntax-wise with closures and/or negation.

Another necessary condition for this should be demand for it. This ticket is to keep a record of the idea, in case someone else has demand but not syntax. (If the Bikesheds section of the wiki is a better place, I'm happy to move it.)

SEE ALSO

Keavon commented 3 months ago

Thanks for laying that out clearly @Rufflewind, that's easy to understand.

Perhaps I come from a biased perspective as a TypeScript user with only a cursory understanding of type theory, but to me option 1 seems like the obvious, no-brainer, only good solution. But I must be missing the perspective to make a fair comparison. Are there downsides to option 1, and upsides to option 2? Is there a compelling reason to consider option 2 or are people just discussing it because it's technically the name of this issue?

programmerjake commented 3 months ago

Perhaps I come from a biased perspective as a TypeScript user with only a cursory understanding of type theory, but to me option 1 seems like the obvious, no-brainer, only good solution. But I must be missing the perspective to make a fair comparison. Are there downsides to option 1,

yes there are downsides, imo critically so for Rust, since Rust has lifetimes where each different lifetime makes references different types, but Rust is intentionally designed such that when generating the final compiled code, it never knows what lifetimes anything has (lifetimes have been erased). This means that using the TypeScript-style Union approach can't work for the following code:

fn do_match<T, F>(v: <T | F>) -> Result<T, F> {
    // this requires Rust to be able to distinguish T and F when generating code for this function,
    // which happens after lifetimes are erased. So, if `T = &'a u8` and `F = &'b u8`,
    // then when Rust is generating code for this function it sees
    // `T = &'erased u8` and `F = &'erased u8`, so are they the same type or not?
    //
    // If Rust decides they're the same type, then does Rust return Ok?
    // if so, then this can be used to convert one lifetime `'b` into another
    // lifetime `'a` in safe code, which is unsound since `'a` could live for
    // much longer than `'b` so code can use memory after it has been freed,
    // which is Undefined Behavior (aka. rustc is allowed to generate code that
    // does anything at all, such as crash, overwrite unrelated memory, delete
    // all your code, do exactly what you wanted, format your harddrive, etc.).
    // If Rust decides to return Err instead, you can still convert lifetimes,
    // just the other way around, so it's just as bad.
    //
    // If Rust decides they're not the same type, then when some other
    // function tries to call this function where `'a` and `'b` are actually the
    // same lifetime, then the caller will think it's passing a TypeScript-style
    // Union type with only one variant, but `do_match`'s code will have been
    // generated for a TypeScript-style Union type with two variants, which is
    // also unsound and Undefined Behavior!
    match v {
        T(t) => Ok(t),
        F(f) => Err(f),
    }
}
Keavon commented 3 months ago

Interesting... I suppose there will always have to be some level of compromises but it's a matter of picking the least bad ones, while also ensuring that we don't use small downsides as a reason against even attempting the larger gains. (The "but sometimes..." fallacy coined by Technology Connections in his video about LED stoplights being worse at melting snow so people don't want to switch to them at all, or engineer mitigations.)

So I suppose I have a couple questions:

Luca-spopo commented 3 months ago

I think two entirely unrelated features are being talked about concurrently

One would be just syntactic sugar to make anonymous "sum" types behave the same way as enums do currently. I would like to see this in the language.

My understanding:

The second features being talked about are a sum type (I will call it Union type).

My understanding:

I think what's missing for this second feature is making traits closed to extension. Right now we can have a trait U and two conforming structs A and B, but we cannot be 100% certain (usually) that A and B are the only structs that can conform to this trait. We could abuse the orphan rules to make this the case today, but that's not a nice way to do it, and I dont think the compiler will let us exploit the knowledge that A and B are the only structs that implement trait U.

Just a syntax to say "Hey, the only structs allowed to implement this trait are this list I am defining" should be able to serve the niche of a union type. We should be able to declare that dyn& U is going to be either an A or a B because there's no other possibilities.

kaivol commented 3 months ago

Rust has lifetimes where each different lifetime makes references different types, but Rust is intentionally designed such that when generating the final compiled code, it never knows what lifetimes anything has (lifetimes have been erased).

@programmerjake Aren't all types effectively erased at runtime? To distinguish lifetimes we would use the same mechanism as to distinguish types (so probably a discriminant).

If Rust decides they're not the same type, then when some other function tries to call this function where 'a and 'b are actually the same lifetime, then the caller will think it's passing a TypeScript-style Union type with only one variant, but do_match's code will have been generated for a TypeScript-style Union type with two variants, which is also unsound and Undefined Behavior!

I'm not exactly familiar with TypeScript unions, so could you expand on this? What exactly would cause Undefined Behavior here?


Beyond that, I agree that anonymous sum types including generics are a bit awkward. It probably shouldn't even be possible to match on Generic type parameters, but only on their bounds.

8573 commented 3 months ago

a sum type (I will call it Union type). ... Union<A, A> indistinguishable from A

This is a union type but not a sum type: a characteristic feature of a sum type A+B is that its number of possible values is the sum of the numbers of possible values of the summed types A and B, even if A = B. For example, enum E { A(u8), B(u8) } is a sum type u8+u8 and has 512 (= 2⁸ * 2) possible values, whereas a union type u8|u8 has only 256 (= 2⁸) possible values (because the two u8 types are equal and get collapsed into one).

programmerjake commented 3 months ago

Rust has lifetimes where each different lifetime makes references different types, but Rust is intentionally designed such that when generating the final compiled code, it never knows what lifetimes anything has (lifetimes have been erased).

@programmerjake Aren't all types effectively erased at runtime? To distinguish lifetimes we would use the same mechanism as to distinguish types (so probably a discriminant).

sort of (e.g. wasm can still distinguish between passing a i32 and a i64), but that isn't the relevant difference, which is that Rust, when generating the final code (at the stage where it substitutes the actual types into generics and generates separate functions for each combination of types used, so it does know the exact types then except without lifetime information).

If Rust decides they're not the same type, then when some other function tries to call this function where 'a and 'b are actually the same lifetime, then the caller will think it's passing a TypeScript-style Union type with only one variant, but do_match's code will have been generated for a TypeScript-style Union type with two variants, which is also unsound and Undefined Behavior!

I'm not exactly familiar with TypeScript unions, so could you expand on this? What exactly would cause Undefined Behavior here?

The key part of TypeScript-style unions here is that they act like enums except they collapse duplicate types all into one variant for each set of duplicate types. The part that causes Undefined Behavior is where the calling function tries to pass effectively an enum with one variant (so it usually just passes the variant's value without any data on which variant is passed since there's only one option), but the called function expects effectively an enum with two variants so is expecting data on which variant is active ... this ABI mismatch makes the function call Undefined Behavior.

programmerjake commented 3 months ago

Beyond that, I agree that anonymous sum types including generics are a bit awkward. It probably shouldn't even be possible to match on Generic type parameters, but only on their bounds.

I disagree, I think that anonymous sum types should be like tuples, where you select which one you want based on position, and selecting which one you want based on type should be syntax sugar at most -- so by the time any code is generated and/or any generics are substituted with actual types, all such matching is already converted to matching based on position so doesn't need to know if types are the same or not.

Dessix commented 3 months ago

Positional selection does very little to help us solve the primary ergonomics problems in Rust - most notably, error handling being atrocious. Sadly, reviewing the opening post, it does describe a positional selector mechanism, which ... While novel, leaves us in the doldrums with regard to solving existing, practical problems.

We should probably file or contribute to an RFC for anonymous / TypeScript-style (but automatically tagged, and thus safely matchable back to original type!) unions. Whether such an RFC works through sum types in the compiler or not is a matter of design and implementation, and - while it may have Rust-ABI implications in the long term - does not actually matter all that much for a description of intended usage and syntax.

burdges commented 3 months ago

I long proposed TinyBox<dyn Error> for error handling, but someone argued convincingly that returning Result<T,[usize; 2]> incurs too much overhead, so tagged unions likely incur too much overhead too.

I suspect error handling requires one first considers the actual calling convention. At present Rust only has variance from lifetimes, but we could consider either complicating variance or else providing implicit impl Froms for variant types, so either way Result<T,SpecificErrorZST> work, and compiles like Option<T> under the hood.

We'll likely need attributes that constructed the specific variant types, or at least tell the compiler what's going on.

pub enum MyErrors {
    #[variant_type_zst]
    ErrorX,
    #[variant_type_zst]
    ErrorY,
    #[variant_type]
    #[unique_non_zst]
    ErrorZ(&'static str),
}

fn foo(..) -> Result<T,MyErrors::ErrorX> { .. }

Along similar lines, you could imagine explicitly merging enums, which only works within one crate and adjusts discriminants, like

pub enum CrateError union {
    module1::Error,
    module2::Error,
    module3::Error,
    // AnotherCrateError // Invalid since discriminant defined in another compilation unit
}

Anyways, there are perforamance concerns which require care, so anonymous sum types might fit poorly.

Dessix commented 3 months ago

... someone argued convincingly that returning Result<T,[usize; 2]> incurs too much overhead, so tagged unions likely incur too much overhead too.

If they can't afford to use tagged unions in their errors, they can choose not to do so. The point here is to supplant the need for layers of wrapping thiserror or even boxing / heap-allocating newtypes that do nothing but bog down user codebases.

The whole point of bringing this up was to make it possible to avoid writing all of that boilerplate, since all of that extra work just gets us ... Java's old Checked Exceptions, which were too boilerplatey for anyone to actually use back then either, despite being less work than our current state of affairs.

Rufflewind commented 3 months ago

A hypothetical syntax for anonymous sum types with labeled choices might be:

fn foo(...) -> Result<T, Err(enum { X, Y(i32), Z(String) })> { ... }

fn bar(e: enum { X, Y(i32), Z(String) }) {
    match e {
        enum::X => { ... },
        enum::Y(i) => { ... },
        enum::Z(s) => { ... },
    }
}

which could desugar into, say:

enum __AutogeneratedEnum735fc68a {
  X         = 0x4b68ab38,
  Y(i32)    = 0x18f5384d,
  Z(String) = 0xbbeebd87,
}

fn foo(...) -> Result<T, __AutogeneratedEnum735fc68a> { ... }

fn bar(e: __AutogeneratedEnum735fc68a) {
    match e {
        __AutogeneratedEnum735fc68a::X => { ... },
        __AutogeneratedEnum735fc68a::Y(i) => { ... },
        __AutogeneratedEnum735fc68a::Z(s) => { ... },
    }
}

The discriminants are generated deterministically from the labels "X", "Y", and "Z" respectively. (This toy example uses sha256.) Deterministic discriminants would allow for some degree of cheap enum-to-enum coercion, e.g. perhaps Err(enum { X, Y(i32), Z(String) } could be upcasted to Err(enum { W(...), X, Y(i32), Z(String) }.

Dessix commented 3 months ago

Why do we erase type identity prior to solving which type we're passing via which paths? This seems to propose that we can't fix that - or even patch it over with a rule like "matching types combine to the shortest lifetime"? Giving up on ergonomics because our type system throws information away too early and instead making the user resolve the typing, whether by position or by name, seems like a rather golang direction to take.

glaebhoerl commented 3 months ago

I'll give this one more try:

I originally opened this issue back in 2013. It is about anonymous sum types with positional matching, for symmetry with tuples. No extra type-based magic, just as with tuples. I agree that type-based features also seem potentially valuable, but please, I beg, take those discussions to other issues and threads, because they are separate features.