metagn commented 1 year ago

Abstract

Add a structural, unordered sum type to the language, equivalent in the backend to a corresponding object variant type.

Motivation

From #525:

Nim has variant objects, which are capable of modelling [sum types]. These variant objects are useful [...], but have a number of serious drawbacks [...].

The names of fields in variant objects must be unique - they cannot be repeated, even across different variants.

A separate enum must function as a "kind" tag and be discriminated on by case within the object. Having a separate single-use enum is undesirable, and having to discriminate against it is awkward - especially when there are no shared fields, as is often the case.

On top of these, I would add:

correct field access for object variant branches is a bit indirect to statically enforce as each branch is attached to the value of a field in the object (although it can still be done)
it's hard to model "object variant branches" as a concrete structure which makes things like clearing their fields awkward (#56)
object variants are nominal by design and this is not always convenient or expressive

525 and many other proposals propose some version of ADTs to deal with these problems. However this still has issues:

ADTs are usually a package deal with some kind of pattern matching scheme which the language does not currently have
- pattern matching is also generally proposed to use the case syntax which is ambiguous with the existing case syntax which can include complex expressions in its discriminators that evaluate to values
ADT variants as product types have to choose between one or both of ordered fields or named fields which is redundant over the existing split mechanism of object and tuple types
ADTs are still nominal, so is each variant type which is redundant over enums

Description

In this proposal I will use the temporary placeholder syntax of union(A, B, ...) to represent these sum types. I like the syntax {A, B, ...} instead (at least as sugar) due to both 1. mirroring with (A, B, ...) tuple syntax and 2. similarities in behavior with sets, but this syntax might be hard to make out in this post.

Basically we add a new type kind to the language that has an indefinite number of children that are all nominally unique concrete types, i.e. A, B = distinct A and C = distinct A, D[T] = distinct A can form a type union(A, B, C, D[A], D[B]) but A, B = A, C = sink A, D[T] = A etc can only form union(A). In any case union(A, B) is also equal to union(B, A), meaning internally the children of the type are sorted under some scheme.

The type relation between A and union(A, ...) is that they are convertible. The subtype relation as in inheritance might also work but for now this seems the most simple.

In the backend, Foo = union(A, B, ...) (where the children are sorted) becomes equivalent to something like:

type
  FooKind = enum
    fooA, fooB, ...
  Foo = object
    case kind: FooKind
    of fooA: fieldA: A
    of fooB: fieldB: B
    # ...

For the sake of safety in the case of uninitialized values or efficiency in the case of things like Option we can also introduce a none/nil kind that represents no type in the union. This would be unavailable on the frontend, values of these types must always be initialized to a value of some type in the union.

type
  FooKind = enum
    fooNone, fooA, fooB, ...
  Foo = object
    case kind: FooKind
    of fooNone: discard
    of fooA: fieldA: A
    of fooB: fieldB: B
    # ...

Construction

Construction of values of these types is as simple as a type conversion between the element type to the sum type. That is:

var x: union(A, B, ...)
x = A()
x = B()
# ...

first transforms into the following at type-check time:

type Foo = union(A, B, ...) # sorted
var x: Foo
x = Foo(A())
x = Foo(B())

which the backend can then turn into:

type
  FooKind = enum fooA, fooB, ...
  Foo = object
    case kind: FooKind
    of fooA: fieldA: A
    of fooB: fieldB: B
    # ...

var x: Foo
x = Foo(kind: fooA, fieldA: A())
x = Foo(kind: fooB, fieldB: B())

Destructuring & inspection

We can very trivially reuse the existing case and of frontend mechanisms to check which type they are at runtime with (I believe) zero incompatibilities. And again destructuring just becomes a type conversion.

type
  A = distinct int
  B = object
    name: string
    age: int
  Foo = union(A, B)

proc foo(x: B) =
  echo x

proc foo(x: Foo) =
  case x
  of A: # A here is a typedesc, which is a concrete value at compile time, so this is compatible with how `case` works in the current language
    echo "kind A"
    echo A(x).int # conversion
  of B:
    echo "kind B"
    let x = B(x)
    foo x # can operate on individual variant types
  # possible types are exhausted, no need for else:

var x = Foo(A(12))
assert x of A
assert not (x of B)
foo x # kind A; 12
x = B(name: "John", age: 34)
assert x of B
assert not (x of A)
foo x # kind B; (name: "John", age: 34)

In the backend:

proc foo(x: Foo) =
  case x.kind
  of fooA:
    echo "kind A"
    echo x.fieldA.int
  of fooB:
    echo "kind B"
    let x = x.fieldB
    foo x

var x = Foo(kind: fooA, fieldA: A(12))
assert x.kind == fooA
assert not (x.kind == fooB)
foo x
x = Foo(kind: fooB, fieldB: B(name: "John", age: 34))
assert x.kind == fooB
assert not (x.kind == fooA)
foo x

A limitation is that there is no good way to have the information of the exact type of the union as a value on the frontend (what would normally be x.kind), but we could maybe alleviate this by allowing to attach these to an enum type, i.e. union(fooA: A, fooB: B, ...). But then the question arises of whether these types are compatible with other union types of the same elements but with a different/no attached enum. In any case you can generate a case statement like case x of A: fooA of B: fooB ... but this would be less efficient than just using the kind in the backend.

Other points

A frequently mentioned use case in #525 was recursion with pointer indirection. In the current language this works in union typeclasses but not in tuple types: the manual mentions that "In order to simplify structural type checking, recursive tuples are not valid". Maybe recursive unions can just be nominal? Or the canonicalization scheme (which tuples don't have) can account for recursion.
This is not an alternative solution to pattern matching or object variants, it's just an alternate solution to ADTs for the current problems with object variants, and ADTs happen to require pattern matching while this doesn't (but this is still compatible with pattern matching). People tend to see object variants as black and white, worse or better than other representations of sum types but in practice they both have their uses especially in such a general purpose language.
- It should also be ridiculously easy to implement an ADT definition macro with this, i.e. type List[T] = adt Nil | Node[T](value: T, next: List[T])
This has partially been implemented in a library (https://github.com/alaviss/union/) but a library solution is not sufficient for proper use of this:
1. Type inspection mechanisms for macros are not really sophisticated enough
2. Type conversion is not consistent with the rest of the language, and in general nonstandard syntax additions will need to be used
3. Supposedly limitations with relation to generics but this might be by design
4. I believe this library uses inheritance and not discriminated unions which might be less performant
5. Use of the VM influences compile times
union(A, B, C) should probably not be equal to union(A, union(B, C)) since union(B, C) is still a concrete type; we can have an operation + that "merges" union types, so that A + union(B, C) + D or union(A, B) + union(C, D) all give a flattened union(A, B, C, D). This implies unions have set behavior, which might be the generics limitation mentioned above; it is nontrivial and in some cases impossible to infer generic parameters from these types.

Links

union library by alaviss where this is implemented to a limited degree https://github.com/alaviss/union/
Crystal unions which are like this but seem to be merged with the typeclass union type (or or | in nim) which is not the goal here https://crystal-lang.org/reference/1.8/syntax_and_semantics/union_types.html
initial comment where this was proposed in RFC this was inspired from https://github.com/nim-lang/RFCs/issues/525#issuecomment-1560784693

Code Examples

type
  A = object
    x, y: int
  Foo = union(int, A)

assert Foo is union(A, int)

proc `$`(f: Foo): string =
  if f of int:
    result = $int(f)
  elif f of A:
    let f = A(f)
    result = "A(" & $f.x & ", " & $f.y & ")"
  case f
  of int: $int(f)
  of A: "A(" & $A(f).x & ", " & $A(f).y & ")"

var f: Foo
f = A(x: 1, y: 2)
assert $f == "A(1, 2)"
f = 456
assert $f == "456"

type
  StringLeaf = distinct string
  IntLeaf = distinct int
  ParentTree = distinct seq[Tree]
  Tree = ref union(StringLeaf, IntLeaf, ParentTree)
let tree = Tree(ParentTree(@[StringLeaf("abc")]))

Backwards Compatibility

Should be completely compatible

konsumlamm commented 1 year ago

So for defining a normal ADT, you first need to define a distinct/object for each variant? I don't see the advantage of this over directly supporting ADTs.

metagn commented 1 year ago

I already wrote all the points above but I guess more succinctly:

construction and deconstruction are not tied to pattern matching and just become reuse of existing type conversion, case, of mechanisms
ADTs have a large amount of redundancy with existing types, this is more atomic
additional functionality gained by making these structural

Also "defining a normal ADT" does not have to be any different than that. It would be really simple to build a macro that defines one of these types based on ADT syntax (I used the syntax type List[T] = adt ... above but that wouldn't work with generics):

type List[T] {.adt.} = Nil | Node[T](value: T, next: List[T])
# becomes
type
  Nil = object # or Nil[T] = object
  Node[T] = object
    value: T
    next: List[T]
  List[T] = union(Nil, Node[T])

If anything "you have to define distinct/object types for each variant" is a good thing, because the information about each variant is available as an existing type. You also get to act on these as a "set of types", meaning you can break them down into their parts, or tack on new types easily.

Imagine you have a type like this:

type
  FooKind = enum fooInt, fooFloat, foo32Array
  Foo = object
    case kind: FooKind
    of fooInt:
      i: int
    of fooFloat:
      f: float
    of foo32Array:
      arr: array[32, Foo]

Now imagine if you had a large seq of Foo that only contained fooInt and fooFloat nodes. You wouldn't define it as seq[Foo] because then it would use 30x as much memory than if you just considered the branches of Foo that used int or float. Instead you have to do something like:

type
  FooKind = enum fooInt, fooFloat, foo32Array
  Foo = object
    case kind: FooKind
    of fooInt:
      i: int
    of fooFloat:
      f: float
    of foo32Array:
      arr: array[32, Foo]
  FooIntFloat = object
    case kind: FooKind
    of fooInt:
      i: int
    of fooFloat:
      f: float
    else: discard

proc intFloatOnly(foo: Foo): FooIntFloat =
  case foo.kind
  of fooInt: FooIntFloat(kind: fooInt, i: foo.i)
  of fooFloat: FooIntFloat(kind: fooFloat, f: foo.f)
  else:
    raise newException(FieldDefect, "runtime error for non-int-float kind!")

proc backToFoo(fif: FooIntFloat): Foo =
  case fif.kind
  of fooInt: Foo(kind: fooInt, i: fif.i)
  of fooFloat: Foo(kind: fooFloat, f: fif.f)
  else: discard # unreachable

var s = newSeqOfCap[FooIntFloat](2000)

proc generateFoo(n: int): Foo =
  if n mod 2 == 1:
    Foo(kind: fooInt, i: n)
  else:
    Foo(kind: fooFloat, f: n.float)

proc consumeFoo(foo: Foo) =
  echo foo

for i in 1..2000:
  let foo = generateFoo(i)
  s.add(intFloatOnly(foo))

for x in s:
  let foo = backToFoo(x)
  consumeFoo(foo)

ADTs just make the syntax for this nicer, and actually make it worse because you cannot reuse FooKind, instead for each branch of the restricted type you have to come up with either conflicting names or names distinct from the original.

That is, unless you introduce some "restricted ADT" type like Int | Float that is smart enough to use the smallest size and automatically generate the conversion between it and the real ADT type. Except you cannot use | because it's used for the union typeclass. Maybe union(Int, Float)?

Now with this proposal:

type
  Int = distinct int
  Float = distinct float
  Array32 = distinct array[32, Foo]
  Foo = union(Int, Float, Array32)
# or, assuming an `adt` macro exists 
type Foo = adt Int(int) | Float(float) | Array32(array[32, Foo])

var s = newSeqOfCap[union(Int, Float)](2000)

proc generateFoo(n: int): Foo =
  if n mod 2 == 1:
    Int(n)
  else:
    Float(n.float)

proc consumeFoo(foo: Foo) =
  echo foo

for i in 1..2000:
  let foo = generateFoo(i)
  case foo
  of Int: s.add(Int(foo))
  of Float: s.add(Float(foo))
  # we get to deal with invalid cases at the callsite because it's much less cumbersome
  else: raise newException(FieldDefect, "shouldn't happen")

for x in s:
  let foo =
    case x
    of Int: Foo(Int(x))
    of Float: Foo(Float(x))
  consumeFoo(foo)

I have to be clear here though that I am not pushing the union(A, B) syntax, it's just the one that came to mind for the purpose of demonstration.

metagn commented 1 year ago

Something this could maybe do away with to make the implementation reasonable is order-invariance, i.e. union(int, float) is not union(float, int). This would keep a lot of the benefits of structural typing as well as make the potential feature of matching to enums i.e. union(fooA: int, fooB: float) much easier to implement.

A common use case I didn't mention above would be:

type Error = distinct string
proc foo(x: int): union(int, Error) =
  if x < 0:
    return Error("argument was negative")
  if x mod 2 == 0:
    return Error("argument was odd")
  result = (x - 1) div 2

let res = foo(122)
case res
of Error: echo "got error: ", string(Error(res))
of int: echo "got successful result: ", int(res)

Stuff like this would not be affected by order relevance.

Adding onto that example, this being language level means we could optimize things like Option[range[1..5]] = union(void, range[1..5]) into range[0..5] in the backend. I think Rust does optimizations like this for enums.

Araq commented 1 year ago

order-invariance is a completely alien concept for Nim and I don't like it.

arnetheduck commented 1 year ago

Adding onto that example, this being language level means we could optimize things like Option[range[1..5]] = union(void, range[1..5]) into range[0..5] in the backend.

There are two levels of order-invariance - I argue here that there are significant benefits to have it at the ABI level along with other freedoms such as moving fields around between objects, flattening them in other ways than current Sup, joining "smaller" fields like bool etc - I don't think that should extend to the language level however - ie in source code, order should remain significant.

Araq commented 11 months ago

This design fundamentally merges a runtime value (often called "tag") with a typename. This seems to have unforeseeable interactions with Nim's meta-programming:


type
  U = union(Foo, Bar)

macro inspectType(t: typedesc)

inspectType Foo # valid, Foo is a type.

macro inspectValue[T](t: static T)

inspectValue Foo # also valid?

The RFC also offers no solution for pattern matching. But conversions like string(Error(res)) are a poor substitute and it's not obvious that these conversions cannot fail at runtime (or can they?).

The fact that this sum type is structural is not a huge benefit as the 2 most important structural types are easily replicated via generics: Opt[T] and Either[X, Y]. On the contrary a nominal type can naturally offer a couple of pragmas that influence the layout, where hidden gaps can be exploited or even ABI versions. These things are much harder to do with a structural type where it's encouraged to repeat single constructions like union(int, ErrorCode) everywhere.

Araq commented 11 months ago

Once again, I arrive at something like:


type
  Node = ref enum
  of BinaryOpr: 
    x, y: Node
  of UnaryOpr:
    x: Node
  of Name(string)

  Option[T] = enum
  of None()
  of Some(T)

  Either[A, B] = enum
  of Le(A)
  of Ri(B)

metagn commented 11 months ago

merges a runtime value with a typename

This isn't a huge frontend issue, expressions can already have type typedesc (which is incompatible with static), the compiler would internally understand their meaning in these contexts. Though this still has problems, for example a naive case exhaustiveness check would be O(n^2) in terms of sameType calls. We could still have a tag value separate from the type i.e. Node = union[BinaryOp: (Node, Node), UnaryOp: Node, Name: string], but this wouldn't have an obvious construction/deconstruction syntax.

it's not obvious that these conversions cannot fail at runtime (or can they?)

It wouldn't be different from the current situation with object variant branch access, which I realize now pattern matching is better for.

If we don't reuse existing types, the symbols like BinaryOpr, Some have to behave entirely local to their parent type, i.e. the expression None() is invalid, we have to do Option[int].None() (or this gets inferred). This implies the entire expression pattern None() or Some(x) etc. is like a parametrized enum value subject to overloads (including overloads with respect to generic parameters) as opposed to being a constructor of a type None or Some. An implementation would have to pay attention to this.

Araq commented 11 months ago

An implementation would have to pay attention to this.

Correct, but we have been making enum symbols smarter ("overloadable") already.

nim-lang / RFCs

Structural sum type #527

Abstract

Motivation

525 and many other proposals propose some version of ADTs to deal with these problems. However this still has issues:

Description

Construction

Destructuring & inspection

Other points

Links

Code Examples

Backwards Compatibility