miking-lang / miking

Miking - the meta viking: a meta-language system for creating embedded languages

Replacing Tensors with Multidimensional Arrays #643

Open johnwikman opened 1 year ago

johnwikman commented 1 year ago

There have been some discussions lately about changing the type syntax for tensors, which currently looks like Tensor[a].

Tensors, as we want them, are to be fixed in size and (optionally?) mutable. A suggestion from my side would be to introduce arrays instead. In the normal case, they are just 1-dimensional, but the syntax would allow for arrays with arbitrary dimension.

Recall the syntax of sequences:

Examples of syntax for arrays (I am not super happy with these):

These are the various syntax suggestions I could come up with. If anyone else has a better idea, please bring it forth. I am not really happy with any of these.

However, I do prefer the Int [a1, a2, ..., an]-syntax out of these. Maybe this could work if we separate the [a1, a2, ..., an] into its own dimension type. Maybe this is already an established concept, but here are my thoughts on it:

So for completeness:

lam m. Float [m] defines a 1D float array. However, in this case it suffices to write just Float [m], since we know that Float does not take any type arguments. Hence the [m] is inferred to be a dimension type rather than a sequence type.
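To illustrate, here is a hypothetical sketch of how this dimension-type syntax could look in use (the names and details are assumptions, not settled syntax):

```mc
-- Hypothetical sketch of the proposed dimension-type syntax.
let v : Float [10] = ... in                   -- 1D array of 10 floats
let m : Float [3,4] = ... in                  -- 3x4 matrix of floats
let sum : all n. Float [n] -> Float = ... in  -- polymorphic over the length n
```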

Some additional thoughts:

One idea with using the lam n. notation, rather than say size n., is to enable annotations and static range checking. Suppose we have a 2D array let x: Float [m,n] = ... in. Then we can get a value from the array using the intrinsic arr2dgetExn (get a value from a 2D array):

let arr2dgetExn: all a. a [m,n] -> Int -> Int -> a = ...
arr2dgetExn x 2 3

However, this might result in an error or exception at runtime. If we instead use a variant with range checking, we could avoid this. In this case we would have

let arr2dget: all a. a [m,n] -> Int in [0,m) -> Int in [0,n) -> a
arr2dget x 2 3

What is not shown here is that there would have to be a static check at compile time verifying that 2 < m and 3 < n. In practice, this would most likely require a manual check in the code, whose information would then propagate into the if-statement and allow the compiler to verify the bounds. This moves the potential runtime error (such as an index out of bounds) from being hidden in the intrinsic to being explicit in the code. The idea is that the error can be handled in a much better way if the user is explicitly forced to perform error handling in their code, which they should do anyway, but often don't.
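The manual check could look roughly like this (a hypothetical sketch; safeGet and the exact refinement rules are assumptions):

```mc
-- Hypothetical sketch: an explicit bounds check whose information the
-- compiler can use to refine i and j to Int in [0,m) and Int in [0,n).
let safeGet : all a. a [m,n] -> Int -> Int -> Option a =
  lam x. lam i. lam j.
    if and (and (geqi i 0) (lti i m)) (and (geqi j 0) (lti j n)) then
      Some (arr2dget x i j) -- in this branch the bounds are known to hold
    else None ()
```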

This would not be limited to integers; floats could also use this feature:

let gaussianSample: Float -> Float in (0,inf) -> Float = lam mu. lam sigma. ... in
let loguniformSample: Float in (0,x] -> Float in (x,inf) -> Float = lam a. lam b. ... in

In this case, the intrinsics can assume that these calls will never throw an error, and make optimizations based on that.
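For example, a caller would establish the refined type with an explicit check before the call (again a hypothetical sketch):

```mc
-- Hypothetical sketch: the check refines sigma to Float in (0,inf),
-- so gaussianSample needs no internal validity check.
let sampleOrFail : Float -> Float -> Float = lam mu. lam sigma.
  if gtf sigma 0.0 then
    gaussianSample mu sigma -- sigma is known to be positive here
  else error "sigma must be positive"
```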

aathn commented 1 year ago

> Tensors, as we want them, are to be fixed in size and (optionally?) mutable. A suggestion from my side would be to introduce arrays instead. In the normal case, they are just 1-dimensional, but the syntax would allow for arrays with arbitrary dimension.

Would this be any different from tensors, except for the terminology used?

For the syntax, how about having a type constructor Arr[n,m,...]? We would have for instance Arr[3] Int representing the type of 3-element integer arrays, or the type Arr[10,10] Float representing 10x10 Float matrices.

For the dimension types, I think it seems strange that [m,n] can appear both as a type of its own (as in Option [m,n]) and as a 'dimension specifier' (as in Int [m,n]). I see these as two fundamentally different usages that don't necessarily need to be coupled together (and in addition, the syntax is unfortunate: the two types have completely different meanings yet look the same). I don't see a real utility of having [m,n] as a standalone type at the moment... When would the function lam x: [n,m]. getDim0 x be useful? Instead, we could allow the values n and m in the type to be referenced in the body of the code, like lam x: Int [n,m]. n. What do you think?

Do you know about dependent typing, i.e. systems where values can appear in types? For instance, in the dependently typed language Agda, one can define an inductive datatype Vec A n representing one-dimensional vectors of length n. Here, n is an ordinary value, but it is allowed to appear in the type to specify the size. Then we can write functions like lookup : ∀ {n} → Vec A n → Fin n → A, where Fin n is similar to the type Int in [0,n) in your example. I'm not suggesting we should have full dependent types, as typechecking gets hard with them, but perhaps it can give some guidance.
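For reference, the Agda-style definitions mentioned above can be sketched in Lean 4 (a sketch for illustration only, not a suggestion for Miking's syntax):

```lean
-- A length-indexed vector: the length n is a value appearing in the type.
inductive Vec (A : Type) : Nat → Type where
  | nil  : Vec A 0
  | cons : {n : Nat} → A → Vec A n → Vec A (n + 1)

-- `Fin n` (built into Lean) is a Nat packaged with a proof that it is
-- below n, playing the role of `Int in [0,n)` above.
-- Matching only on cons is exhaustive: nil cannot have type Vec A (n + 1).
def head {A : Type} {n : Nat} : Vec A (n + 1) → A
  | .cons a _ => a
```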

johnwikman commented 1 year ago

> > Tensors, as we want them, are to be fixed in size and (optionally?) mutable. A suggestion from my side would be to introduce arrays instead. In the normal case, they are just 1-dimensional, but the syntax would allow for arrays with arbitrary dimension.

> Would this be any different from tensors, except for the terminology used?

The key difference would be that the rank is static for the multidimensional array type; the rank of a tensor is not necessarily known at compile time. Otherwise the behavior would be the same. Though I would like to have a modifier to specify mutability, e.g. passing an array as input knowing that none of its elements will ever be modified. Maybe use something similar to Rust-style mut notation to indicate that an array may have its elements modified?
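A hypothetical sketch of what such a mutability marker could look like (the mut keyword here is an assumption, borrowed from Rust):

```mc
-- Hypothetical syntax: without mut, the elements are read-only.
let sum : all n. Float [n] -> Float = ... in
-- With mut, the callee may modify the elements in place.
let scaleInPlace : all n. Float -> mut Float [n] -> () = ... in
```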

> For the syntax, how about having a type constructor Arr[n,m,...]? We would have for instance Arr[3] Int representing the type of 3-element integer arrays, or the type Arr[10,10] Float representing 10x10 Float matrices.

Maybe, but I am not sold on this part. I would prefer notation that avoids names such as Tensor[...] or Arr[...] a, etc., like we have with sequence types. I have never come to terms with this kind of notation in other languages that use it, but maybe this is the compromise you reach when you iron out the details. I am undecided here.

> For the dimension types, I think it seems strange that [m,n] can appear both as a type of its own (as in Option [m,n]) and as a 'dimension specifier' (as in Int [m,n]). I see these two as two fundamentally different usages, that don't necessarily need to be coupled together (and in addition, the syntax is unfortunate: the two types have completely different meanings yet look the same). I don't see a real utility of having [m,n] as a standalone type at the moment... When would the function lam x: [n,m]. getDim0 x be useful? Instead, we could allow the values n and m in the type to be referenced in the body of the code, like lam x: Int [n,m]. n. What do you think?

It is not so much that I want it to be a standalone type to be used in type arguments, but rather that I want to avoid special exceptions in how we handle type application. I used Option [m,n] as an edge case to illustrate that the semantics would still work even in a nonsensical situation, rather than being a special exception we would have to account for.

Where I think it could be useful is in cases where you need to have variable dimensionality in your type. As an example:

type NdArrayOp d
con NdArrayAdd : (Float d -> Float d -> Float d) -> NdArrayOp d
con NdArrayFlatAccess : (Int -> Float d -> Float) -> NdArrayOp d

type NdObj d
con NdArrayA : Float d -> NdObj d
con NdArrayS : Float -> NdObj d

let mySingleApply : NdArrayOp d -> Float d -> NdObj d = lam op. lam a.
  switch op
  case NdArrayAdd op then
    NdArrayA (op a a) -- add the array to itself, doubling each element
  case NdArrayFlatAccess op then
    NdArrayS (op 0 a) -- get the first element in flat order
  end

This is probably a bit too arbitrary even for a toy example, but I do believe there are cases where you would like things like array dimensionality to be an argument to types. Then getDim0 would probably be quite useful, though it might need to look a little different. A more realistic case I am thinking of is neural networks and how their layers are chained together; they could be nicely abstracted using ADTs with the input and output dimensions as type arguments.
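For instance, layers could be typed by their input and output dimensions, so that only compatible layers can be chained (a hypothetical sketch in the same speculative style as the example above):

```mc
-- Hypothetical sketch: a layer with din inputs and dout outputs.
type Layer din dout
con Dense : (Float [dout,din], Float [dout]) -> Layer din dout

-- Chaining two layers only type checks when the dimensions line up.
let chain : Layer a b -> Layer b c -> Layer a c = ...
```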

I agree that the syntax becomes a bit unfortunate, as you have to know the original type to tell whether [m,n] forms an array or is an argument to the type. This risk of confusion would be nice to solve, but I do not know a good way to do it. Prefix notation looks strange to me, but it would at least solve it.

> Do you know about dependent typing, i.e. systems where values can appear in types? For instance, in the dependently typed language Agda, one can define an inductive datatype Vec A n representing one-dimensional vectors of length n. Here, n is an ordinary value, but it is allowed to appear in the type to specify the size. Then we can write functions like lookup : ∀ {n} → Vec A n → Fin n → A, where Fin n is similar to the type Int in [0,n) in your example. I'm not suggesting we should have full dependent types, as typechecking gets hard with them, but perhaps it can give some guidance.

No, I didn't, but thanks for pointing it out! This is something I will probably look into when I have the time :)