janpaulpl opened 1 year ago
For the sake of hopefully advancing the speed of this implementation, I'm going to describe what the current inferencing algorithm looks like and what the main challenges are!
Currently, the algorithm for type inference is based on unification, in which we view type signatures for terms as equations; each equation contains variables that have to be solved for, i.e. unknown type placeholders.
These type equations (as they're defined in the Issue) are called Constraints, and for our full program we'll have a Constraint Set which has to go through Unification, as in, we unify every constraint to a concrete type.
The algorithm for calculating the constraint set is made up of judgment rules; for example, the constraint set for if statements comes from the following judgment:
```
C' = C1 U C2 U C3 U {T1 = Bool, T2 = T3}
-------------------------------------------------
G |- if t1 then t2 else t3 : T2 | X1 U X2 U X3, C'
```
A quick breakdown of what these symbols mean: `G |- x : T1` means that `x` has type `T1` under the context `G`. `G` is, practically, the symbol table that records the defined variables and their types. `X` and `C` are also context symbols: `X` is the set of type placeholders generated during the algorithm, and `C` is the already-calculated set of constraints. So `G |- x : T1 | X C` means that `x` has type `T1` under the symbol table `G`, with the already-calculated constraint set `C` and the additional type placeholders generated by our algorithm stored in `X`. Because Scala is a fun functional language, we can view this algorithm as a function that, given a lookup table `G`, takes `x` as an input and produces `X` and `C` as a tuple output.
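Viewed that way, the judgment can be sketched as a Scala function (a sketch only; `Ty`, `Term`, and `infer` are illustrative stand-ins for the enums defined later, with just enough cases to be runnable):

```scala
// Sketch: the judgment `G |- t : T | X C` as a function from a symbol
// table G and a term to a (type, placeholders X, constraints C) tuple.
enum Ty:
  case Base(n: String)   // a concrete monotype like Int
  case Forall(n: String) // a generated type placeholder

enum Term:
  case Lit(value: Int)
  case Ref(name: String)

def infer(g: Map[String, Ty], t: Term): (Ty, Set[String], Set[(Ty, Ty)]) =
  t match
    // literals have a known type and generate no placeholders or constraints
    case Term.Lit(_) => (Ty.Base("Int"), Set.empty, Set.empty)
    // variables are looked up in the symbol table G
    case Term.Ref(name) => (g(name), Set.empty, Set.empty)
```

For instance, `infer(Map("x" -> Ty.Base("Bool")), Term.Ref("x"))` produces the tuple `(Ty.Base("Bool"), Set(), Set())`.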
For the implementation, the canonical way of introducing unification in Scala seems to be using an enum. I'll explain later why this is awful for our case, but I think the re-formatting to fit our needs shouldn't be that difficult.
We need the following information in our `enum Type`:

- `Type.Base` represents monotypes: types that can be represented by a single symbol. In our case, these would include `Int`, `Bool`, and `Bytes`. `Bool` is included because, although PyTEAL doesn't seem to use it as a primitive, we have to convert to a C0 IR which does use `Bool`s.
- `Type.Constructor` represents function types, from one type to another.
- `Type.Forall` represents the placeholders.

The format would look something like the following:
```scala
enum Type:
  case Base(n: String) extends Type
  case Constructor(from: Type, to: Type) extends Type
  case Forall(n: String) extends Type
end Type

object TypePrimitives:
  val Int: Type = Type.Base("Int")
  val Bool: Type = Type.Base("Bool")
  val Bytes: Type = Type.Base("Bytes")
end TypePrimitives
```
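As a quick sanity check on this encoding, a curried two-argument function type like `Int -> Int -> Int` (the type of `Add` later in this post) nests `Constructor` to the right (repeating the enum here so the snippet stands alone):

```scala
enum Type:
  case Base(n: String) extends Type
  case Constructor(from: Type, to: Type) extends Type
  case Forall(n: String) extends Type
end Type

object TypePrimitives:
  val Int: Type = Type.Base("Int")
  val Bool: Type = Type.Base("Bool")
  val Bytes: Type = Type.Base("Bytes")
end TypePrimitives

// Int -> Int -> Int is Int -> (Int -> Int): the arrow associates to the right.
val addType: Type =
  Type.Constructor(
    TypePrimitives.Int,
    Type.Constructor(TypePrimitives.Int, TypePrimitives.Int)
  )
```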
Now, this is where it gets tricky: we also have to define our AST as an enum! I'm not really sure if there's a way to extend our untyped AST to our typed AST `enum`, or if we have to change the structure of the `enum` to fit an `object`. Either way, this is what it would look like:
```scala
enum SyntaxNode:
  case If(condition: SyntaxNode, trueClause: SyntaxNode, falseClause: SyntaxNode)
```
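To check that the enums line up with the if-statement judgment above, the constraint-generation case for `If` might look like this (a sketch; the `IntLit` case and the `constraints` function are hypothetical additions just to make the snippet self-contained):

```scala
enum Type:
  case Base(n: String)
  case Forall(n: String)

enum SyntaxNode:
  case If(condition: SyntaxNode, trueClause: SyntaxNode, falseClause: SyntaxNode)
  case IntLit(value: Int) // hypothetical extra case so the example runs

// G |- if t1 then t2 else t3 : T2 | X1 u X2 u X3, C'
// where C' = C1 u C2 u C3 u {T1 = Bool, T2 = T3}
def constraints(
    g: Map[String, Type],
    t: SyntaxNode
): (Type, Set[String], Set[(Type, Type)]) =
  t match
    case SyntaxNode.IntLit(_) => (Type.Base("Int"), Set.empty, Set.empty)
    case SyntaxNode.If(t1, t2, t3) =>
      val (ty1, x1, c1) = constraints(g, t1)
      val (ty2, x2, c2) = constraints(g, t2)
      val (ty3, x3, c3) = constraints(g, t3)
      // the two new constraints: condition is Bool, branches agree
      val cNew = Set((ty1, Type.Base("Bool")), (ty2, ty3))
      (ty2, x1 ++ x2 ++ x3, c1 ++ c2 ++ c3 ++ cNew)
```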
Now we have to define the Constraint Set. It has three members: first, the calculated type signature; second, the type placeholders (whatever unknown `t#` we get); and third, the constraints themselves (for example, the information that `0 -> Int`). Finally, we implement the symbol table as a mutable map from `String -> Type`:
```scala
import scala.collection.mutable

opaque type Binding = mutable.Map[String, Type]
```
I'm not sure if we need to implement a `freeVariable` identifier, because we don't have globals (I think???), so everything can be identified in scope. We can discuss the details of this in person, but hopefully the pipeline is clear!
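For the unification step itself, a minimal substitution-based solver over the constraint set could look like the following (a textbook-style sketch, not existing gvc0 code; the enum is repeated so the snippet stands alone, and an occurs check rejects infinite types):

```scala
enum Type:
  case Base(n: String)
  case Constructor(from: Type, to: Type)
  case Forall(n: String) // a type placeholder / variable

// Does placeholder v occur inside t? Needed to reject infinite types.
def occurs(v: String, t: Type): Boolean = t match
  case Type.Forall(n)         => n == v
  case Type.Constructor(a, b) => occurs(v, a) || occurs(v, b)
  case Type.Base(_)           => false

// Replace every occurrence of placeholder v with `to` inside `in`.
def subst(v: String, to: Type, in: Type): Type = in match
  case Type.Forall(n) if n == v => to
  case Type.Constructor(a, b)   => Type.Constructor(subst(v, to, a), subst(v, to, b))
  case other                    => other

// Solve a constraint list by substitution; returns a map from placeholder
// names to types, or None if the constraints are unsatisfiable.
def unify(cs: List[(Type, Type)]): Option[Map[String, Type]] = cs match
  case Nil                      => Some(Map.empty)
  case (s, t) :: rest if s == t => unify(rest)
  case (Type.Forall(v), t) :: rest if !occurs(v, t) =>
    val rest2 = rest.map { case (a, b) => (subst(v, t, a), subst(v, t, b)) }
    unify(rest2).map { sol =>
      sol + (v -> sol.foldLeft(t) { case (acc, (n, ty)) => subst(n, ty, acc) })
    }
  case (s, Type.Forall(v)) :: rest => unify((Type.Forall(v), s) :: rest)
  case (Type.Constructor(a1, b1), Type.Constructor(a2, b2)) :: rest =>
    unify((a1, a2) :: (b1, b2) :: rest)
  case _ => None // base-type mismatch or occurs-check failure

val example = unify(List((Type.Forall("t1"), Type.Base("Int"))))
// example == Some(Map("t1" -> Type.Base("Int")))
```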
Concerning the data representation from the untyped AST: we will keep the untyped/typed AST in a single file. We just have temporary type values in the untyped AST, which are then "hardcoded" in the typed enum, and this information is extended back into the "untyped" AST, gradually typing it.
Issue
Currently, the gvc0 pipeline assumes static typing for the AST. PyTEAL uses a mix of static typing for TEAL types and dynamic typing for pythonic declarations. The static typing for TEAL types such as
a = Int(2)
can be directly translated as `int a = 2;` for our untyped AST in the parser. However, for pythonic declarations such as
add_expr = Add(a, b)
we have to take the node from the untyped AST to a typed AST. The type inference algorithm runs through the following steps:
[ ] 1. Assign a type or type variable to the expression and each subexpression. For known expressions, like +, - and so on, use the types known for these expressions. Otherwise, use type variables: placeholders for unknown types.
[ ] 2. Generate a set of constraints on types, using the parse tree of the expression. These constraints are something like this: if a function is applied to an integer, then the type of its argument is integer.
[ ] 3. Solve these constraints by unification — an algorithm for solving equations based on substitutions.
Here's an example. We have information from the untyped AST for an incrementing function `def`; we get the resulting tree (where the `@` stands for function application), and then the new tree resulting from our first step.
The set of constraints on types can now be generated:

(i) `t3: Int` from `2.instanceOf(Int)`
(ii) `t2: Int -> Int -> Int` because `Add.instanceOf(Int -> Int -> Int)`
(iii) `x` stays type `t1`
(iv) We can now type the function calls. `def` has a type based on its parameters and return calls: `def:: t0: t1 -> t5`, `t5:: t4: t1 -> t5`, and `t4:: t2: t3 -> t4`.

The final step lists the constraints from the previous step:
We can now infer a 6th constraint, and finally, from constraints 6 and 2, we can infer the remaining types. The final typed AST for `def inc(x): Add(x, 1)` types `inc` as `Int -> Int`.
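The constraints listed above can also be checked mechanically. Below, a minimal substitution-based unifier (a textbook-style sketch, repeated in full so the snippet is self-contained; `unify`, `tv`, `fn`, etc. are illustrative names, not gvc0 code) is applied to the five constraints for `inc`:

```scala
enum Type:
  case Base(n: String)
  case Constructor(from: Type, to: Type)
  case Forall(n: String)

def occurs(v: String, t: Type): Boolean = t match
  case Type.Forall(n)         => n == v
  case Type.Constructor(a, b) => occurs(v, a) || occurs(v, b)
  case Type.Base(_)           => false

def subst(v: String, to: Type, in: Type): Type = in match
  case Type.Forall(n) if n == v => to
  case Type.Constructor(a, b)   => Type.Constructor(subst(v, to, a), subst(v, to, b))
  case other                    => other

def unify(cs: List[(Type, Type)]): Option[Map[String, Type]] = cs match
  case Nil                      => Some(Map.empty)
  case (s, t) :: rest if s == t => unify(rest)
  case (Type.Forall(v), t) :: rest if !occurs(v, t) =>
    val rest2 = rest.map { case (a, b) => (subst(v, t, a), subst(v, t, b)) }
    unify(rest2).map { sol =>
      sol + (v -> sol.foldLeft(t) { case (acc, (n, ty)) => subst(n, ty, acc) })
    }
  case (s, Type.Forall(v)) :: rest => unify((Type.Forall(v), s) :: rest)
  case (Type.Constructor(a1, b1), Type.Constructor(a2, b2)) :: rest =>
    unify((a1, a2) :: (b1, b2) :: rest)
  case _ => None

// Shorthand: tv for placeholders, fn for the arrow constructor.
def tv(n: String): Type        = Type.Forall(n)
def fn(a: Type, b: Type): Type = Type.Constructor(a, b)
val IntT: Type                 = Type.Base("Int")

// The constraints listed above for `def inc(x): Add(x, 1)`:
val incConstraints = List(
  (tv("t2"), fn(tv("t3"), tv("t4"))),  // t4:: t2: t3 -> t4
  (tv("t4"), fn(tv("t1"), tv("t5"))),  // t5:: t4: t1 -> t5
  (tv("t0"), fn(tv("t1"), tv("t5"))),  // def:: t0: t1 -> t5
  (tv("t3"), IntT),                    // t3: Int
  (tv("t2"), fn(IntT, fn(IntT, IntT))) // t2: Int -> Int -> Int
)

val solution = unify(incConstraints)
// Solving gives t0 = Int -> Int, i.e. inc : Int -> Int
```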