Monomorphisation - Githubissues

Box works like this: We have some code, code, which represents some function body. When we write a function such as

f x = code

when we evaluate f x, we get a hold of the statements that make up the functions body. There is nothing in the emitted code that says that this should be a separate procedure that can be invoked.

body-stmts

When we instead define our function like this:

f = box "f" ["x"] $ \x -> do
  code

box helps us by making it explicit that this is a procedure, by inserting some extra meta-statements in the code that delimits the procedure scope.

Procedure "f"
Argument "x" {- and the concrete value it was applied to -}
body-stmts
Result ()

This is nice as we can now 'call' procedures in the generated C code, while having the Haskell code still look like any normal function application. The alternative would be to define some type of functions, and having a special app :: (a :-> b) -> a -> b combinator for applying functions to arguments.

One kink about this is that if you try to generate code for a recursive function, you will never terminate. At each recursive application we will run into the entire functions definition again.

f = box "f" ["x"] $ \x -> do
  fork [ f x ]

becomes

Procedure "f"
Argument "x"
Fork [ Procedure "f"
       Argument "x"
       Fork [ Procedure "f"
              ...
              Result ()
            ]
       Result ()
     ]
Result ()

While we have now gained a very pleasant way of defining and applying functions, we end up with an infinite program instead. To combat this, we have defined a second datatype to represent programs, which uses a different set of statements. For instance, the constructor for fork in that AST is simply Fork :: [(String, [(Either SSMExp Reference)]] -> Stm. It is not recursive at all, and simply represents that we wish to call a function with this name with these arguments, which can be either expressions or references. The types of these arguments are also recorded.

While we do the transpilation from the initial, infinite datatype to this more flatter one, we also produce a map funs :: Map String Procedure. Procedure looks like this:

data Procedure = Procedure
  { -- | Name of the procedure.
    name      :: String
    -- | Parameter names and types of the procedure.
  , arguments :: [(String, Type)]
    -- | Statements that make up this procedure.
  , body      :: [Stm]
  }

This map simply maps function names to their definition. When a program is later turned into a C-file, every procedure in this map is compiled to a C procedure, and at fork sites an ordinary procedure call is generated. This map is touched at every fork statement by checking the first statement of the forked code (which should be a Procedure n meta statement). If the name in the Procedure constructor has not been seen before, the body of that function will be recursively transpiled and then turned into a Procedure value, and then an entry will be made in the map to reflect that we have now seen and recorded this function. If it has already been seen, we do nothing.

Now comes the issue of polymorphic functions. If we write this function:

f :: Ref a -> Exp a -> SSM ()
f = box "f" ["x", "y"] $ \x y -> do
  x <~ y -- assignment

what code should we generate? When we assign values to references in the generated code, we need to specify the type of the thing we are assigning a value to. Here we are writing a polymorphic function that can be applied at any type, so which type should the generated code talk about? The issue is that we have no way of being polymorphic in the generated code, so this can not be done.

Monomorphization

Monomorphization is a compilation technique where the type at which a function is applied is recorded, and if that function is polymorphic, a specialized version of the function is made. This pleases the type checker.

id :: a -> a
id x = x

f = id 3
g = id True

-- after monomorphization becomes
idint :: Int -> Int
idint x = x

idbool :: Bool -> Bool
idbool x = x

f = idint 3
g = idbool True

After this transformation there are no polymorphic functions left in the program. The price we pay is that we now essentially have two identical functions that we need to generate code for, which will occupy our precious, scarce memory on our IoT devices.

If this is a price that we are willing to pay, I believe we can make this happen very easily. The only thing we need to do is to change box to also accept a list of Types, to reflect the type of the application. We already have a typeclass SSMType that has a function typeOf :: proxy a -> Type, which marshals a Haskell type to the corresponding representation in our EDSL. If we now change box & Procedure to behave like this:

-- | I want to emphasize again that the box crap can probably be generated by a plugin, so hopefully our end users won't have to worry about this.
f :: Ref a -> Exp a -> SSM ()
f = box "f" [typeOf (Proxy @a), typeOf (Proxy @a)] ["x", "y"] $ \x y -> do
  x <~ y -- assignment

Two applications of f would then produce different code if the types of the arguments are different, while the code is still the same if the types are the same.

x <- var 0     -- Ref Int32
y <- var true' -- Ref Bool

fork [ f x 5
     , f y false'
     ]

-- the generated code would look like this
Fork
  [ [ Procedure "f" [Ref TInt32, TInt32]
    , Argument "x" {- some val -}
    , Argument "y" {- some val -}
    , SetRef "x" (Lit TInt32 (LInt32 5))
    , Result ()
    ]
  , [ Procedure "f" [Ref TBool, TBool]
    , Argument "x" {- some val -}
    , Argument "y" {- some val -}
    , SetRef "x" (Lit TBool (LBool False))
    , Result ()
    ]
  ]

When we transpile the above statements to the flat version which replaces the recursive Fork constructor with the non recursive one, we currently only inspect the name of the procedure to determine if we've seen it before. Now I suggest that we also check the type of the arguments to the procedure. If the (name, [type]) pair has not been seen before, specialize a new version of the function and put it in the map, otherwise, just do nothing. The name of this specialized version would be "f" and some generated suffix, to make it distinct from the other versions of f. We can derive a simple name from the types to reflect the type in the name.

This is a simple change that would make it specialize functions by itself, with the machinery that is already in place. We do not need to add a separate compiler pass or anything. Function applications in Haskell still are polymorphic, but behind the scenes, functions would be specialized. In #19 I wrote a bit about polymorphic functions and how we can not write polymorphic programs currently.

A big enhancement with embedding a language in Haskell as opposed to writing a custom compiler frontend is that we can piggyback on the type system in place. Being restricted to writing monomorphic programs would be a real shame. Making use of Haskell's polymorphic type system is a killer feature that we would want.

Agreed! And if we can do this we can definitely implement the supervisor pattern you mentioned earlier.

I just had a meeting with @koengit where we talked a bit about this. Just appending type information might not be enough. We can describe issues and fixes like this:

Scenario 1

f = box "f" ["r"] $ \r -> do
  v <- deref r
  assign r (v + 1)

+ is overloaded so it can work for any numerical type. We solve this scenario with the system I described above, by recording types.

Scenario 2

f = box "f" ...
g = box "f" ...

Here we have an issue since we have two different functions in Haskell, but we are naming them the same. Even if they are applied to the same types the bodies of the two functions might be different. We can fix this by having a plugin that makes sure the proper names are inserted (and clearly documenting that the brave programmer that wishes to do this without a plugin must make sure to name things uniquely. Crap like the above example is disallowed).

Scenario 3

f1 = f where f = box "f" ...
f2 = f where f = box "f" ...

Code above is not necessarily wrong by the recipe we gave above. We can solve this by having the plugin also insert the line/column-numbers in the function name (essentially grabbing the source location. The current plugin already does this, but in another context).

Scenario 4

f a = box "f" ["x"] $ \x ->
  do ... x ... something with a ...

Here a appears as a parameter to the Haskell function f. Essentially, the body of f 3 and f 17 will be different, even if their names and types are the same. Not clear how to fix this. I am envisioning that the programmer will just write stuff like

f x a = do
  ...

-- plugin takes the above and converts to this
f = box "f" ["x","a"] $ \x a -> do
  ...

So this issue would only arise for the brave programmer that used box manually?

Scenario 5

f = box "f" ["x"] $ \x ->
  do ... x ... ?a ...

I will show you the ImplicitParams extension today, where we can use variables like ?a. This variable does not appear in the definition at all, we just wildly use it. No idea how to tackle this, currently.

There are some semantic approaches we can take to monomorphizing also. When we see a function f with body b we can check if b matches the body we've seen before. Works perfectly for functions that are not recursive. For recursive ones there is no guaranteed way that this will work, but perhaps some heuristic is fine, such as traversing the function body to a depth of 5 or something?

Tryhard solution is to define every function that you can fork globally before calling your program, and then passing that in as some parameter to your program. This is not a solution I think is good, as it removes the nice function creation/application we have right now.

Something I don't know a lot about but which @koengit is familiar with is observational sharing, where you can have two functions

f = <expr>
g = <expr> -- same body as f

With observable sharing you can tell that f =?= f but not that f =?= g. I need to read up on this approach before I try to describe some examples in more detail.

ssm-lang / Scoria

Monomorphisation #25