morganstanley / hobbes

A language and an embedded JIT compiler
http://hobbes.readthedocs.io/
Apache License 2.0

A step towards dependent types #362

Open adam-antonik opened 4 years ago

adam-antonik commented 4 years ago

I'm upset that I have to write Unqualifiers in C++ and not in hobbes. So upset I made this mess. This adds two Unqualifiers, TypeValueLower and TypeApply.

Let's start with TypeValueLower. It lowers a TLong to a long, TString to a [char], and (quote a) to a. This allows us to write recordHeadLabel as

recordHeadLabel :: (r={h*t}) => r -> [char]
recordHeadLabel r = typeValueLower(((\_.unsafeCast()) :: (r={lbl:h*t}) => r -> lbl)(r)) :: (TypeValueLower _ [char]) => [char]

Great, but we still can't write recordHeadValue. Hence TypeApply.

TypeApply a `f` b c .. n

will lower the types after the second to ExprPtrs, apply them to f, and raise the result to a MonoTypePtr stored in a. For the arguments: TLongs lower to longs, TStrings to [char], (quote a) to a, and an arbitrary non-variable type to an encoding as a recursive variant named Type that stores that type information. For the result: long is raised to TLong, [char] to TString, Type to its MonoType counterpart, and anything else to a quoted expression.

So we can do this

data Z a = ()
instance Show (Z a) where
  show x = "Z " ++ show(typeValueLower(unsafeCast(()) :: a))

zZero = unsafeCast(()) :: ((Z false))
zSucc :: (TypeApply b `\x.x+1` a) => ((Z a)) -> ((Z b))
zSucc = unsafeCast

Then

> zSucc(zZero)
Z 1

But also

instance (TypeApply c `(+)` a b) => Add (Z a) (Z b) (Z c) where
  l + r = unsafeCast()

instance (TypeApply c `(*)` a b) => Multiply (Z a) (Z b) (Z c) where
  l * r = unsafeCast()

mkZ :: (TypeApply b `\x.convert(x)::long` a) => a -> ((Z b))
mkZ _ = unsafeCast(())

Using the quote conversion, we can do

> mkZ(`2`) + mkZ(`3`)
Z 5

Let's now do something with the ExprPtr representation of a type, and write recordHeadValue

// hide any fields that aren't the head, and rename the head field to "head"
rhvInner :: ([[char] * Type], int, [[char] * Type], bool) -> [[char] * Type]
rhvInner ms idx nms sf = if idx == size(ms) then nms else
  if sf then
    rhvInner(ms, idx+1, nms ++ [(".p" ++ ms[idx].0, ms[idx].1)], sf) 
  else
    if (size(ms[idx].0) > 1 and ms[idx].0[0:2] == ".p") then
      rhvInner(ms, idx + 1, nms ++ [ms[idx]], sf) 
    else rhvInner(ms, idx + 1, nms ++ [("head", ms[idx].1)], true)

rhv :: Type -> Type
rhv tyd = match unroll(tyd) with
  | |Record=ms| -> roll(|Record=rhvInner(ms, 0, [], false)|)
  | _ -> roll(|Prim=("unit", nothing::(() + Type))|)

rhvt :: (TypeApply b `rhv` a) => a -> b
rhvt = unsafeCast

recordHeadValue_ :: (r={h*t}) => r -> h
recordHeadValue_ r = rhvt(r).head

This is a fairly large lump of code, but it would allow moving a fair amount of the logic currently written in C++ Unqualifiers into the language itself.

It hasn't been tested thoroughly, and needs tidying, but I wanted to get your view on if this is worth pursuing.

And we need some caching in here: that recordHeadLabel example above takes a noticeable amount of time to run each time.

kthielen commented 4 years ago

I am sorry if the Unqualifier business in C++ really did upset you. The main reason that I added those was to connect hobbes to existing systems where we could stage evaluation; structured data files and network connections are good examples of where you'd want something like that (one stage to load a file and check what types it has in it, then another stage to evaluate its data).

I think that moving hobbes to a dependent type system is a really great idea. There are a lot of things that we could do with such a type system. It doesn't even have to be homotopy type theory, but we can always benefit from more descriptive types.

One thing that I struggle with a little bit is how to reconcile qualified types (like we have here where constraints in types get eliminated with Unqualifiers) with dependent types, so your proposal here is well timed and very interesting.

adam-antonik commented 4 years ago

I am sorry if the Unqualifier business in C++ really did upset you.

I'm not being serious. Unqualifiers are great; I use one as a type provider for SQL queries, for instance. But it does do my head in a bit moving back and forth between hobbes and C++, and since quote types existed already, I thought I sh^H^Hcould abuse them. I like the staged evaluation, I just want to do it in one place. With this approach, I can in the first stage create the type signature from the TString or whatever, then in the second use type classes on it to create the value I need.

In the file example, if we add a function that takes the deserialized type rep and shoves it through the MonoTypePtrToExpr monstrosity, we could do all of those fun file things entirely in staged hobbes.

kthielen commented 4 years ago

Let's now do something with the ExprPtr representation of a type, and write recordHeadValue

// hide any fields that aren't the head, and rename the head field to "head" rhvInner :: ([[char] * Type], int, [[char] * Type], bool) -> [[char] * Type] [...] And we need some caching in here: that recordHeadLabel example above takes a noticeable amount of time to run each time.

This gets at something that I think is really important and subtle about PLs like this, which is this phase distinction between types and terms (and I guess it goes all the way up the universe hierarchy, so types are prior to terms but "kinds" are prior to types, etc).

The reason that "recordHeadLabel" can be very efficient in its Unqualifier right now is that it's decided at compile-time. That's a kind of caching that I think we could capture in the language, but it's subtle. Having the hard division between types and terms is a little bit of a crutch because it gives us something to point to in order to decide when something will run. But compilers have always done a similar thing for terms where "constants" are involved (e.g. rewriting 7*7-7 at compile-time to 42). Maybe the answer is that we treat these Type constructions like constants (which they are) and have some better story for how to evaluate constants early.

adam-antonik commented 4 years ago

Let's now do something with the ExprPtr representation of a type, and write recordHeadValue // hide any fields that aren't the head, and rename the head field to "head" rhvInner :: ([[char] * Type], int, [[char] * Type], bool) -> [[char] * Type] [...] And we need some caching in here: that recordHeadLabel example above takes a noticeable amount of time to run each time.

This gets at something that I think is really important and subtle about PLs like this, which is this phase distinction between types and terms (and I guess it goes all the way up the universe hierarchy, so types are prior to terms but "kinds" are prior to types, etc).

The reason that "recordHeadLabel" can be very efficient in its Unqualifier right now is that it's decided at compile-time. That's a kind of caching that I think we could capture in the language, but it's subtle. Having the hard division between types and terms is a little bit of a crutch because it gives us something to point to in order to decide when something will run. But compilers have always done a similar thing for terms where "constants" are involved (e.g. rewriting 7*7-7 at compile-time to 42). Maybe the answer is that we treat these Type constructions like constants (which they are) and have some better story for how to evaluate constants early.

The recordHeadValue_ I provide is still executed at compile time (I meant the lag in the repl); it's just that compilation takes longer, as we call into LLVM as part of the compile. The problem is the lack of fundeps. This could be solved if I wrote TypeApply a f (arg1 arg2 ...) and added the fundep 1,2 -> 0, but that makes me sad.

This does not abandon the multi-stage approach; it just allows writing the stages in hobbes.

Specifically, once compiled, latency is unaffected. Compilation takes longer, which is noticeable in the repl.

When compiling one line that uses an unqualifier, the unqualifier is called multiple times, and each call here involves a call into LLVM. Caching on MonoTypePtr inputs would remove a lot of this and speed up compilation, but would not affect run-time speed either way. At runtime, recordHeadValue_ is just a cast and a Proj, no worse than the standard one.

kthielen commented 4 years ago

Oh I see, that's awesome! Yeah it should only need to compile and run once then, and maybe we turn down back-end optimizations a little (if the LLVM step is very expensive).

adam-antonik commented 4 years ago

Oh I see, that's awesome! Yeah it should only need to compile and run once then, and maybe we turn down back-end optimizations a little (if the LLVM step is very expensive).

Yeah, LLVM is not fast. Maybe your standalone JIT is what we need. Sadly, looking at :x we see that LLVM doesn't inline my version, but the timings aren't that different:

> :e recordHeadValue_({a=2})
average over 1000 runs: 27ns
minimum runtime: 25ns
maximum runtime: 104ns
> :e recordHeadValue({a=2})
average over 1000 runs: 27ns
minimum runtime: 25ns
maximum runtime: 92ns

adam-antonik commented 4 years ago

BTW, I'm sorry, but it is 9pm here on a Friday and I'm not going to fix the build now. I hope you can try out whatever the nix version is, which does build, to play around with this.

kthielen commented 4 years ago

No problem, it was a small issue on some platforms/compilers and I fixed it.

We use hobbes in a variety of different projects with different levels of strictness, so I've tried to capture as many of those as I can to avoid finding out about them when those projects can't use a new version of hobbes.

I still need to look at this more closely, but I think that it's a very good idea and probably a step in the right direction.

adam-antonik commented 4 years ago

Just for fun, here's an example using this that isn't just a re-implementation of existing C++ code. This is a very simple printf-type thing, where we use a quoted string to generate a Type that then processes the arguments correctly.

// helpers for building Type values: primitives, arrays, type-level strings, and record/tuple types
primty x = roll(|Prim=(x, nothing :: (() + Type))|) :: Type
arrayty x = roll(|Array=x|) :: Type
tstringty x = roll(|TString=x|) :: Type
mktuplety x = roll(|Record=map(\v.(".f"++show(v.0), v.1), zip([0L..size(x)-1], x))|) :: Type

// split the format string into (literal prefix, argument field label, argument type) triples
process :: ([char], int) -> ^x.(() + ((Type * Type * Type) * x))
process x idx = match x with
 | '(?<pre>[^%]*)(?<fmt>%s)(?<suf>.*)' -> cons((tstringty(pre), tstringty(".f" ++ show(idx)), arrayty(primty("char"))), process(suf, idx+1))
 | '(?<pre>[^%]*)(?<fmt>%d)(?<suf>.*)' -> cons((tstringty(pre), tstringty(".f" ++ show(idx)), primty("int")), process(suf, idx+1))
 | _ -> cons((tstringty(x), primty("unit"), primty("unit")), nil())

// assemble the triples into a single record Type; the PrintF instances below match on its structure
doPrintf x = let t = toArray(process(x, 0)) in mktuplety(map(\x.mktuplety([x.0, x.1, x.2]), t))

class PrintF a t where
  printf_ :: (a, t) -> ()

instance (a=((str * () * ()) * ())) => PrintF a _ where
  printf_ r t = putStr(typeValueLower(unsafeCast(())::str))

instance (a=((str * acc * [char]) * t)) => PrintF a _ where
  printf_ r t = do{putStr(typeValueLower(unsafeCast(())::str)); putStr(fieldValue(t)::(_/acc::_)=>_); printf_(tupleTail(r), t);}

instance (a=((str * acc * int) * t)) => PrintF a _ where
  printf_ r t = do{putStr(typeValueLower(unsafeCast(())::str)); putStr(showInt(fieldValue(t)::(_/acc::_)=>_)); printf_(tupleTail(r), t);}

mkPrintFType :: (TypeApply a `doPrintf` x) => x -> a
mkPrintFType x = unsafeCast()

printf fmt args = printf_(mkPrintFType(fmt), args)

Then

> printf(`"Hello %s %d!"`, ("World", 3))
Hello World 3!

Feeding arguments of the wrong type into the tuple is a compile-time error.

kthielen commented 4 years ago

That printf example is excellent. I see how it stages evaluation so that format string processing happens at compile-time (and the format string type is equivalent to the unit type in run-time representation, so it has no cost at run-time).

I wonder if the quote brackets are worthwhile or if it'd be better to avoid them (I'm just wondering, not saying that quote brackets aren't worthwhile). I added that quote syntax to do remote code execution, but have heard differing opinions about how clear it is to use (and have had a few people tell me that they have a hard time understanding "what happens when").

One good thing about the quote brackets is that it makes staging obvious (and people reading the code can have something to point to).

Another way we could do it would be to make something like a pi-type where named arguments could be implicitly quoted.

It's kind of tangential, but I'm curious to know what you think of working with quoted expressions versus alternatives.

adam-antonik commented 4 years ago

a hard time understanding "what happens when"

where named arguments could be implicitly quoted

;) I don't think that's going to help matters.

I see that quotes were introduced for RPC, but I think they are underappreciated and could be more widely useful. (inputFile could take a quoted string as an argument, for instance, and remove the need for the specific unqualifier constraint boilerplate.)

As for your question, I'm afraid the answer is that, as a pragmatic end-user, I don't really mind. I made this PR because functionality I wanted was annoyingly close; all the pieces were there, they just needed joining. I don't have a well enough developed PLT sense of aesthetics to really be able to take an opinion. Quotes have the advantage of at least being very clear that something different is happening. Some modification to the language to make this more integrated might look nicer, but if it hides the staged nature of the computation, I, at least, might find it confusing. But again, I really don't know.

kthielen commented 4 years ago

That's good to hear, maybe the quote syntax is helpful as a visual cue that this part happens at an earlier stage.

This is a really good start and we will probably want to iterate on it for a while. I don't know if it's better to keep it open or merge it and iterate with future PRs (generally I tend to merge and iterate with new PRs).

There are cases for dependent types where you don't want the "quoted part" to go away at run-time. For example, you could have a dependent record type like {n:long, arr:[:t|n:]} where the n in the first field value appears in the type of the arr field (as the length of the array). This is basically what the hobbes variable-length array type does, but the dependent type is a more basic building block for that kind of thing -- "column storage" for example looks like {n:long, col0:[:t0|n:], col1:[:t1|n:], col2:[:t2|n:]} (ie: the length prefix is shared among all of the "column" arrays).

There are other cases like that where we can use dependent types to relate multiple values. Another useful one is for decidable equality, where a check x==y can give a type (x=y)+(x!=y) where the "shape" is equivalent to bool and the quoted variables have no run-time representation, but the result type carries information about the significance of the test (so we know either that x=y or that x!=y).

I hope that isn't discouraging to point out, but it's just to say that there's still some additional behavior we'd have to capture before we really said that we "have dependent types". This is a good step though.

kthielen commented 4 years ago

I think that we should add some tests around this, given that it's kind of a subtle feature and we want to capture the valuable things you've highlighted here. Maybe your examples could be turned into tests? That printf one especially is a great test, the type-level arithmetic also.

adam-antonik commented 4 years ago

That's good to hear, maybe the quote syntax is helpful as a visual cue that this part happens at an earlier stage.

This is a really good start and we will probably want to iterate on it for a while. I don't know if it's better to keep it open or merge it and iterate with future PRs (generally I tend to merge and iterate with new PRs).

There are cases for dependent types where you don't want the "quoted part" to go away at run-time. For example, you could have a dependent record type like {n:long, arr:[:t|n:]} where the n in the first field value appears in the type of the arr field (as the length of the array). This is basically what the hobbes variable-length array type does, but the dependent type is a more basic building block for that kind of thing -- "column storage" for example looks like {n:long, col0:[:t0|n:], col1:[:t1|n:], col2:[:t2|n:]} (ie: the length prefix is shared among all of the "column" arrays).

There are other cases like that where we can use dependent types to relate multiple values. Another useful one is for decidable equality, where a check x==y can give a type (x=y)+(x!=y) where the "shape" is equivalent to bool and the quoted variables have no run-time representation, but the result type carries information about the significance of the test (so we know either that x=y or that x!=y).

I hope that isn't discouraging to point out, but it's just to say that there's still some additional behavior we'd have to capture before we really said that we "have dependent types". This is a good step though.

Oh yes, this is far from full dependent types (see PR title ;)), but it allows us to write lots of things that dependent types would. The "quoted part" is available at run-time, through typeValueLower. Here is an array that knows its length at compile time; the size is taken from the type:

data SA n t = [t]

emptySA _ = unsafeCast([]::[t]) :: ((SA false t))

instance Array (SA n t) t where
  size a = typeValueLower(unsafeCast(()) :: n)
  element a i = (unsafeCast(a)::[t])[i]
  elements a l u  = (unsafeCast(a)::[t])[l:u]

pushSA :: (TypeApply b `\x.x+1` a) => ((SA a t), t) -> ((SA b t))
pushSA a v = unsafeCast((unsafeCast(a)::[t]) ++ [v])

This seems close to the functionality (but not the syntax) you mention above. (And we can add static lookup constraints.)

class SAL a i t | a i -> t where
  element_ :: (a, i) -> t

instance (TypeApply 1 `\i n.if i < n then 1L else 0L` i n) => SAL (SA n t) i t where
  element_ a i = (unsafeCast(a)::[t])[typeValueLower(unsafeCast(()) :: i)]

Then

x = pushSA(emptySA(), 3)
> element_(x, `0`)
3
> element_(x, `1`)
stdin:1,1-16: Unsatisfiable predicate: SAL (SA 1L int) (quote 1) a

My plan here is to work from what can easily be added, to allow a dependently typed program to be mechanically translated into something that compiles. Once everything can be done like that, the final step is just syntactic sugar.

adam-antonik commented 4 years ago

I do get what you say though; for instance, looping over an array and pushing elements onto such a static array does not work, let alone if we pushed to two arrays and then demanded proof they were equal. But by adding this limited type-level application here, I think it might become a bit clearer how to solve that in the current framework. The other option is to rewrite and just do it, but that seems like more work.

kthielen commented 4 years ago

printf(`"Hello %s %d!"`, ("World", 3))

Trying this example with an LLVM 3.5 debug mode build on macOS, for some reason the process hangs and eventually gets killed. But when I run it under lldb, it runs reasonably quickly and prints the expected result. Did you see anything like that?

ETA: Tried it with LLVM 6 debug mode, ran totally fine (much faster to compile actually). This might be an issue with 3.5, kind of annoying. I'll see if I can put a test in at least.

kthielen commented 4 years ago

I hope you don't mind me poking around your PR, but I added a unit test based on your printf example (changed to sprintf, so I get a value back I can validate). For new features like this, I think it's helpful to have some tests to show how it's supposed to work.

adam-antonik commented 4 years ago

printf(`"Hello %s %d!"`, ("World", 3))

Trying this example with an LLVM 3.5 debug mode build on macOS, for some reason the process hangs and eventually gets killed. But when I run it under lldb, it runs reasonably quickly and prints the expected result. Did you see anything like that?

ETA: Tried it with LLVM 6 debug mode, ran totally fine (much faster to compile actually). This might be an issue with 3.5, kind of annoying. I'll see if I can put a test in at least.

I'm afraid I haven't seen anything like that; somewhat to my surprise, this whole thing went very smoothly.

adam-antonik commented 4 years ago

I hope you don't mind me poking around your PR, but I added a unit test based on your printf example (changed to sprintf, so I get a value back I can validate). For new features like this, I think it's helpful to have some tests to show how it's supposed to work.

Please feel free to make whatever modifications make sense. I've got no great attachment to the code other than that I like what can be done with it. Totally agree that tests for this sort of thing are necessary. If the overall approach makes sense, I'll try to add a couple more when I get time.

kthielen commented 4 years ago

Are you happy with this change? I think that it's pretty small and isolated, basically just adding these two functions and a type definition. If you'd like to use it, I'm ok with merging it.

kthielen commented 4 years ago

Re: caching, I did something like this on an internal change and might be pushing it to GitHub fairly soon. You can rely on the fact that type structural equality implies pointer equality and vice versa, which makes it easy to cache by type.
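A minimal sketch of what that per-type cache could look like (C++, with MonoType, MonoTypePtr, and lowerTypeValue as stand-ins rather than the actual hobbes API): since structurally equal types share one pointer, the raw pointer works directly as the cache key.

#include <memory>
#include <unordered_map>

struct MonoType {};                             // stand-in for hobbes' MonoType
using MonoTypePtr = std::shared_ptr<MonoType>;  // stand-in for hobbes' MonoTypePtr
struct Lowered {};                              // stand-in for whatever the unqualifier produces

// placeholder for the expensive step (the part that calls into LLVM)
Lowered lowerTypeValue(const MonoTypePtr&) { return Lowered{}; }

// structural equality implies pointer equality, so the raw pointer is a valid cache key
const Lowered& cachedLowerTypeValue(const MonoTypePtr& t) {
  static std::unordered_map<const MonoType*, Lowered> cache;
  auto it = cache.find(t.get());
  if (it == cache.end()) {
    it = cache.emplace(t.get(), lowerTypeValue(t)).first;  // compute once per distinct type
  }
  return it->second;
}

That would cut the repeated LLVM calls during compilation without changing run-time behaviour.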

kthielen commented 4 years ago

Thinking about this a little bit more, I think that this could actually be very useful for us to squeeze even more efficiency out of the case where we want to convert our structured logs back to text (maybe not the most exciting use-case, but a valuable use-case anyway).

For structured logs (as produced by the header-only fregion.H lib) we consume this giant variant type over all possible log statements, passed through variantApp to do efficient generic deconstruction (basically the same as what you'd get with a non-generic match or case deconstruction). This is already very efficient, but we have some overhead by using closures for this and capturing as much context out of the "static" format string (static at the point that the file is loaded), when we could instead map those format strings into functions that are as efficient as possible (up to just the dynamic data that decides per-statement what the output string should look like).

smunix commented 4 years ago

This is sweet; the ability for hobbes to support dependent types is fantastic. @adam-antonik, would you mind adding a section about this in our documentation as well?

:+1: for choosing printf to demonstrate the use-case here; it succs (pun intended) less than the usual Peano natural number demos.

adam-antonik commented 4 years ago

Sorry for disappearing from this recently. I hope to have time to get back to it soonish. I haven't abandoned this at all, but boring work has taken priority this month.

kthielen commented 4 years ago

No rush!

kthielen commented 4 years ago

I just noticed that the builds/tests appear to be stuck for this PR, although if you click through you'll see that they all ran and passed.

I think the reason is that you sent the PR from the master branch of your fork. I have hit this same problem before, and have found that I can't get Travis CI to indicate status correctly unless I submit from a non-master branch of my fork.