brendanzab opened this issue 6 years ago
@evincarofautumn is experimenting with a build system style of query-driven architecture:
You can model it as a build system, where the artifacts are data structures and the tasks are compiler passes, although I don't know if that's new or just new to me. This naturally extends to things like caching and incremental builds as long as your data structures are granular enough; that is, you don't necessarily need to use explicitly incremental data structures, you just need to be able to divide the work into small chunks which are not transformed incrementally. This model seems to make things like concurrency and logging/tracing easy to add, especially in Haskell.
By the first point I mean that, instead of "this function changed, therefore recompile the whole file" you get "this function changed, therefore recompile that function", but you still have "relink the whole file". If you had fully incremental data structures, you'd get "modify just the part of the linker output that changed". It seems like a good compromise to me in terms of understanding the implementation while still getting pretty good performance and interactivity, but we'll see.
Incrementurtles all the way down sounds appealing, but I don't know if I want to go down that rabbit hole…turtle hole…you know what I mean. Especially because my intuition is that the overhead of using pervasively incremental data types and figuring out how to apply changes might be more costly than just doing full rebuilds (of smallish things).
https://gitter.im/pikelet-lang/Lobby?at=5b58fe9632d98c2ed2b58eac
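To make that framing a bit more concrete, here is a minimal Rust sketch of the idea, under the assumption of hypothetical per-definition tasks (`typecheck_def`) whose artifacts are cached against the source they were built from, plus a coarse whole-module "link" step. None of these names come from an actual compiler.

```rust
use std::collections::HashMap;

type DefId = u32;

#[derive(Clone, PartialEq)]
struct Source(String);

#[derive(Clone)]
struct CheckedDef(String); // stand-in for a typechecked definition

#[derive(Default)]
struct Compiler {
    sources: HashMap<DefId, Source>,
    // Cache of per-definition artifacts, keyed by the source they were built from.
    checked: HashMap<DefId, (Source, CheckedDef)>,
}

impl Compiler {
    fn set_source(&mut self, def: DefId, text: &str) {
        self.sources.insert(def, Source(text.to_string()));
    }

    // Task: typecheck one definition, reusing the cached artifact if its input is unchanged.
    // "This function changed, therefore recompile that function."
    fn typecheck_def(&mut self, def: DefId) -> CheckedDef {
        let src = self.sources[&def].clone();
        if let Some((old_src, artifact)) = self.checked.get(&def) {
            if *old_src == src {
                return artifact.clone();
            }
        }
        let artifact = CheckedDef(format!("checked({})", src.0)); // pretend this is expensive
        self.checked.insert(def, (src, artifact.clone()));
        artifact
    }

    // Task: "relink the whole file" -- still coarse-grained, rebuilt from the per-definition
    // artifacts every time rather than patched incrementally.
    fn link_module(&mut self) -> String {
        let mut defs: Vec<DefId> = self.sources.keys().copied().collect();
        defs.sort();
        defs.into_iter()
            .map(|def| self.typecheck_def(def).0)
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```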
@graydon on twitter:
The (very slowly emerging) consensus I’m seeing is to structure this mid section roughly like some variant of an efficient datalog engine. To whatever degree of sophistication in caching, parallelism and incrementalism your implementation can manage.
I think it's tempting to try to get away with a multi-pass, straight-line compilation model and tbqh I'm still sympathetic enough to it to try to design languages that are amenable. But if the language design doesn't fit, it's a costly mistake to build the compiler that way.
(It's not at all clear to me that users are that hostile to the constraints imposed on a language by designing for it. Suspect the benefits in speed and simplicity of compilation might persuade them! But we often write cheques for ease-of-use features that we must then cash.)
@Blaisorblade on twitter:
Do you count Dotty’s “compilers are databases” under this consensus? Dotty also fits this thread because basically the whole program is a giant mutually recursive object, so your typechecker needs to be (I suspect) a productive corecursive program:
The type of each decl. is a lazy thunk that processes the given type or infers one. Forcing it might force other thunks (either to infer the type from the body, or to force declarations used in the type). It’s an error if the thunk forces itself before creating a result.
To some extent that’s already captured by DOT’s typing rule for objects, tho this rule should be coinductive instead.
Γ, self : T ⊢ decls : T
───────────────────────────────────────────
Γ ⊢ new { self => decls } : mu(self => T)
Having reverse-engineered this from Dotty, I wonder how it compares with other compilers. But duplicating validateDecl doesn’t seem a problem here. OTOH, blog posts on Kentucky Mule (rsc’s father) suggest that such laziness is a significant performance problem...
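As a rough illustration of the lazy-thunk scheme described above (a sketch only, not how Dotty actually implements it), each declaration's type could sit behind a slot that is marked in-progress while it is being forced, so a thunk that re-enters itself before producing a result is reported as an illegal cycle:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Clone)]
enum Slot {
    NotComputed,
    InProgress,   // the thunk has been forced but has not produced a result yet
    Done(String), // stand-in for a computed type
}

struct Decls {
    // The body of each declaration, reduced to the names of the declarations it mentions.
    bodies: HashMap<&'static str, Vec<&'static str>>,
    types: RefCell<HashMap<&'static str, Slot>>,
}

impl Decls {
    fn type_of(&self, name: &'static str) -> Result<String, String> {
        let slot = self
            .types
            .borrow()
            .get(name)
            .cloned()
            .unwrap_or(Slot::NotComputed);
        match slot {
            Slot::Done(ty) => return Ok(ty),
            Slot::InProgress => return Err(format!("cyclic reference while typing `{name}`")),
            Slot::NotComputed => {}
        }
        self.types.borrow_mut().insert(name, Slot::InProgress);

        // "Forcing it might force other thunks": type the declarations this one uses first.
        let mut deps = Vec::new();
        for &dep in self.bodies.get(name).into_iter().flatten() {
            deps.push(self.type_of(dep)?);
        }

        let ty = format!("ty({name}) using [{}]", deps.join(", ")); // pretend this was inferred
        self.types.borrow_mut().insert(name, Slot::Done(ty.clone()));
        Ok(ty)
    }
}
```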
More twitter chats:
Graydon:
I would say that most compilers I've seen are very embarrassingly structured and don't adequately isolate/track dependencies at the crude level they "ought" to be able to from the language semantics. So wind up redoing vast amounts of work every run.
Much of this, in turn, derives from the extremely convoluted rules for name lookup in virtually every real language. If you can't easily and precisely figure out what a source entity depends on name-wise, you're cooked, gotta back off to the whole TU. Which is really common!
Like if I had one piece of advice for language design it'd be to keep your name lookup rules as simple as possible, not intertwined with type checking, not involving any forms of global overload disambiguation, argument-dependence or whatever. It's a hairball almost everywhere.
https://twitter.com/graydon_pub/status/1039793635175190528
Me:
Hum, interesting. Might using dependent records for my module system make this horrific?
Graydon:
Might. Not sure. Worth considering in some detail: you want def/use dependence relation to snap into focus cheaply, not involve either "searching an unbounded number of places" or "doing a lot of work -- importing or deduction -- at each place you have to look" for each name.
Like do a thought experiment where you change only one def with only one use in a library with a million defs, and figure out how small you can make the constant (ideally zero) on the O(million) work you're going to do to figure out there's only one use, and exclude the rest.
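One way to picture that thought experiment (purely illustrative, not any particular compiler's scheme) is a reverse def-to-use index maintained as name resolution runs, so that the work triggered by an edit is proportional to the users of the changed definition rather than to the million definitions in the library:

```rust
use std::collections::{HashMap, HashSet};

type DefId = u32;

#[derive(Default)]
struct DepGraph {
    // For every definition, which definitions mention it by name.
    users_of: HashMap<DefId, HashSet<DefId>>,
}

impl DepGraph {
    // Called by name resolution each time `user` resolves a name to `used`.
    fn record_use(&mut self, user: DefId, used: DefId) {
        self.users_of.entry(used).or_default().insert(user);
    }

    // Everything that must be re-checked after `changed` is edited: the definition
    // itself plus the transitive closure of its users. If the changed def has one
    // user, this visits two definitions, not a million.
    fn dirty_set(&self, changed: DefId) -> HashSet<DefId> {
        let mut dirty = HashSet::new();
        let mut stack = vec![changed];
        while let Some(def) = stack.pop() {
            if dirty.insert(def) {
                if let Some(users) = self.users_of.get(&def) {
                    stack.extend(users.iter().copied());
                }
            }
        }
        dirty
    }
}
```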
Blaisorblade:
So @gkossakowski's Kentucky Mule relates exactly to this problem, as it's about very quickly building the "symbol table", which is sequential, and then compiling bodies independently. No chance for return type inference tho! (And he complains about Scala lookup rules, too!)
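Sketching that split in Rust (Kentucky Mule itself is Scala, and all names here are made up): a fast sequential outline pass records the declared signature of everything, and bodies are then checked against the finished table independently of one another, which is also why return-type inference doesn't fit, since a body's result would have to feed back into the table.

```rust
use std::collections::HashMap;

struct Decl {
    name: String,
    declared_type: String, // must be written out in full: no inference from the body
    body: String,
}

// Phase 1: a fast, sequential outline pass over all declarations builds the symbol table.
fn build_symbol_table(decls: &[Decl]) -> HashMap<String, String> {
    decls
        .iter()
        .map(|decl| (decl.name.clone(), decl.declared_type.clone()))
        .collect()
}

// Phase 2: each body is checked against the finished table, so these checks are
// independent of one another and could run in parallel or in any order.
fn check_body(table: &HashMap<String, String>, decl: &Decl) -> Result<(), String> {
    // Stand-in for real typechecking: every word in the body must name a known symbol.
    for word in decl.body.split_whitespace() {
        if !table.contains_key(word) {
            return Err(format!("`{}` refers to unknown symbol `{word}`", decl.name));
        }
    }
    Ok(())
}
```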
Looking at Salsa - it's still early days for it, but it looks like it will be close to what we'll need! Here's an example.
From speaking with Niko, it seems that Salsa doesn't handle 'sub-term' incrementality. This could be an issue given that we are going with a 1ML-style language, where modules are just part of the term language.
Just added a link to @ollef's description of incremental compilation in Sixten.
Lucent by @Techno-coder is moving to a query-driven approach: https://github.com/Techno-coder/lucent/blob/next-prototype/src/query/context.rs
It's a hand-rolled version of the rustc approach, rather than using something like Salsa.
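The general shape of such a hand-rolled setup (a loose sketch, not Lucent's or rustc's actual code, with invented query names) is a single context owning one memo table per query, where each query function checks its table before doing any work:

```rust
use std::collections::HashMap;

type Path = String;
type Symbol = String;

#[derive(Default)]
struct Context {
    // One memo table per query kind.
    source: HashMap<Path, String>,
    symbols: HashMap<Path, Vec<Symbol>>,
}

// Query: the symbols declared in a file, computed at most once per path.
fn symbols(ctx: &mut Context, path: &Path) -> Vec<Symbol> {
    if let Some(cached) = ctx.symbols.get(path) {
        return cached.clone();
    }
    let text = ctx.source.get(path).cloned().unwrap_or_default();
    // Stand-in for parsing: one "symbol" per whitespace-separated word in the file.
    let result: Vec<Symbol> = text.split_whitespace().map(|word| word.to_owned()).collect();
    ctx.symbols.insert(path.clone(), result.clone());
    result
}
```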
Interesting comment from @matklad: https://lobste.rs/s/afqbdk/individual_element_thinking_vs_grouped#c_gns5uo
Finally, to speak about something I actually do, rust-analyzer is rather game-like. There's a bunch of "data", there's a main loop accepting input from the client, doing "game update" and "rendering" the end result (completions, highlighting and such) to the client. It pains me that today this works by having millions of individually allocated things pointing at each other. I envy sorbet's (type checker for Ruby) architecture, where they have a GlobalState, which holds a bunch of flat vectors, and that's basically it. Sadly, doing just that in rust-analyzer is not trivial, as we have a pretty intricate incremental computation setup, which, in the current incantation, pretty much forces a tree of individually owned objects.
Links to Sorbet's GlobalState type, and a blog post that talks about some of the design decisions: Why the Sorbet typechecker is fast.
Doing something like this might be nicer than using Salsa if we want the Pikelet compiler to be embeddable in other programs.
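A sketch of what that flat-vector style could look like for us in Rust (Sorbet itself is C++, and these field names are invented): all symbols live in vectors owned by one GlobalState and refer to one another by index, so the whole state is a single plainly-owned value that an embedding application could hold, clone, or snapshot.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct SymbolId(u32);

#[derive(Clone)]
struct Symbol {
    name: String,
    owner: SymbolId,        // index of the enclosing symbol, not a pointer
    members: Vec<SymbolId>, // indices into GlobalState::symbols
}

#[derive(Clone)]
struct GlobalState {
    symbols: Vec<Symbol>,
}

impl GlobalState {
    fn new() -> GlobalState {
        // Slot 0 is the root symbol, which owns itself.
        GlobalState {
            symbols: vec![Symbol {
                name: "<root>".to_owned(),
                owner: SymbolId(0),
                members: Vec::new(),
            }],
        }
    }

    fn enter_symbol(&mut self, name: &str, owner: SymbolId) -> SymbolId {
        let id = SymbolId(self.symbols.len() as u32);
        self.symbols.push(Symbol {
            name: name.to_owned(),
            owner,
            members: Vec::new(),
        });
        self.symbols[owner.0 as usize].members.push(id);
        id
    }

    fn symbol(&self, id: SymbolId) -> &Symbol {
        &self.symbols[id.0 as usize]
    }
}
```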
Before we go too far down the path of building a traditional compiler (#9), it probably makes sense to start thinking about how we might incrementalise things. This will be super important for supporting a good editor experience (#97). If we put this off for too long we might end up having to rebuild a large chunk of the compiler later; this is what Rust is facing, for example.
Without knowing much about it, perhaps something like the Incremental Lambda Calculus would be handy for this. We could try to find the derivative of each pass of our compiler, based on the result of a previous run. This could also be a helpful framework for formalising our incremental compiler (#39)!
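To make the "derivative of a pass" idea slightly more concrete, here is a toy sketch (nowhere near the full Incremental Lambda Calculus formalism): a deliberately trivial pass that counts definitions per module, plus a derivative that patches the previous result from a description of the change instead of rerunning the pass over all modules.

```rust
use std::collections::HashMap;

type Module = String;

enum Change {
    AddDef(Module),
    RemoveDef(Module),
}

// The pass: count the definitions in each module, from scratch.
fn def_counts(defs: &[(Module, String)]) -> HashMap<Module, usize> {
    let mut counts = HashMap::new();
    for (module, _def) in defs {
        *counts.entry(module.clone()).or_insert(0) += 1;
    }
    counts
}

// The "derivative" of the pass: patch the previous result given a change to the input,
// so that it agrees with rerunning `def_counts` on the changed input.
fn def_counts_derivative(previous: &mut HashMap<Module, usize>, change: &Change) {
    match change {
        Change::AddDef(module) => *previous.entry(module.clone()).or_insert(0) += 1,
        Change::RemoveDef(module) => {
            if let Some(count) = previous.get_mut(module) {
                *count = count.saturating_sub(1);
            }
        }
    }
}
```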
CRDT-style data structures could also be of potential use, and perhaps projects like timely-dataflow and differential-dataflow.
Unresolved Questions
What is the difference between 'incremental' and 'query-driven'?
Resources