Nim goto intermediate representation (GIRN)

Araq commented 1 year ago

Abstract

I outline a new IR for the Nim compiler that aims to reduce the compiler's overall complexity.

Motivation

Nim's IR is based on its AST: While the AST served us well in the past, it's now clear that many of Nim's features like "definite assignment analysis", "nil checking" or its "move analyser" would be easier with a different program representation. GIRN is that different representation.

Description

I called it the "Goto intermediate representation for Nim (GIRN)". The type system is left untouched and everything is strictly typed. GIRN focuses on a control flow abstraction. Function calls can remain nested. Compared to the AST GIRN contains these major changes:

A proc body is a single list of instructions and control flow is mapped to labels and goto. The goto instruction is designed/handicapped to keep the control flow graphs reducible so that no complex fixpoint algorithms are required.
Additional computed information is stored alongside the instructions:
- kill x. Indicates that the scope of x has ended and that x will not be used afterwards.
- assume cond. At this point we know that cond holds.
- require cond. At this point we need to prove that cond holds.
- assumeHasValue x. After this point x will have a value (but it's not specified which one).
Loop labels and goto loopLabel are distinct from labels which are the targets of forward jumps.
There are only case statements, no if statements. A backend can easily convert a case statement that only has one or two branches into an if statement.

GIRN has the following properties:

Easy to do inlining.
Easy to do "nil" safety checking.
Easy to do array bounds checking.
Easy to do "is last use" computation.
Easy to do "borrow checking".
Easy to perform peephole optimizations.
Easier to implement closure iterators.

The JS backend would benefit from reliable frontend inlining and once we have it we can properly distinguish between .inline which should be always inline and .hot and .cold procs.

Code Examples

The examples here focus on the "move analyser". The move analyser does both control flow and data flow analysis.

The data flow aspect needs assumeHasValue as a summary instruction in order to be effective:


use x # last usage of `x` because `assumeHasValue x` follows:
case cond
of true:
  x = value
else:
  passToOutParameter x
# computed:
assumeHasValue x
use x

Loops are interesting: If the loop "comes after" the definition of the variable it is an "inner loop" and we have to watch out:

var x = foo()
while cond:
  use x     # this is not the last use!

Is translated into:

var x = foo()
loopLabel L1
case cond
of false: goto LoopExit
use x
jmpBack L1
label LoopExit
kill x

It is reasonably easy to see that use x is not the last use.

Compare that with:

while cond:
  var x = foo()
  use x     # this is the last use!

Which is translated to:

loopLabel L1
case cond
of false: goto LoopExit
var x = foo()
use x
kill x
jmpBack L1
label LoopExit

Again, it is easy to see what is really going on. kill instructions model the scopes effectively and also help with destructor injections.

Backwards Compatibility

There are no incompatibilities as it only affects the passes after macro expansions.

Open problem

How to model try finally so that it can still be translated to try for the JavaScript and C++ backends. Aside from that exceptions are simple to model via an explicit on error: goto errorHandler instruction after a function call that can throw.

juancarlospaco commented 1 year ago

It totally sounds like the n-word, what about NIR ?. 🤔

It will be like a file written to disk or sqlite or just an object in memory?.

There are only case statements, no if statements. A backend can easily convert a case statement that only has one or two branches into an if statement.

So it has to convert if into case and then back to if again ?.

Araq commented 1 year ago

So it has to convert if into case and then back to if again?

Yes.

zevv commented 1 year ago

@Araq: please comment on "It totally sounds like the n-word, what about NIR ?" as well.

ZoomRmc commented 1 year ago

Shouldn't it read more like "anger" then? Just add a 'b' if you're concerned: "Nim goto-based intermediate representation": "Angie-Beer".

Well, the fact I'm more inclined to discuss the name shows how qualified I am to discuss the matter of the RFC.

Araq commented 1 year ago

@Araq: please comment on "It totally sounds like the n-word, what about NIR ?" as well.

En-gee-I-AR, doesn't sound like the n-word at all to me and I made no such connection when I made it up.

beef331 commented 1 year ago

In a world where people pronounce JSON "J-son" or GUI "Gooey", perhaps it's not pronounced as it's abbreviation.

ee7 commented 1 year ago

I agree the acronym is unfortunate. The fact that it has already been a distraction in this RFC shows that it is likely to continue to be a distraction.

Is there a problem with NIR?

En-gee-I-AR, doesn't sound like the n-word at all to me

This is similar to "FCKR: eff-see-kay-arr - doesn't sound like the f-word".

First, it's not all about sound: visual similarity also matters.

Second, acronyms in English can be pronounced without spelling out the letters. For example:

ASAP (ay-sap)
AWOL (ay-wol)
ISO (eye-so)
JPEG (jay-peg)
LOL (usually not ell-oh-ell)
MIDI (mid-ee)
YOLO (yo-lo)
ZIP (not zee|zed-eye-pee)

moigagoo commented 1 year ago

It totally sounds like the n-word, what about NIR ?. 🤔

I thought I was the only one who saw it 😁 I thought Araq must have called it Nim Intermediate Goto Representation but that would be a horrible acronym so he changed it slightly.

Personally, I don't think it's a problem. If anyone chooses to see slurs where there's none, it's their personal issue.

However, if something must be done here, I suggest GIR. Python's GIL isn't called PGIL, it's just GIL. So, Nim's GIR can just be GIR 🤷

Araq commented 1 year ago

For the sake of moving forward I renamed it to GIRN but this is ridiculous.

zacharycarter commented 1 year ago

Even if it seems ridiculous, which if we're going to pick sides regarding how to feel about the naming, I would err on - amending the acronym is and was the safe bet. People are way too sensitive and polarized around identity at the moment. We've also been down this road as a community before and also for some of us as individuals.

Thanks for taking quick action on this. Also, a side note for everyone reacting to the original acronym: even if we find something offensive personally, that doesn't imply that the person you take offense from was acting with any specific intent or malice. Remember we are all human beings before anything else and we're all deserving of one another's respect.

Also, kudos on the compiler factoring. It's always nice to read about complexity being reduced.

beef331 commented 1 year ago

Given assumeHasValue has hasValue sym, val been considered? I think it'd allow value narrowing enabling more compile time checks and evaluations.

Araq commented 1 year ago

Given assumeHasValue has hasValue sym, val been considered? I think it'd allow value narrowing enabling more compile time checks and evaluations.

That's already covered via nkAsgn x, value.

arnetheduck commented 1 year ago

How to model try finally

Exceptions in llvm are modeled by making function call act as a conditional, essentially, with two return points, pretty much like an if statement: one for the normal case and one for the exceptional case - modelling it this way is similar to the boolean "if err" after each function call, but I suspect dedicating a call type (llvm calls it invoke) for exception handling is more compact and leads to less information loss.

One side effect of this strategy is that finally code "blocks" look like they get duplicated in both the happy and unhappy case - this is "more or less" what finally means, that you run the same code in both branches: while on the surface this may seem like duplication, it is not - the context in which finally runs during exception handling is different from that the "happy" case exit and different rules apply (ie different things have happened leading up to that point, including which variables are live and the state of globals and other observable side effects) - in particular, you will need to decide on the semantics of raising inside finally (or inside destructors), which is fraught with issues as the many bugs in Nim in this area show.

can still be translated

translations like these lead to information loss - while some of that information can be recovered with analysis, such analysis slows down the compiler and still sometimes recreates imperfect information - this is something to consider in general: the semantics, and in particular the constraints, of the language offer opportunities that easily get lost in translation ("this is a copy-on-write construct, so read-only copies are cheap") - we can mitigate this to some extent, but like the try/finally most clearly shows, there will be other cases that are more subtle.

Araq commented 1 year ago

One side effect of this strategy is that finally code "blocks" look like they get duplicated in both the happy and unhappy case

That's one thing I wanted to avoid, yes, but the question remains how to model it so that it can easily be recovered for JS/C++ codegen.

elcritch commented 1 year ago

One side effect of this strategy is that finally code "blocks" look like they get duplicated in both the happy and unhappy case

That's one thing I wanted to avoid, yes, but the question remains how to model it so that it can easily be recovered for JS/C++ codegen.

So this wouldn't be as simple as having try tags or whatnot? I'm guessing it'd require going through multiple blocks to reconstruct? What about a "try(blocks: a, b, d)" meta tag? Though I haven't had a chance to read through the details. Stil this goto form seems pretty nifty!

One question I had -- does this give more opportunities to do static analysis or would that primarily remain in the AST level?

P.S. thanks for the name change. It's ridiculous but just avoids a whole potential can of worms given things in the USA, and was like the second thing that came to mind.

Araq commented 1 year ago

Maybe it works with more special kinds of labels:


try:
  f(a)
finally:
  echo "done"

becomes (roughly):

TryLabel Lunused
f(a)
case raised
of true: goto L1
FinallyLabel L1
echo "done"
goto ProcEnd
label ProcEnd

elcritch commented 1 year ago

That seems plausible? It'd at least keep the information in the IR. Though, would it require backends to have to handle the TryLabel, FinallyLabel types or would they just be metadata?

Araq commented 1 year ago

The backends need to handle them but it's a weird question, they exist for the backends.

arnetheduck commented 1 year ago

The backends need to handle them but it's a weird question, they exist for the backends.

speaking of labels, LLVM has a similar mechanism (IR metadata) which basically allows attaching any kind of metadata to any IR node - it is used for transporting pass-specific metadata to later stages, for example type information to the type-based alias analysis pass - language frontends that are statically typed can "opt in" to generate the type data - similarly, extended control-flow information or "known ranges of values" could be propagated across passes which would alleviate the need to recover them via costly analysis but still allow analysis to enhance the IR along the way.

Having a "generic" metadata passing capability is something nlvm certainly would benefit from (so that it can attach backend-specific data to nodes) - here's how I imagine it would work:

backends declare what kind of metadata they require
the frontend, while transforming the code, emits the requested kind of metadata
passes that transform code by default strive to preserve metadata on a "best-effort" basis

Why only "requested"? TBAA is a good example: if you're compiling without optimizations, you don't want to waste time on generating and storing type information which later will be ignored.

elcritch commented 1 year ago

I've been following the OCaml Effect handler's for a while and something about GIRN & goto's tickles my brain as having possible overlap. Perhaps it's just the idea that with the OCaml style effects non-local control flow, exceptions, async, et al are just instances of the same underlying mechanism.

Would there be a way to do something similar with the goto's and perhaps a set of metadata for non-local jumps? It might be possible to re-use it for analyzing closure iterator's used for async as well. Perhaps not for runtime composable effects, but as an underlying coherent schema for non-local control?

Araq commented 1 year ago

I've read the paper and it doesn't really apply. It uses a new runtime and a type (or rather effect) system extension. GIRN is much simpler (to implement).

GIRN is also not a CPS representation as for that you need the notion of a parametrized goto instruction like goto fn(a, b, c).

elcritch commented 1 year ago

GIRN is also not a CPS representation as for that you need the notion of a parametrized goto instruction like goto fn(a, b, c).

Ah makes sense that GIRN wouldn't encode the CPS-level representation. Just more of a hopeful thought than anything too serious -- though it was interesting to consider how CSP-level info could be used to provide non-exception effect handlers for handling things like overflows.

On another note, it'd be interesting how the kill, assume, require could be used in a SPARK like system for formal verifiers or runtime verifiers. Maybe DrNim would be more feasible? Though I imagine the kill and them might oriented more towards lifetimes.

Also, it's completely outside of my current wheelhouse so I'll let y'all get back to practical compiler work. :-)

mratsim commented 1 year ago

Note: don't mix CPS (continuation-passing style) and CSP (communicating sequential processes)

elcritch commented 1 year ago

Eh, thanks I'm guilty -- fixed my post.

hugosenari commented 3 months ago

NIR is GIRN that was NGIR :black_cat: ?

nim-lang / RFCs