nim-lang / RFCs

A repository for your Nim proposals.

Incremental compilation #46

Closed Araq closed 3 years ago

Araq commented 6 years ago

Currently, in larger Nim projects the Nim compiler is the bottleneck, not the C/C++ compiler. This is caused by the fact that the Nim compiler always recompiles the full program, including all used libraries. This is not necessary: Nim should use a "compilation cache" so that only the modules that were changed (plus, possibly, the modules that depend on them) are recompiled. The current idea is to base this compilation cache on an SQLite database. This feature is crucial for Nim's scalability to bigger projects.

Araq commented 6 years ago

The implementation of the compilation cache is tricky: there are lots of issues to be solved for both the frontend and the backend.

General approach: AST replay

We store a module's AST after a successful semantic check in an SQLite database. There are plenty of features that require a subsequence of the top-level statements to be re-applied, for example:


  {.compile: "foo.c".} # even if the module is loaded from the DB,
                       # "foo.c" needs to be compiled/linked.

The solution is to replay the module's top-level statements. This solves the problem without having to special-case the logic that fills the internal seqs affected by the pragmas.

In fact, this describes how the AST should be stored in the database: as a "shallow" tree. Let's assume we compile module m with the following contents:


  import strutils

  var x*: int = 90
  {.compile: "foo.c".}
  proc p = echo "p"
  proc q = echo "q"
  static:
    echo "static"

Conceptually this is the AST we store for the module:


  import strutils

  var x*
  {.compile: "foo.c".}
  proc p
  proc q
  static:
    echo "static"

The symbol's body is loaded lazily, on demand. This is where most of the savings come from: only the shallow outer AST is reconstructed immediately.

It is also important that the replay includes the import statements so that the dependencies are resolved properly.

Shared global compiletime state

Nim allows .global, compiletime variables that can be filled by macro invocations across different modules. This feature breaks modularity in a severe way. Plenty of different solutions have been proposed (they are not mutually exclusive).

Since we adopt the "replay the top level statements" idea, the natural solution to this problem is to emit pseudo top-level statements that reflect the mutations done to the global variable. However, this is MUCH harder than it sounds; for example, squeaknim uses this snippet:


  apicall.add(") module: '" & dllName & "'>\C" &
              "\t^self externalCallFailed\C!\C\C")
  stCode.add(st & "\C\t\"Generated by NimSqueak\"\C\t" & apicall)

We can "replay" stCode.add only if the values of st and apicall are known. And even then a hash table's add with its hashing mechanism is too hard to replay.

In practice, things are worse still, consider someGlobal[i][j].add arg. We only know the root is someGlobal but the concrete path to the data is unknown as is the value that is added. We could compute a "diff" between the global states and use that to compute a symbol patchset, but this is quite some work, expensive to do at runtime (it would need to run after every module has been compiled) and also would break for hash tables.

We need an API that hides the complex aliasing problems by not relying on Nim's global variables. The obvious solution is to use string keys instead of global variables:


  proc cachePut*(key: string; value: string)
  proc cacheGet*(key: string): string

However, the values being strings or JSON is quite problematic: many lookup tables that are built at compile time embed proc vars and types which have no obvious string representation... It seems AST diffing is still the best idea, as it does not require an alien API and works with at least some existing Nimble packages.

On the other hand, in Nim's future I would like to replace the VM with native code. A diff algorithm wouldn't work for that; instead, the native code would work with an API like put/get:


  proc cachePut*(key: string; value: NimNode)
  proc cacheGet*(key: string): NimNode

The API should embrace the AST diffing notion: See the module macrocache for the final details.

Methods and type converters

In the following sections global means shared between modules or property of the whole program.

Nim contains language features that are global. The best example of this is multi methods: introducing a new method with the same name and a compatible object parameter means that the method's dispatcher needs to take the new method into account. So the dispatching logic is only completely known after the whole program has been translated!

Other features that are implicitly triggered cause problems for modularity too. Type converters fall into this category:


  # module A
  converter toBool(x: int): bool =
    result = x != 0

  # module B
  import A

  if 1:
    echo "ugly, but should work"

If in the above example module B is re-compiled but A is not, then B needs to be aware of toBool even though toBool is not referenced in B explicitly.

Both the multi method and the type converter problems are solved by the AST replay implementation.

Generics

We cache generic instantiations and need to ensure this caching works well with the incremental compilation feature. Since the cache is attached to the PSym data structure, it should work without any special logic.

Backend issues

However, the biggest problem is that dead code elimination breaks modularity! To see why, consider this scenario: module G (for example, the huge Gtk2 module...) is compiled with dead code elimination turned on, so none of G's procs are generated at all.

Then module B is compiled that requires G.P1. Ok, no problem, G.P1 is loaded from the symbol file and G.c now contains G.P1.

Then module A (that depends on B and G) is compiled and B and G are left unchanged. A requires G.P2.

So now G.c MUST contain both P1 and P2, but we haven't even loaded P1 from the symbol file, nor do we want to, because then we would quickly end up restoring large parts of the whole program.

Solution

The backend must have some logic so that if the currently processed module is from the compilation cache, the body is not accessed. Instead the generated C(++) for the symbol's body needs to be cached too and inserted back into the produced C file. This approach seems to deal with all the outlined problems above.
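One way to picture this (a sketch with invented names, not the compiler's actual data structures): the backend keeps a per-symbol cache of generated C fragments and consults it before running codegen, so an unchanged module's bodies are spliced back into the produced C file without being loaded or re-translated.

```nim
import tables

type CodegenCache = object
  ## Maps a mangled symbol name to its previously generated C fragment.
  fragments: Table[string, string]

proc getOrGenerate(c: var CodegenCache; sym: string;
                   generate: proc (): string): string =
  ## On a hit, reuse the stored fragment; only on a miss do we run
  ## codegen (and remember the result for the next compilation).
  if sym notin c.fragments:
    c.fragments[sym] = generate()
  result = c.fragments[sym]
```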

Araq commented 6 years ago

Appendix: macrocache.nim


## This module provides an API for macros that need to collect compile
## time information across module boundaries in global variables.
## Starting with version 0.19 of Nim this is not directly supported anymore
## as it breaks incremental compilation.
## Instead the API here needs to be used. See XXX (wikipedia page) for a
## theoretical foundation behind this.

type
  CacheSeq* = distinct string
  CacheTable* = distinct string
  CacheCounter* = distinct string

proc value*(c: CacheCounter): int {.magic: "NccValue".}
proc inc*(c: CacheCounter; by = 1) {.magic: "NccInc".}

proc add*(s: CacheSeq; value: NimNode) {.magic: "NcsAdd".}
proc incl*(s: CacheSeq; value: NimNode) {.magic: "NcsIncl".}
proc len*(s: CacheSeq): int {.magic: "NcsLen".}
proc `[]`*(s: CacheSeq; i: int): NimNode {.magic: "NcsAt".}

iterator items*(s: CacheSeq): NimNode =
  for i in 0 ..< len(s): yield s[i]

proc `[]=`*(t: CacheTable; key: string, value: NimNode) {.magic: "NctPut".}
  ## 'key' has to be unique!

proc len*(t: CacheTable): int {.magic: "NctLen".}
proc `[]`*(t: CacheTable; key: string): NimNode {.magic: "NctGet".}

proc hasNext(t: CacheTable; iter: int): bool {.magic: "NctHasNext".}
proc next(t: CacheTable; iter: int): (string, NimNode, int) {.magic: "NctNext".}

iterator pairs*(t: CacheTable): (string, NimNode) =
  var h = 0
  while hasNext(t, h):
    let (a, b, h2) = next(t, h)
    yield (a, b)
    h = h2

This is the upcoming API that replaces the .global, compileTime vars currently available to Nim's macro system; they need to be deprecated since they cannot work with incremental compilation.

Please try to imagine whether your current macros could be based on this API, and tell me. :-) Note again that only global compileTime variables that are used across module boundaries are affected. The vast majority of macro code out there is not affected.
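As a concreteness check, here is how a cross-module "gather" macro might look on top of this API (a sketch: the cache key, macro names, and overall shape are invented, not part of the proposal):

```nim
import macros, macrocache

const registeredTypes = CacheSeq"myPkg.registeredTypes"  # key is illustrative

macro register(t: typed): untyped =
  ## "Gather": record a type symbol in the shared compile-time seq;
  ## any module importing this macro can contribute entries.
  registeredTypes.add t
  result = newEmptyNode()

macro numRegistered(): untyped =
  ## Read the gathered information back out as an int literal.
  result = newLit(registeredTypes.len)

register(int)
register(string)
echo numRegistered()
```

Because the CacheSeq lives in the compilation cache rather than in a .global compileTime var, replaying a recompiled module would simply re-record its entries.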

zah commented 6 years ago

Araq and I have discussed this issue many times in private conversations, so I'll give a bit more background information.

The main problem with global compile-time variables is that they should be deterministic in the presence of arbitrary module recompilations. For this to work reliably, every time you recompile a module you must first undo all of its previous effects on the global variables and then apply the effects produced by the changed code. Undoing and redoing changes to a data structure is not trivial, because the outcome may depend on the order of operations (imagine a simple set type with incl and excl being used from multiple modules that can be reordered in an arbitrary way).

Luckily, there is a category of data structures that support manipulation from multiple participants in arbitrary order while always arriving at the same result: the so-called Conflict-free Replicated Data Types (CRDTs). CRDTs include counters, sets and other useful primitives that can serve as the basis for implementing everything else (from set theory, we know that sets are enough to represent any concept).

Thus, to increase modularity, the proposed solution is to limit the interaction with global compile-time variables to a set of operations over CRDT types, which will be recorded by the compiler. This could work by blessing certain types for use in compile-time vars or by introducing a low-level API like the one described above, allowing you to build abstractions on top of it.
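For illustration, here is a minimal grow-only counter, the simplest CRDT (a standalone sketch, not tied to the compiler): each participant, here a module, owns a slot, and merging two replica states is order-independent.

```nim
import tables

type GCounter = object
  ## Grow-only counter: one non-negative slot per participant (module).
  slots: Table[string, int]

proc inc(c: var GCounter; module: string; by = 1) =
  c.slots[module] = c.slots.getOrDefault(module) + by

proc merge(a: var GCounter; b: GCounter) =
  ## Commutative, associative and idempotent: merging module states
  ## in any order (or twice) yields the same final value, which is
  ## exactly what arbitrary recompilation order requires.
  for k, v in b.slots:
    a.slots[k] = max(a.slots.getOrDefault(k), v)

proc value(c: GCounter): int =
  for v in c.slots.values: result += v
```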

It should be noted that just manipulating the compile-time variables correctly is not enough for implementing any generative programming idea. The user must also be able to generate code that is derived from the final "program-wide" value of the variable. I usually call such generative programming approaches the "Gather and Generate" pattern. Here are some examples:

1. Language bridge library

A library like nimpy or NimBorg may gather all Nim types that are being passed to Python and Lua functions (by adding them to a global compile-time set as a side-effect from instantiating a generic dot operator). It will then produce a registerNimTypes(foreignVM) proc that will register all the run-time type information required by the foreign VM for working with the Nim values.

Initially registerNimTypes exists as an empty placeholder proc (it can be marked in the code with the partial pragma). Once the final value of the global compile-time variable is determined, a "sealing" compile-time proc is executed that will generate the body of registerNimTypes.

2. RPC library

An RPC library may allow you to create RPC end-points spread across several modules. An rpc macro is used to register them in a global compile-time set. In the sealing proc, the set of RPC procs is sorted alphabetically or by time of introduction (this can be achieved with a staticWrite API that will automatically update a file stored in version history). The protocol is versioned by hashing the signatures of all RPCs, and each of them is assigned a numeric ID for maximum efficiency (derived from the sorted order). The numeric ID is written to a placeholder symbol referenced in the generated code of each RPC call.

3. User-defined multi-methods, HTTP requests router, and so on.

These are quite similar to the RPC library. The sealing proc can traverse the hierarchy of types to generate the required run-time dispatching logic, it can gather all HTTP routes and compile a single Ragel state machine for the most efficient routing possible and so on.

This Gather and Generate technique is one of the most common generative programming patterns and I have a lot of experience simulating it with global run-time variables and type erasure tricks in both C++ and Nim with the currently available functionality.
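The "Gather and Generate" pattern above can be sketched against the proposed macrocache API (all names here are invented, and the sealing mechanism is exactly the open question in this thread):

```nim
import macros, macrocache

const rpcRegistry = CacheSeq"myRpcLib.procs"  # illustrative key

macro rpc(def: untyped): untyped =
  ## Gather: remember each end-point's definition as it is declared,
  ## possibly across many modules.
  rpcRegistry.add copyNimTree(def)
  result = def

macro sealRpcs(): untyped =
  ## Generate: once all modules have registered, derive program-wide
  ## artifacts (numeric IDs, a dispatcher, ...) from the gathered set.
  result = newStmtList()
  for i in 0 ..< rpcRegistry.len:
    let p = rpcRegistry[i]
    # e.g. assign ID `i` to this proc and emit its dispatch entry here
    discard p
```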

zah commented 6 years ago

So, from the description above and Araq's current proposal, we need to address the following:

  1. How does the user introduce a new instance of a shared data structure (CacheSeq, CacheTable, CacheCounter, etc.)? Is this done by introducing a variable?

  2. Are these types CRDT data structures? How are they implemented?

  3. How is the "sealing" stage exposed to the user?

  4. Modifications to global variables can be introduced from static: code executed once per generic instantiation. For this to work, the instantiations need to be reference counted (each client module increases the count). Once a side effect is introduced, it should be removed only when the count drops back to zero. Does this play well with the rest of the proposal?

  5. If the lower-level API is based on NimNodes, can we have higher-level APIs that are strongly typed?

  6. To complete the feature, we also need the staticWrite magic for creating files that will go into version control. This is important, because it allows you to produce "cached" artifacts that will affect your co-workers as well.

Araq commented 6 years ago

1. Like this:

  const
    key = CacheCounter"package.module.myCounter" # obviously we can hide that better later.

  static:
    inc(key, 2)
    inc(key)

2. The table is an ordered BTree; the seq is a sequence with "set" semantics, but some order is exposed as you need to be able to iterate over it.
3. There is no sealing stage.
4. I don't understand this question.
5. Probably, but under the hood it can only work with NimNodes.
6. Ok, that seems rather easy to add.
tavurth commented 6 years ago

As far as I understand from this thread, we're going to see a huge speed increase for this in the future.

It seems like most of the slowdown is being caused by the top level statements being replayed on each compilation.

Does this mean that by reducing the number or complexity of top-level statements, we can reduce the processing time required for each module?

Example of current implementation method:

## ------------
## b.nim
import some_expensive_library

proc add*(a: int; b: int): int =
    return a + b

## ------------
## a.nim
import b

echo add(1, 2)

Would any performance improvement be gained by, for example, first compiling b.nim to a dylib and then writing a small wrapper which interfaces with the compiled library?

## ------------
## b.nim
import some_expensive_library

proc add*(a: int; b: int): int {.exportc: "add".} =
    return a + b

b.nim is then precompiled into ./lib/b.so.

## ------------
## b_wrapper.nim
proc add*(a, b: int): int {.importc, dynlib: "./lib/b.so".}

## ------------
## a.nim
import b_wrapper

echo add(1, 2)

Would this perhaps speed up development processing cycles?

Araq commented 6 years ago

@tavurth This describes the "simplest thing that can possibly work", and later on we have more ideas of how to make the compiler even more incremental and avoid replaying the top level statements. However, I doubt DLLs are a viable solution to this problem, as the code/AST must be available so that the compiler can perform the many compiletime evaluation features that Nim offers.

xomachine commented 6 years ago

That is my use case: the NESM library I'm working on uses a global variable ctx which contains closures for generating [de]serialization code for all previously encountered types. The closures are used to substitute the object name in the generated code, like:

proc serialize(source: NimNode): NimNode = ... # the closure
# it generates the following code
# for source = IdentNode("a")
a.writeData(...)
# for source = DotExpr(IdentNode("a"), IdentNode("b"))
a.b.writeData(...)
# and so on...

It can be rewritten to use only NimNode (with some effort, though), but I need a contains proc for CacheTable. And I want to know whether a CacheTable can be used as an argument to a function located in another module.

# moduleA.nim
proc useCache*(a: CacheTable) {.compileTime.} =
  ...

# moduleB.nim
from moduleA import useCache
const x = CacheTable("mycachetable")
...
useCache(x)

PS: it would be nice to have the ability to overwrite a cached value in the CacheTable at compile time without a 'key already exists' error.
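A contains helper along these lines could be layered on the pairs iterator from the appendix API (a sketch, not part of the proposed module):

```nim
import macros, macrocache

proc contains*(t: CacheTable; key: string): bool {.compileTime.} =
  ## Linear scan via the proposed `pairs` iterator; fine for the
  ## small tables typically built at compile time.
  for k, _ in t:
    if k == key: return true
```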

zielmicha commented 5 years ago

I have put a small bounty ($50) on Bountysource on this issue. (https://www.bountysource.com/issues/68171609-incremental-compilation)

metagn commented 3 years ago

3 questions for clarification:

  1. Are macrocache items supposed to be mutable? Because the current implementation allows this, like so:

    import macros, macrocache
    static:
      let cache = CacheSeq"foo"
      cache.add(newTree(nnkPar))
      cache[0].add(ident"abc")
  2. Will using compileTime globals instead of macrocache still be allowed, at the cost of breaking incremental compilation? The macrocache documentation says it "is not directly supported anymore", and I'm only guessing that means "problems that compileTime globals have and macrocache can fix will not be addressed" rather than "compileTime globals should never be used". I don't support the latter, because macrocache is a restriction on the power that macros have, and there are some things it might not be able to do, or won't support, that you might want to do.

  3. the seq is a sequence with "set" semantics, but some order is exposed as you need to be able to iterate over it.

    This means that they have an order, but they won't store identical references, is that correct? I don't know, because the current implementation allows you to use both "equal" and identical nodes.

Araq commented 3 years ago
  1. IIRC this won't work and shouldn't be allowed.
  2. Yes, but I consider features that "break IC" moribund. IC is not negotiable as far as I'm concerned. Maybe we need a more extensive macrocache API to support more use cases, or maybe we can get by with tracking dependencies in the VM without API additions, but IC is more important than anything else to me. Especially now that arc/orc and =hooks are production ready.
  3. They have an arbitrary order, much like iterating over a hash table.
Araq commented 3 years ago

The first milestone is complete: IC works for 4 carefully selected test cases. There will be a follow-up milestone.