nim-lang / Nim

Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).
https://nim-lang.org
Other
16.49k stars 1.46k forks source link

Function-based closure iterator implementation #23425

Open arnetheduck opened 6 months ago

arnetheduck commented 6 months ago

Summary

The current implementation of closure iterators sees the implementation use a duff's device to implement the state machine between each yield.

A more efficient approach would be to put each state in its own function and change states by assigning a function pointer (instead of an index).

Description

A nim function like:

iterator test(): int {.closure.} =
  yield 0
  yield 1
  yield 2

currently gets transformed into a giant switch like so:

N_LIB_PRIVATE N_CLOSURE(NI, test__testit_3)(void* ClE_0) {
    NI result;
    tyObject_Env_testitdotnim_test___LFfJfhA7fiXxmdXYuA4CYw* colonenvP_;
{   result = (NI)0;
    colonenvP_ = (tyObject_Env_testitdotnim_test___LFfJfhA7fiXxmdXYuA4CYw*) ClE_0;
    while (1) {
        if (!1) goto LA1;
        {
            switch ((*colonenvP_).colonstate_) {
            case -1:
             goto BeforeRet_;
            case 0: goto STATE0;
            case 1: goto STATE1;
            case 2: goto STATE2;
            case 3: goto STATE3;
            }
            STATE0: ;
            (*colonenvP_).colonstate_ = ((NI) 1);
            result = ((NI) 0);
            goto BeforeRet_;
            STATE1: ;
            (*colonenvP_).colonstate_ = ((NI) 2);
            result = ((NI) 1);
            goto BeforeRet_;
            STATE2: ;
            (*colonenvP_).colonstate_ = ((NI) 3);
            result = ((NI) 2);
            goto BeforeRet_;
            STATE3: ;
            (*colonenvP_).colonstate_ = ((NI) -1);
            goto LA2;
        } LA2: ;
    } LA1: ;
    }BeforeRet_: ;
    return result;
}

A more efficient approach would be to put the code of each state into a separate function, roughly:

test__testit_3_STATE0(...) {
...
(*colonenvP_).colonstate_ = test__testit_3_STATE1;
}
test__testit_3_STATE1(...) {..}
...

Because each state is independent, this would allow the C compiler to better analyze the function and for example perform more aggressive inlining which is more difficult when all states share a single function.

This would also open up another feature, namely that of moving variables out of the heap-based environment and putting them in the "small" local functions, where applicable (ie where a variable does not cross state boundaries).

Such small functions are also candidates for dead-code-elimination - ie if it turns out that a particular state never is used due to inlining, constant propagation and other optimization, that state can be removed entirely instead of still taking up space - this happens commonly in generic code and/or macro-generated code.

In the example, there is of course little benefit, but closure iterators are the mechanism that power chronos and other async implementations, where code is often much more complex, leading to very large state transition functions.

Alternatives

No response

Examples

No response

Backwards Compatibility

No response

Links

No response

yglukhov commented 5 months ago

I wanted to do that a while ago, and still think I might, but such change will potentially break existing code in 2 ways:

  1. Closure env :state field will change from int to function pointer. It's a low level abi change, that anyone will hardly notice, and whoever relies on it (like me) will likely be ok with the breakage.
  2. Optimizing locals out of the env will break existing code that takes addresses of such locals and relies on the fact that it will live through the yield. With this optimisation it might not. This is a more critical, "user-space" breakage. We could unoptimize vars of which address is taken, but it's a hacky heuristical complication.
arnetheduck commented 5 months ago

Well, 1. is not defined behavior, is it? ie the internal fields are internal for a reason, no?

re 2., this feels like it falls into UB territory, and is dubious at best to begin with - is it documented or just an implementation detail that some have started using? basically, the address of a local is never guaranteed to be unique or not to be overwritten, why would one expect that in a closure environment? consider that if you have a local whose liveness is bounded, one is typically allowed to reuse that memory location for a different local as long as the liveness analysis shows no overlap - there is no expectation in general that memory addresses that are not on the heap will be stable - just like you wouldn't return the address of a local from a function. Also, the addr thing, why is it hacky? It simply expands the liveness to "undefined" which pessimistically means it can be moved to the env..

yglukhov commented 5 months ago
  1. Sure
  2. Thinking more about it, it should be fine for the most part. Reads-into-buffers imply that the buffer is used after yield, and thus survives. However there are still cases like this, which is on the edge of memory safety. Pointers in async are sometimes the way to go, because async doesn't allow passing by var. addr analysis is hacky because it will be done only within the closure iter, while addr op itself can be in the callees, and again addr doesn't automatically mean that it is expected to live through the yield, but all addrs will be unfairly unoptimized. But in the end I agree that probability of old code failing is pretty low, and I would gladly neglect it.

EDIT: Memory-unsafe example is safe. So now I'm not aware of any live code that will fail.

arnetheduck commented 5 months ago

while addr op itself can be in the callees,

I think in nim, we generally assume that if something is passed by var into a function (and not by ptr), there is no lifetime expectation of the address of the var beyond that function call, ie storing addr inside a function of a var parameter is UB beyond that one call.

Araq commented 5 months ago

Correct.

mratsim commented 3 months ago

When experimenting with manual CPS, I've looked into the storage problem, it's a raw but this works, except with JS backend: https://github.com/nim-works/cps/blob/7e6ed65/talk-talk/manual1_stack.nim

  1. Every yield section of a CPS-ed functions or closure iterator gets it's own env object

    Env_foo_0 = object
      n_gensymmed: int
      i_gensymmed: int
    
    Env_foo_1 = object
      n_gensymmed: int
      i_gensymmed: int
    
    Env_foo_2 = object
      n_gensymmed: int
      i_gensymmed: int
      j_gensymmed: int
  2. The full state is stored in an {.union} to guarantee that space is the biggest state used. For compat with JS this can be changed to a case object, but the states and transitions are known at compile-time so it's not really needed.

    # This is an object which is large enough to hold any of the above, and is
    # used for the initial allocation.
    
    Env_foo_storage {.union.} = object
      stor_Env_foo_0: Env_foo_0
      stor_Env_foo_1: Env_foo_1
      stor_Env_foo_2: Env_foo_2
      stor_Env_foo_3: Env_foo_3
      stor_Env_foo_4: Env_foo_4
      stor_Env_foo_5: Env_foo_5
      stor_Env_foo_6: Env_foo_6
      stor_Env_foo_7: Env_foo_7
  3. The continuation (or closure iterator), stores the function and the envs storage
    C = object
      fn: proc(c: var C) {.nimcall.}
      storage: Env_foo_storage
  4. Each segment restores env variables, does processing, and swap the function call to the next one
    # Original function
    proc foo(n: int) =
    var i = 0
    while i < n:
      # sleep(1)
      var j = 0
      while j < n:
        echo i, " ", j
        # sleep(1)
        inc j
      inc i
    var j = "done"
    # sleep()
    echo j
  template injectVar(T, id: untyped) =
    template id(): untyped = (c.storage.`stor _ T`.`id gensymmed`)

  proc foo_0(c: var C) =
    injectVar(Env_foo_0, n)
    injectVar(Env_foo_0, i)
    i = 0
    c.fn = foo_1

  proc foo_1(c: var C) =
    injectVar(Env_foo_1, n)
    injectVar(Env_foo_1, i)
    if i < n:
      c.fn = foo_2
      noop(c)
      return
    c.fn = foo_6

  proc foo_2(c: var C) =
    injectVar(Env_foo_2, n)
    injectVar(Env_foo_2, i)
    injectVar(Env_foo_2, j)
    j = 0
    c.fn = foo_3

And interesting part is that if the continuation/closure:

You can allocated it fully on the stack (called halo - heap allocation elision in Clang++). And if you allocate it fully on the stack, the C compiler can do constant propagation and folding: