disruptek commented 3 years ago

It's an open secret that I've some criticisms of async/await:

Fear of Threads

From the perspective of a newcomer who is looking to transfer existing knowledge or code from another language to Nim, or someone who has zero experience with concurrency in any language, I recommend threads.

Here follows the first example from the threads.nim documentation. I tweaked the syntax a touch -- it was written in 2011 -- but that is all.

import locks

type Payload = tuple[a, b: int]

var
  thr: array[4, Thread[Payload]]
  L: Lock

proc threadFunc(interval: Payload) {.thread.} =
  for i in interval.a .. interval.b:
    withLock L:
      echo i

initLock L

for i in 0 .. thr.high:
  createThread(thr[i], threadFunc, (i*10, i*10+5))
joinThreads thr

deinitLock L

Here's the compilation output:

$ nim c --gc:arc --define:danger --threads:on -f th.nim
.....CC: stdlib_assertions.nim
CC: stdlib_io.nim
CC: stdlib_system.nim
CC: stdlib_locks.nim
CC: th.nim
Hint: 22456 lines; 0.121s; 20.68MiB peakmem; Dangerous Release build; proj: /home/adavidoff/th.nim; out: /home/adavidoff/th [SuccessX]

Here's the codegen:

Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                                5             57             54           3733

The resulting binary is 52,472 bytes in size and links against:

-rwxr-xr-x. 1 root root 147232 Nov  2 22:20 /lib64/libpthread-2.32.so

Here's an invocation:

0.00user 0.00system 0:00.00elapsed 146%CPU (0avgtext+0avgdata 1592maxresident)k
0inputs+0outputs (0major+103minor)pagefaults 0swaps

A quick run of golden:

┌────────┬──────────┬──────────┬──────────┬──────────┐
│ Runs   │ Min      │ Max      │ Mean     │ StdDev   │
├────────┼──────────┼──────────┼──────────┼──────────┤
│  10000 │ 0.000266 │ 0.009509 │ 0.000391 │ 0.000120 │
└────────┴──────────┴──────────┴──────────┴──────────┘

And a valgrind, which probably means nothing.

==1915== HEAP SUMMARY:
==1915==     in use at exit: 0 bytes in 0 blocks
==1915==   total heap usage: 33 allocs, 33 frees, 3,164 bytes allocated
==1915== 
==1915== All heap blocks were freed -- no leaks are possible

Macro Complexity

or, if you think you're smart,

Poor Documentation

This is my port of the above program to async/await:

import asyncdispatch

type Payload = tuple[a, b: int]

var
  thr: array[4, Future[void]]

proc asyncFunc(interval: Payload) {.async.} =
  for i in interval.a .. interval.b:
    echo i

for i in 0 .. thr.high:
  thr[i] = asyncFunc (i*10, i*10+5)

await all(thr)

Here's the compilation output:

$ nim c --gc:arc --define:danger -f aw.nim
....................................
/home/adavidoff/aw.nim(15, 7) template/generic instantiation of `await` from here
/home/adavidoff/nims/lib/pure/asyncmacro.nim(144, 3) Error: 'yield' only allowed in an iterator

Oops. I didn't import asyncmacro and I never used yield. Curious.

A quick look at the documentation for asyncdispatch yields a single occurrence of "asyncmacro"; in fact, it's the last sentence of the documentation, under a section helpfully titled "Limitations/Bugs": asyncdispatch module depends on the asyncmacro module to work properly.

I have no idea what this means. As far as I can tell, the asyncdispatch module depends on the asyncmacro for compilation failures. 😁

But I know the problem must be in await, right? There's no documentation for the await template and the documentation that does exist for using await is remarkably unhelpful.

Take a critical read of the documentation for asyncdispatch some time. It's worthy of another long thread, so to speak, because it's pretty hard to understand.

Thankfully, there are experienced users on IRC that point out that I should use waitFor instead of await. I remember reading about waitFor but since it merely said it blocks the current thread, I figured it wasn't what I wanted at all. I feel stupid, now.

But at least I can compile my program:

$ nim c --gc:arc --define:danger -f aw.nim
/home/adavidoff/nims/lib/pure/asyncdispatch.nim(1286, 39) Hint: passing 'adata.readList' to a sink parameter introduces an implicit copy; if possible, rearrange your program's control flow to prevent it [Performance]
CC: stdlib_assertions.nim
CC: stdlib_io.nim
CC: stdlib_system.nim
CC: stdlib_math.nim
CC: stdlib_strutils.nim
CC: stdlib_options.nim
CC: stdlib_times.nim
CC: stdlib_os.nim
CC: stdlib_heapqueue.nim
CC: stdlib_deques.nim
CC: stdlib_asyncfutures.nim
CC: stdlib_monotimes.nim
CC: stdlib_nativesockets.nim
CC: stdlib_selectors.nim
CC: stdlib_asyncdispatch.nim
CC: aw.nim
Hint: 56742 lines; 0.558s; 76.172MiB peakmem; Dangerous Release build; proj: /home/adavidoff/aw.nim; out: /home/adavidoff/aw [SuccessX]

Here's the codegen:

Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                               16            236            182          10840

The resulting binary is 125,704 in size and links against:

-rwxr-xr-x. 1 root root  147232 Nov  2 22:20 /lib64/libpthread-2.32.so
-rwxr-xr-x. 1 root root 1263504 Nov  2 22:21 /lib64/libm-2.32.so
-rwxr-xr-x. 1 root root   35080 Nov  2 22:21 /lib64/librt-2.32.so

Here's an invocation:

0.00user 0.00system 0:00.00elapsed 93%CPU (0avgtext+0avgdata 1948maxresident)k
0inputs+0outputs (0major+99minor)pagefaults 0swaps

A quick run of golden:

┌────────┬──────────┬──────────┬──────────┬──────────┐
│ Runs   │ Min      │ Max      │ Mean     │ StdDev   │
├────────┼──────────┼──────────┼──────────┼──────────┤
│  10000 │ 0.000242 │ 0.013270 │ 0.000323 │ 0.000190 │
└────────┴──────────┴──────────┴──────────┴──────────┘

And a valgrind, which probably means nothing.

==653== HEAP SUMMARY:
==653==     in use at exit: 736 bytes in 12 blocks
==653==   total heap usage: 43 allocs, 31 frees, 3,056 bytes allocated
==653== 
==653== LEAK SUMMARY:
==653==    definitely lost: 256 bytes in 4 blocks
==653==    indirectly lost: 192 bytes in 4 blocks
==653==      possibly lost: 288 bytes in 4 blocks
==653==    still reachable: 0 bytes in 0 blocks
==653==         suppressed: 0 bytes in 0 blocks
==653== Rerun with --leak-check=full to see details of leaked memory

CPS

This example uses a so-called untyped implementation from August; it's a bit more verbose because it demonstrates the machinery explicitly, in case you are new to continuation passing style.

The newer typed implementation is better, in terms of both syntax and semantics (read: performance and performance), but it is currently blocked by at least one compiler bug.

import cps

type
  Payload = tuple[a, b: int]
  C = ref object of RootObj   # a continuation
    fn: proc(c: C): C {.nimcall.}

proc dispatch(c: C) =         # a dispatcher
  var c = c
  while c != nil and c.fn != nil:
    c = c.fn(c)

proc cpsEcho(i: int): C {.cpsMagic.} =  # i/o
  echo i
  return c

proc cpsFunc(interval: Payload) {.cps: C.} =
  var i: int = interval.a
  while i <= interval.b:
    cps cpsEcho(i)
    inc i

for i in 0 .. 3:
  dispatch cpsFunc((i*10, i*10+5))

Here's the compilation output:

$ nim c -r --define:danger --gc:arc -f cp.nim
# ...lots of debugging output omitted...
CC: stdlib_io.nim
CC: stdlib_system.nim
CC: cp.nim
Hint: 40002 lines; 0.455s; 61.277MiB peakmem; Dangerous Release build; proj: /home/adavidoff/git/cps/cp.nim; out: /home/adavidoff/git/cps/cp [SuccessX]

Here's the codegen:

Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                                3             38             35           3106

The resulting binary is 46,000 bytes in size and links against no novel libraries. Here's an invocation:

0.00user 0.00system 0:00.00elapsed 94%CPU (0avgtext+0avgdata 1772maxresident)k
0inputs+0outputs (0major+87minor)pagefaults 0swaps

A quick run of golden:

┌────────┬──────────┬──────────┬──────────┬──────────┐
│ Runs   │ Min      │ Max      │ Mean     │ StdDev   │
├────────┼──────────┼──────────┼──────────┼──────────┤
│  10000 │ 0.000077 │ 0.011548 │ 0.000267 │ 0.000145 │
└────────┴──────────┴──────────┴──────────┴──────────┘

And a valgrind, which probably means nothing.

==32426== HEAP SUMMARY:
==32426==     in use at exit: 0 bytes in 0 blocks
==32426==   total heap usage: 29 allocs, 29 frees, 2,200 bytes allocated
==32426== 
==32426== All heap blocks were freed -- no leaks are possible

Performance

Obviously, only the threaded version is actually asynchronous. 😁

And, I mean, sure, this isn't a real program. So why is it so large?

The async/await version generated ~3x more code.
The async/await version's binary is 2.5x the size.
The threaded version is slower despite using 50% more CPU.

And CPS...

Has no platform-specific code, no extra libs, and the least C output.
Has no lock, no yield, void, or discard novelties, no gcsafe, no var or closure limits, etc.
Has both a higher upper-bound to performance and less apparent jitter despite using the same amount of CPU.
Requires compilation of twice as many lines of Nim (versus threads) and is ~4-5x slower to do so.
Has lower memory usage here, though the typed implementation of CPS has fewer allocations still. In fact, more optimization is possible but we currently favor supporting fewer continuation types instead -- laziness, really.
Should be comparable in terms of readability; something like this example with a proc macro:

import cps

type Payload = tuple[a, b: int]

proc cpsFunc(interval: Payload) {.cps.} =
  for i in interval.a .. interval.b:
    echo i

for i in 0 .. 3:
  cpsFunc (i*10, i*10+5)

or this version with a callsite pragma:

import cps

type Payload = tuple[a, b: int]

proc cpsFunc(interval: Payload) =
  for i in interval.a .. interval.b:
    echo i

for i in 0 .. 3:
  cpsFunc((i*10, i*10+5)) {.cps.}

or this version using an effect:

import cps

type Payload = tuple[a, b: int]

proc cpsFunc(interval: Payload) {.tags: [Continue].} =
  for i in interval.a .. interval.b:
    echo i

for i in 0 .. 3:
  cpsFunc (i*10, i*10+5)

...which brings me to...

Composition

threads:on

I can't say that thread support under async/await is arcane, exactly, but it's clearly not a natural extension of the design the way it is with CPS.

color:on

If you aren't familiar with function color, you'll enjoy the following:

https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/

leaks:on

Look, let's just assume for the sake of argument that async/await is leak-free under ORC; though by the time you read this, there may be another leak. Or maybe there isn't. It doesn't matter. I don't care, and neither should you.

gc:orc

Except that the position that "it's ORC's job to be leak-free under async" is silly. We don't want to have to choose between deploying new change-the-world features that improve virtually every aspect of the language... and supporting a concurrency implementation (even if the code were good).

We want everything to work, all the time, everywhere, right? Right?

asyncInside™

This async implementation, which is either stable (except for bug fixes) or stagnant ("feature-complete"), depending on who you ask, should not be the gatekeeper for compiler development. This is an argument against tying the implementation into the compiler. In fact, this tie was broken with chronos.

asyncOutside™

So, yeah, now the community has to support two different async implementations which largely suffer the same flaws, are incompatible in subtle ways, and of which neither are reliable under the best memory manager for Nim.

except:

Exceptions are handled in separate and unequal ways...

threads: Register a handler = proc() callback.
CPS: Catch them in your dispatcher, what else?
async/await: I mean, I can't even... Maybe someone else wants to try?

What CPS `is`

CPS is research-grade; I would say it's above alpha-quality.

CPS is pure Nim. It probably compiles as slowly as it does because it uses concepts, which allow you to specify the details of your continuation implementation and those of your dispatcher. Everything is exposed to you, and the deep access you have allows for novel forms of control flow, concurrency, scope, and abstraction.

What CPS `isnot`

I'm coming up on the 2 year anniversary of my async/await divorce 💔: https://irclogs.nim-lang.org/29-04-2019.html#16:31:55

CPS is not CSP (communicating sequential processes), but it starts to give us the bones with which we might build such an elegant beast.

To wit, I think I'm prepared to help with a CSP implementation to rival that of Go's channels and Clojure's async.core, but as @Zevv likes to point out, it'd be a lot easier to implement CSP if we had a solid CPS working.

git blame

Threads

I'm really happy with threads. They work, they are cheap from a cognitive perspective, the memory semantics are clear, and they are easier to use in Nim than I expected. I regret not using threads for all my concurrent code.

The implementation is simple, it seems pretty stable, and threads will clearly need to compose with future concurrency implementations for both hardware and software reasons.

Async/Await

As @Araq says, "I don't care."

This is not the Nim concurrency story I want to tell to newcomers, and I think if you're honest, it's not the one you want to share, either. Never has been. It does not do justice to the elegance of the language, its platform support, or its performance. It doesn't even have useful documentation.

It's kind of like the Porsche gearbox -- charming or even cute, from a technical perspective, but to be pragmatic: easily the worst part of the car.

CPS

@Araq shared a paper.
@Zevv implemented it.
@Araq asked @disruptek to work on it.

If it makes you feel better, you can blame me for believing CPS is a game-changer for Nim and you can blame me for criticizing async/await. Please first spend a couple years helping (or even just watching) innumerable people using software that is an objectively poor choice from the perspective of performance, documentation, simplicity, and compromise. Now you can also blame me for sending too many people down the wrong path.

Action Items

I've tried to convey a tone of authority but don't take it too seriously -- that's just how I talk about facts. 😉

I know how we implemented CPS (and in particular, what sucks about it) but I really don't know much about how it should be exploited. I've tried to capture some ideas in the repository, but @Zevv has done most of the design work and has a better grasp of the sorts of novel behaviors that CPS enables.

Most of my experience with coroutines is limited to Python, where they were reliably a huge PITA. Go was great. Anyway, please take a look at the code:

https://github.com/disruptek/cps

Look at the examples (in stash). Look at the tests. Read the papers (the 1011 one is a thriller, I promise). Get excited about this: it's not science fiction; it already works, today, and it can work even better tomorrow.

You are going to invent new ideas on how we can exploit this style of programming. You are going to see problems with our work (especially @Zevv's) that you can report or fix. You may even be motivated to improve async/await. All to the good.

You are going to complain about design choices and demand that they are fixed. Bring it on. This software is ready to be taken further while there are no users to worry about breaking. @Zevv has even promised to do some typing for you. At least, I think that's what he said. 🤔

I believe @Araq would prefer that CPS was implemented in the compiler. I don't know if he is motivated by the fear that it will be stolen if it is exposed in a library, or whether he simply wants the implementation to be unfettered by limitations in macro spec or compilation bugs. I personally think these are valid concerns, and of course there are probably other good reasons to target the compiler directly, as well.

Problem is, more users are intimidated by compiler code, and they have every justification to feel this way. Most of the compiler is undocumented and new code rarely receives comments. It's hard to know where to begin and embedding CPS in the compiler will cause it to require at least some compiler expertise to service -- a commodity in poor supply.

Limiting CPS to the compiler may also make it less accessible to users of earlier versions of Nim. It may even fracture CPS support across different versions of the compiler. If you agree that CPS should not be implemented in the compiler, please give the RFC a 👎.

While I feel CPS is quite powerful, I'd like the typed implementation to work, as it promises a much better experience. I believe this is a practical prerequisite of future work, so building a similarly powerful abstraction for CSP is blocked until we choose the next step for CPS.

Araq commented 3 years ago

he simply wants the implementation to be unfettered by limitations in macro spec or compilation bugs.

Yes, that's why. While I have many (good?) ideas how to improve Nim's macro system to be easier to use, more stable, etc etc, I think async/CPS/concurrency is so important that you might as well put it into the language directly, or pretend that it is "just a macro" that delegates most of its work to a compiler transformation.

However, that didn't work well for spawn at all and there are rumors that Nim's closure iterators can be better as a library (@timotheecour ) so it's probably a good idea to keep it as a macro. For a prototype implementation that's even more true.

While I feel CPS is quite powerful, I'd like the typed implementation to work,

It's not only your feeling, it's true. Nim's untyped macros do not compose, typed ones do. We're moving to typed macros but currently these are less developed.

Araq commented 3 years ago

Except that the position that "it's ORC's job to be leak-free under async" is silly.

With 1.4.2 all known leaks and crashes have been fixed but it remains true that:

ORC is full of heuristics that can go terribly wrong. (Memory consumption: 90MB, added a GC_fullcollect in a loop, memory consumption: 7MB.) I expect it's rather easy to improve this though.
ORC has no "cost" model.
ORC can be awesome, overhead can be unmeasurable.
Callback heavy code is hard to work with under arc/Swift ("what's a 'weak' self? Why do I need to care about it?"). Because ARC lacks the cycle collector.

However, these facts change nothing: Ideally our concurrency story doesn't depend on a cycle collector.

disruptek commented 3 years ago

Ideally our concurrency story doesn't depend on a cycle collector.

I think you're agreeing with me, but to be clear, I'm not speaking of CPS, which was designed from the outset to exploit gc:arc. I'm speaking of async/await, which perhaps due to its position in the standard library, promotes the expectation that it should work without fault, especially when gc:orc becomes the default. For all their "complexity", threads don't present anywhere near the same challenges.

Ideally, our memory story doesn't depend on a concurrency implementation.

I don't see anyone volunteering to bring async/await to the same standard of quality under gc:orc as the rest of the standard library. If faced with deprecating it or delaying gc:orc as the default, well... I'd rather enable easier integration of Nim into foreign codebases than carry a flawed concurrency model that turns off newcomers before they even attempt FFI. But I'm open to a third way.

FWIW, gc:arc narrows the performance delta between threads/async/cps in the above examples. Threads and async/await are slower under gc:refc while cps seems unaffected.

Araq commented 3 years ago

For all their "complexity", threads don't present anywhere near the same challenges.

I agree completely, threads have 3 problems:

Overhead of stack memory management. (Cannot have millions of them easily).
Sharing introduces data races.
Locks can produce deadlocks.

I know how to solve 2-3, but 1 is a nasty problem that made us look into async back then.

zevv commented 3 years ago

Threads do not offer the choice of doing asynchronous I/O on a single thread - making sure that no threads are involved can make code much easier to reason about.

Araq commented 3 years ago

I don't see anyone volunteering to bring async/await to the same standard of quality under gc:orc as the rest of the standard library. If faced with deprecating it or delaying gc:orc as the default, well...

Getting our async to work with --gc:orc was a painful exercise but I consider it completely done now. And it doesn't delay making gc:orc the default, not anymore.

disruptek commented 3 years ago

Threads do not offer the choice of doing asynchronous I/O on a single thread - making sure that no threads are involved can make code much easier to reason about.

Threads, in fact, do offer the choice of doing async I/O on a single thread. If you don't want to reason about multiple threads, sure, don't use multiple threads. But I certainly take issue with the assertion that reasoning about our async/await dispatcher is somehow easier. There's lots of evidence that this is not the case at all.

To put that another way, no one doing advanced concurrency is going to choose the async/await dispatcher over threads because it's easier, and that's also not merely a feeling, it's truth. :grin:

There is a sweet spot for async/await, and it's in memory mutability. Not having to use locks is a win in some cases. But note that we didn't need the locks in the above example, either, and I would argue that locks generally make threads easier to reason about, not harder.

dom96 commented 3 years ago

To put that another way, no one doing advanced concurrency is going to choose the async/await dispatcher over threads because it's easier, and that's also not merely a feeling, it's truth. 😁

What does advanced concurrency here mean?

As far as I'm concerned async/await is much easier than threads in Nim right now for practically all use cases involving IO, but that can be attributed to the thread-local heaps requirement and all the associated mechanisms to make that work (gcsafety). The evidence of this is in terms of usage, if threads were really easier then people would be using them for this purpose, but I know of very few that do.

disruptek commented 3 years ago

What does advanced concurrency here mean?

It means concurrency that is a design requirement and not merely a convenience.

As far as I'm concerned async/await is much easier than threads in Nim right now for practically all use cases involving IO, but that can be attributed to the thread-local heaps requirement and all the associated mechanisms to make that work (gcsafety).

I did say going -- we're talking about the gc:orc future we all aspire to -- but it sounds like you're agreeing with me. I await the much-needed gcsafe RFC.

The evidence of this is in terms of usage, if threads were really easier then people would be using them for this purpose, but I know of very few that do.

How are you measuring this?

juancarlospaco commented 3 years ago

CPS use Concepts, Concepts are Experimental, Async/await is not Experimental, therefore Concepts must be made to Stable before CPS replaces Async/await.

zevv commented 3 years ago

CPS does not strictly need concepts, my initial implementation worked fine without. It's just part of disrupteks implementation but not essential.

juancarlospaco commented 3 years ago

ORC/ARC Leaks is not fault of Async/await.

CPS will not replace Threads, people may want to use Threads anyway.

Do you own a Porsche?.

juancarlospaco commented 3 years ago

How does a {.multisync.} function with CPS looks like ?.

disruptek commented 3 years ago

Whether something is asynchronous or not is a function of the dispatcher you define, so you either supply a dispatcher like the one above -- effectively .sync. -- or we will supply (or generate) one for you.

juancarlospaco commented 3 years ago

Then {.multisync.} must start its Deprecation process soon (?). Or there will be a macro that adapts {.multisync.} code to CPS somehow (?).

disruptek commented 3 years ago

Why do you think async/await should be deprecated?

disruptek commented 3 years ago

To put that more accurately; you can run the same CPS code with any dispatcher you want, at any time. When you invoke your continuation, you choose which dispatcher dispatches it -- in the same way that you may choose between waitFor and await -- so you could have a dozen different shades of asynchronicity if you really wanted. You could even use async/await for dispatch, or something that is purely thread-based. So there's no technical reason that CPS cannot compose with the existing concurrency methods.

mratsim commented 3 years ago

@Araq

However, that didn't work well for spawn at all and there are rumors that Nim's closure iterators can be better as a library (@timotheecour ) so it's probably a good idea to keep it as a macro. For a prototype implementation that's even more true.

While I do agree it is because of other non-technical reasons, I don't think we can conclude that from technical reasons alone.

Nim is a very young language, compiler developers and developers in general cannot know everything and in particular what kind of new patterns will emerge when a language introduces features that were never seen before, in Nim the macro and pragma system. Case in point, even in Rust the compiler devs use to provided an encode/decode serialization library in the standard library but the traits+macro system of Rust enabled a high-performance, very-convenient API serialization library called Serde that became the defacto standard. I've already said this but as the community grows, the standard library should have a process to deprecate previously core libraries to external packages and ultimately if necessary only provide an API so that API "producers" and API "consumers" are compatible with each other.

There are many domains where there is no unique way to address an issue https://github.com/nim-lang/RFCs/issues/247#issuecomment-680997258 @mratsim

Marshal/serialization/deserialization

Crypto API (initContext, hash, sign, verify)

Streams

Async

Threads/threadpools/executors

Coroutines

Unittests

Logging

BigInt

Database

This might need more fleshed out concepts/interfaces though we can use duck typing with generics https://github.com/nim-lang/RFCs/issues/243, https://github.com/nim-lang/RFCs/issues/168

@Araq

It's not only your feeling, it's true. Nim's untyped macros do not compose, typed ones do. We're moving to typed macros but currently these are less developed.

Are people using typed macros? My strategy when being handed over typed AST despite asking for untyped (thanks generics) is to rebuild the untyped AST because it's way easier to handle.

For me the one thing that makes macro hard to compose is the type system: https://github.com/nim-lang/RFCs/issues/44, https://github.com/nim-lang/Nim/issues/8677

@Araq

For all their "complexity", threads don't present anywhere near the same challenges.

I agree completely, threads have 3 problems:
1. Overhead of stack memory management. (Cannot have millions of them easily).

2. Sharing introduces data races.

3. Locks can produce deadlocks.
I know how to solve 2-3, but 1 is a nasty problem that made us look into async back then.

I solved that in Weave and you can have trillions of tasks each with their own stack https://github.com/mratsim/weave/blob/e5a3701/weave/datatypes/sync_types.nim#L18-L54

To solve this in a portable way I introduced the following limitations:

only 144 bytes of user data per task, 256 bytes max for the task (the same as Erlang actors AFAIK). This means 18 integers/pointers captured at most per task which seems very reasonable and you can always malloc for more.
a multithreaded memory pool based on the latest memory manager research (Mimalloc and Snmalloc) with a particular focus on handling producer-consumer scenarios i.e. memory allocated in a thread and released in another, the usual nightmare of memory allocator. In particular this memory pool provides a heartbeat to schedule expensive maintenance (shrinking the pool) in a deterministic and amortized way.
a lookaside buffer that provides an extra caching layer so that memory allocated from other threads is reused as much as possible instead of stressing the memory allocator.
Write-up: https://github.com/mratsim/weave/tree/e5a3701/weave/memory
Exploration of the cactus stack problem: https://github.com/mratsim/weave/blob/e5a3701/weave/memory/multithreaded_memory_management.md
Go also struggled with this problem: https://docs.google.com/document/d/1wAaf1rYoM4S4gtnPh0zOlGzWtrZFQ5suE8qr2sD8uWQ/pub
As did Rust: https://mail.mozilla.org/pipermail/rust-dev/2013-November/006314.html
As did Cilk which required a different calling convention than C, some of the solutions there required forking the Linux kernel
Fibril is very interesting, the only solution with even less overhead than Weave and boils down to Fiber, it was not chosen due to assembly, the need of properly handling all calling conventions and a "conviction"-based choice of being channels all the way down but the resulting code is very concise and easy to maintain as well as being high performance: https://github.com/chaoran/fibril and would be my go to for a multithreaded async framework provided I didn't hear about CPS.

@disruptek

To put that another way, no one doing advanced concurrency is going to choose the async/await dispatcher over threads because it's easier, and that's also not merely a feeling, it's truth.

I don't agree, debugging threads is really painful in particular because of non-determinism coupled with memory/ownership issues. Until now, threads = no strings, seq or ref objects which makes it a non-starter for plenty of use-cases.

@dom96

To put that another way, no one doing advanced concurrency is going to choose the async/await dispatcher over threads because it's easier, and that's also not merely a feeling, it's truth. grin

What does advanced concurrency here mean?

As far as I'm concerned async/await is much easier than threads in Nim right now for practically all use cases involving IO, but that can be attributed to the thread-local heaps requirement and all the associated mechanisms to make that work (gcsafety). The evidence of this is in terms of usage, if threads were really easier then people would be using them for this purpose, but I know of very few that do.

I don't agree either, threads in Nim have even less documentation than async/await. Also the facilities are extremely brittle: threadpool, spawn and the parallel statement only work for some cases but trying to go beyond is fraught with peril. I read the channels.nim implementation and I don't understand what's happening with RTTI as well. What actually works is using raw createThread call and doing your threading and threadpool C style. In that case I must admit that Nim handling of the base OS primitives is excellent and it's much more comfortable to use than in C where you need to deal with Posix/Windows differences. Regarding the documentation, there is a lot more code, for example jester or httpbeast and even your book that people can use as a reference for async/await. There is no equivalent for threads.

I am quite impressed and very excited about your work on CPS, that is really awesome.

Looking into the type:

type
  Payload = tuple[a, b: int]
  C = ref object of RootObj   # a continuation
    fn: proc(c: C): C {.nimcall.}

I assume those are ref objects only now but in the future they will become Isolated https://github.com/nim-lang/RFCs/issues/244 otherwise they can't be moved across threads and a continuation should be owned by a single thread anyway right?

Also one of the main issue with Asyncdispatch and Chronos right now is that they require all Futures to be waited and manually handled. It is very easy to introduce future leaks and resource leaks by introducing a timeout and then forget to dispose properly of the cancelled future, i.e. we need to do manual future management. Is there a sketch of how would cancellation look like with CPS?

Relevant: https://forum.nim-lang.org/t/7065#44457
In particular this article where Rust choose a completely different approach from others with "poll-based" futures: https://aturon.github.io/blog/2016/09/07/futures-design/

zevv commented 3 years ago

@mratsim : "Is there a sketch of how would cancellation look like with CPS?"

CPS itself does not know about cancellations or timeouts, it only provides the building blocks to make continuations from flat and plain Nim code. You can simply choose not to call a continuation, but instead throw it away, which will effectively "cancel" your flow.

Cancellation and timeouts would be a concept of the asynchronous I/O abstraction implementations built on top of CPS.

dom96 commented 3 years ago

@dom96

To put that another way, no one doing advanced concurrency is going to choose the async/await dispatcher over threads because it's easier, and that's also not merely a feeling, it's truth. grin

What does advanced concurrency here mean?

As far as I'm concerned async/await is much easier than threads in Nim right now for practically all use cases involving IO, but that can be attributed to the thread-local heaps requirement and all the associated mechanisms to make that work (gcsafety). The evidence of this is in terms of usage, if threads were really easier then people would be using them for this purpose, but I know of very few that do.

I don't agree either, threads in Nim have even less documentation than async/await. Also the facilities are extremely brittle: threadpool, spawn and the parallel statement only work for some cases but trying to go beyond is fraught with peril. I read the channels.nim implementation and I don't understand what's happening with RTTI as well. What actually works is using raw createThread call and doing your threading and threadpool C style. In that case I must admit that Nim handling of the base OS primitives is excellent and it's much more comfortable to use than in C where you need to deal with Posix/Windows differences. Regarding the documentation, there is a lot more code, for example jester or httpbeast and even your book that people can use as a reference for async/await. There is no equivalent for threads.

@mratsim

It sounds to me like you are agreeing that async/await is easier in Nim right now than threads. Just not for the same reasons I outlined. Am I understanding you correctly?

I think you're factually wrong here about documentation for threads:

My book does in fact have a full chapter dedicated to threads: chapter 6 titled "Parallelism".
httpbeast uses Nim threads, so that's at least one example library using them effectively.

I would also argue that threads are much easier to understand for newcomers, they are after all a fundamental aspect of programming. Nim threads being so low-level means that the barrier for entry isn't super high for newcomers.

Also one of the main issue with Asyncdispatch and Chronos right now is that they require all Futures to be waited and manually handled. It is very easy to introduce future leaks and resource leaks by introducing a timeout and then forget to dispose properly of the cancelled future, i.e. we need to do manual future management.

@mratsim

Couldn't the same be said about threads and/or lightweight green threads/tasks? You can very easily have a thread that has been spawned and never finishes, and get a resource leak this way. I'm curious whether Weave allows cancellation of the threads it spawns, I know that it's not typically possible to destroy a running OS thread, so how does Weave handle this?

I agree however that this is indeed an issue, it has bitten me in the past which is why I have implemented hooks in the async dispatcher to see which futures are "pending". Combined with prometheus it works insanely well for finding resource leaks.

Cancellations of futures would go a long way to help, and are definitely a necessity for some scenarios. If I had the time I would love to look into implementing them. I recall that there were plans for Chronos to get them? Did this not happen?

mratsim commented 3 years ago

It sounds to me like you are agreeing that async/await is easier in Nim right now than threads. Just not for the same reasons I outlined. Am I understanding you correctly?

Yes, I think that async/await are more likely to be used than threads not because one is easier/harder than the other but because more code/examples/tutorials exist with the former in the first place.

httpbeast does use threads but that's more the exception than the rules. As soon as you use threads, at least with the old GC, suddenly most of the standard library cannot be used as easily because strings, seq or references should be use with way more care and few example exist (while say waitFor vs await can be found) to demonstrate how. And then some critical structures people expect like threadsafe hashtables are missing (from Facebook Folly for example).

I'm curious whether Weave allows cancellation of the threads it spawns, I know that it's not typically possible to destroy a running OS thread, so how does Weave handle this?

Weave distinguishes between threads and tasks. init(Weave) and exit(Weave) will create and destroy Weave threads. Destroying Weave threads can only be done from the root thread and will block until all enqueued tasks are finished. Weave does not support cancellation of tasks. I have yet to come on that use case in CPU-bound tasks. If something needs to be interrupted, it comes either from a kernel signal, a timeout or some kind of IO/message, in that case the IO scheduler or async/await dispatcher should be the better place to handle this. If it interfaces with a threadpool or Weave it either should enqueue the task or should let it finish and properly dispose the resources afterwards.

disruptek commented 3 years ago

I don't agree, debugging threads is really painful in particular because of non-determinism coupled with memory/ownership issues. Until now, threads = no strings, seq or ref objects which makes it a non-starter for plenty of use-cases.

Well, it is now now. :grin:

But, yeah, if you understand what threads are and what they are not, then it's sorta on you if you try to shoehorn them into a role they are not suitable for.

I assume those are ref objects only now but in the future they will become Isolated #244 otherwise they can't be moved across threads and a continuation should be owned by a single thread anyway right?

You're correct that we'll want to switch to Isolated, but honestly, I'm not convinced there's a good reason that a continuation should be owned by a single thread. To put that another way, the limitations of the memory model are orthagonal to the limitations of CPS itself. We don't want to adopt these limits until it's strictly necessary.

As @zevv says, what you do with a continuation is up to you. That's the beauty of the approach; you can trivially use as your dispatcher anything that can invoke a function. And even the continuation type itself is yours to define. All we're doing is rewriting the control-flow for you and maybe giving you a little syntax sugar. The bits you actually care about are yours to implement however you want.

Cancellations of futures would go a long way to help, and are definitely a necessity for some scenarios. If I had the time I would love to look into implementing them. I recall that there were plans for Chronos to get them? Did this not happen?

I thought Chronos got cancellations like, I dunno, a year ago.

I would also argue that threads are much easier to understand for newcomers, they are after all a fundamental aspect of programming. Nim threads being so low-level means that the barrier for entry isn't super high for newcomers.

So we agree. I'm happy to move past that debate. :grin:

FWIW, I don't use threadpool, spawn, or parallel. Maybe that's why I don't have threading issues. I understand that Yuri's threadpool is "better" and "should be made stdlib" but I couldn't tell you why. I looked at channels and unless I'm missing something, they are vestigial and useless in a gc:arc world. If these extensions to threading need work or deprecation, so be it. They are hardly the only parts of the language that need attention. Having fewer holes to step in helps newcomers and old hands alike.

And, look, this isn't an effort to codeshame contributions. It's an effort to coalesce goals on where to take CPS from here so that we know where to work. It's been stagnant for months simply because it wasn't clear whether we'd be throwing effort down the drain to continue fixing typed macro bugs in the compiler versus just cheating via a transform pass.

Araq commented 3 years ago

Enough of the talking, time to coordinate

Here is my proposal: @disruptek, @mratsim, @zevv and everybody else who feels left out, please join forces and make CPS the missing part of Weave so that:

Weave supports IO heavy concurrency much like today's async frameworks. But I am not up to date with its current status -- I don't know what is missing from Weave.
Weave supports Nim's strings/seqs/refs for the --gc:orc --threads:on mode.

In addition to that, mark macro/concept/ORC bugs that are blocking the progress with a new "CPS" tag, myself and other compiler devs promise to prioritize these bugs, or at least guide you how to fix them. Well that's my personal promise, I don't speak for other compiler developers.

If you don't need static T support, the concepts that I implemented in my branch could already be very helpful, I'm willing to merge it sooner rather than later if it helps you.

olliNiinivaara commented 3 years ago

Talk is cheap, confronting all the unanticipated "critical" feature requests of end users is what makes software engineering hard (=messy). I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched C-beams glitter in the dark near the Tannhäuser Gate. Anyway count me in, where can I recruit? And if dom96 would play the devil's advocate, nothing is impossible.

dom96 commented 3 years ago

Now that the holidays are here, I finally got a chance to sit down and re-read this RFC. So there will be a bit more talk (sorry Araq), but hopefully not on this RFC (doubt it :)), rather a brand new one I have created: https://github.com/nim-lang/RFCs/issues/304.

It even takes inspiration straight from this RFC:

You may even be motivated to improve async/await.

Now to get onto the remainder of the essence of what's in this RFC. There is nothing here that shows the power of CPS, all that's here is a bunch of synthetic benchmarks comparing a highly stripped down implementation to a full-blown async IO dispatcher. This is very similar to the V programming language benchmarks. In particular their many claims of being able to compile V code faster than any language out there, when their compiler can only compile the simplest of constructs.

This causes unnecessary concern inside the community. Let's be clear, there are plenty of valid criticisms to be put against async/await and this post does list some. Many of these however are relatively minor and can be readily fixed (see https://github.com/nim-lang/RFCs/issues/304 and let us know what else you would like to see improved).

Some highlights that I wanted to pick out:

/home/adavidoff/nims/lib/pure/asyncmacro.nim(144, 3) Error: 'yield' only allowed in an iterator

Yes, documentation for async/await is definitely lacking, and my book is no replacement. These kinds of errors are also annoying, especially since this particular one is a regression, as before await was turned into a template you would have gotten a much better error. I'd love to fix this but I'm not even sure if it's possible, a template cannot know where it is invoked, can it?

And a valgrind, which probably means nothing.

Yes, async/await leaks with --gc:arc, as expected. This is why --gc:orc is a thing. It's not fair to use this as a criticism of async/await in the way that you did.

It is fair to say that --gc:arc works with CPS, but then you need to demonstrate why that is important. Why not just use --gc:orc when it's working well?

So at the end I strongly urge yourself and others to commit to improving async/await. For all its faults it is still a very good way to write IO-heavy code, and probably one of the best tested libraries in Nim's standard library. Exploring new routes and doing research on these things is important, and I would encourage you to continue if you find it fun and interesting. But there is no reason to be bringing down async/await in the process and trying to make the community feel like async/await has no future.

We are a small community, we cannot be reimplementing these things constantly and asking our users to rewrite their code on every whim unless there is a very good reason to do so. It's easy to think that your solution will be perfect, but it will undoubtedly have it's own faults which will take years to iron out, by that point someone else might come around and want to rewrite things yet again.

olliNiinivaara commented 3 years ago

Finding my previous comments a bit cryptic, let me recapitulate:

I have some use cases in mind that should enable comparison of existing and proposed implementations in terms of both ergonomics and performance:

do CPU-intensive work in parallel, but occasionally wait for user input (wait for next three mouse clicks, say)
some task hangs (infinite loop or waits for input forever). What proportion of CPU power is lost?
allow end user to write their own (sub?)dispatchers (maybe they are doing some weird embedded I/O stuff)
streaming pipeline mixture of CPU and I/O work with back pressure: receive more data than fits main memory (in chunks), operate on it (decrypt, decompress) write it to disk, create new message (compress, encrypt), write chunk to output that is slower than the input
publish/subscribe web socket server: receive a messages and send them to hundreds of thousands of subscribers that come and go at any time
enable getting 100% performance out of io_uring

In other words, I'm agnostic (and incompetent) in terms of how these things should be implemented, but I have ideas of what should be possible, and I'm willing to act as a tester.

And, now that I'm here, about the social aspects of software engineering: It is both fun and instructive to invent new algorithms, but to fix somebody else's inventions that are in production and where you might break things - it's more like a job than a hobby. Therefore I understand that people are more willing to try out their own alternatives than to fix existing stuff. But this is a small community - everyone can learn in the process and in the end everyone is just genuinely trying to make the end product better.

disruptek commented 3 years ago

Cool, dude. We've mostly been discussing this stuff in real time on the #nim irc channel. Sometimes we even deign to address discord users. :wink:

Araq commented 3 years ago

To make it clear, I see CPS as a long-term replacement for our closure iterators, our async can build on top of CPS once it's mature. I have no idea how far away CPS is from this state. At the same time I welcome contributions to our current async/await implementation, never let the perfect be the enemy of the good. Async/await is our standard and people rely on it even if others moved on to implement their own frameworks.

mratsim commented 3 years ago

Can you explain what you feel is wrong for you about closure iterators? I have my own grip with them, mostly on syntax but I'm curious about the implementer point of view since syntax can be sugar-coated.

Araq commented 3 years ago

A closure iterator is yet-another language builtin feature which is not required:

async is implemented as a complex macro transformation anyway, it might as well do the remaining state machine generation and lifting of locals. Instead we should embrace liftLocals and goto state as the fundamental primitives.
Closure iterators -- like ordinary closures -- conflate routines/callables with state management. This state is implicit and usually causes trouble for serialization. In fact, I know of no serialization framework that really supports closures. This problem is not specific to Nim, every DOM diffing system for JavaScript also faces problems with stateful event handlers that you cannot "diff" properly.

Closures didn't age all that well IMHO, they really are a poor man's object system implementation.

mratsim commented 3 years ago

I am currently dumping my research and design thoughts here:

research: https://github.com/weavers-guild/weave-io/blob/master/research/README.md
preliminary design considerations: https://github.com/weavers-guild/weave-io/blob/master/design/README.md
relevant quotes from the PhD thesis: https://github.com/weavers-guild/weave-io/blob/master/design/extracts_from_phd_thesis.md

I am evaluating approaches in other languages, namely:

C
C++
C#
Concurrent ML
F#
Kotlin
Lua
Javascript React Fiber (they introduced continuations)
OCaml
Python greenlet, stacklet and continulet
Rust async/await and futures
Rust-style futures in C
Swift Async Proposal

I have a couple of questions to restrict the design space:

Do we want this to work at compile-time? (as in call a routine/object in a static context/compile-time proc)
Do we want this to work in Javascript or WebAssembly?
Do we allow inline assembly (we only need store stack pointer and base stack pointer and jump but we need that for all platforms)?
Do we want the solution to be usable in no-alloc (or tightly controlled alloc environment): embedded, kernel, schedulers & memory managers?
Does the {.goto.} pragma work in javascript?
What overhead over a function call is acceptable when calling a suspendable "routine" (function, continuation, coroutine, fiber, green thread, ...)?
How many do we want to allow on something that can only address 4GB (32-bit)? (i.e. what size is acceptable for the "function object"?)
What examples should we use to showcase the new development?
Do we want the routine to be resumable from any thread?
Are there things missing from closure iterators that we want?
Should we support cancellation for that primitive? See 12 below. it might also be better placed in the async/await library.
Should we be able to resume with input?

zevv commented 3 years ago

1,2,3: I think getting this to work in JS and the VM is pretty essential - if we make something that doesn't work "everywhere", we are pretty limited in other usable things built on top of this. So this would also exclude the use of arch-specific assembly.

4: No-alloc is a nice-to-have, but probably not essential. I guess that for your purposes it would be alloc-agnostic.

6: if the primary use case is doing I/O "a bit of overhead" is acceptable. If this is going to be used to implement things like iterators, no more overhead then a function call would be acceptable to be competative with normal iterators.

That totally depends on the use case. For I/O hundreds of thousands would likely be more then enough, but for more generic use cases like CSP I think having millions would make a very nice story.
For me a cool use case would be the Lua-style coroutines showing some nice producer/consumer pipe flow. Disruptek is likely to persue his CSP ideas. And of course being able to run continuations on a effective scheduler properly making use of all CPUs in a machine.

9, 10, 12: I don't know

11: No, the concept of cancellation does not exist at this level. You just "don't call" your continuation.

mratsim commented 3 years ago

That totally depends on the use case. For I/O hundreds of thousands would likely be more then enough, but for more generic use cases like CSP I think having millions would make a very nice story.

4e9 / 10e6 (assuming 10 millions on 32-bit) gives us a budget of 400 bytes of overhead per 'routine' which is a comfy size to work with.

Note that the overhead will significantly depend on the scheduling, if it's done depth-first we will have significantly less simultaneous 'routines' scheduled while if it's done breadth-first many 'routines' might be started.

disruptek commented 3 years ago

It seems silly to mess around with assembly when we can have a native Nim implementation that runs everywhere and compiles to nearly the same code. Maybe after it's stable we can optimize the C version, but I think the interesting development is pretty obviously above CPS and not below it.

There's zero additional overhead to calling a continuation, since that is in fact the same thing as calling a function. Anything more is the fault of the dispatcher. I personally don't care much about performance, since it would be very hard to make this stuff slower than async/await or threads. There's just nothing there to slow down. We're talking about AST transforms, here.

Regarding alloc-free operation, I pushed a demonstration version today that doesn't use the heap. CPS won't care, but this obviously does impact dispatch and environment construction. I'd really like to see an example use case beyond toy iterators, etc. because otherwise, it just feels like a gratuitous development constraint.

9, 10, 11, 12 ... Yes, we already have these features and we don't want to lose them. :wink:

disruptek commented 3 years ago

Remember that it's not merely the case that you can write a custom dispatcher for your continuations. You will. It will be a common step in implementing your control-flow. So, you don't have to make these choices about BFS/DFS or overhead of the scheduler right now. A typical program may have 100 different dispatchers implemented inside it; this is as common a composition as, say, a destructor. Think of the design of the dispatcher as similarly malleable and intentional to the role of the continuation...

mratsim commented 3 years ago

The past couple days, I have been playing with CPS and researched a lot what other languages do and alternative approaches, namely coroutines and fibers. For the really motivated explorers, you can have a glimpse of my research here: https://github.com/weavers-guild/weave-io/tree/master/research.

Now for those who wants to understand what's going on under the hood, this is the mandatory talk:

Kotlin coroutines under the hood https://www.youtube.com/watch?v=YrrUCSi72E8
slides: https://resources.jetbrains.com/storage/products/kotlinconf2017/slides/2017+KotlinConf+-+Deep+dive+into+Coroutines+on+JVM.pdf

In short continuations allow us to implement resumable function. When we reach a suspension point control flow is split into the current instruction and its continuation (the rest of the statements). First-class continuation support allow us to package that in a variable and resume it when we want. Other names include "callbacks" or "named gotos".

Kotlin doesn't expose continuations directly and prefer to leave devs at the level of coroutines but I think there is value, especially to wrap low-level IO to have first class continuations.

Before going full on CPS, I tried to design how a competing approach, revamped closure iterators, might look like at, i.e. they would likely look like coroutines which are resumable functions. They are more general than functions because for control flow functions do:

call = suspend the caller, jump into the subroutine.
return = terminates the subroutine, jump back to the caller.

Coroutines (of the asymmetric kind) support that and adds:

yield = suspend yourself and return to the caller
resume = caller jumps into the coroutine after the previous yield point.

The design draft is at https://github.com/weavers-guild/weave-io/blob/master/design/design_1_coroutines.md, I've merged some of the ideas from the CoroutineTS from C++ which were merged in C++20 before getting stuck at:

passing parameters to suspended functions. A conceptual block, we can do it with macros, but that is equivalent to having a function changing its signature and arguments in the middle of computation. There is no research on the impact on that, this is more of a state machine so I deemed it a bit to experimental. Signature would be for example
```
proc process(path: string,
         buf: var array[4 * 1024, byte],
         expensiveWork(buf: var array[4 * 1024, byte]){.nimcall.}): int =
let f = openFile(file)
var curIdx = 0
while true:
let numBytes = f.readBytes(curIdx, buf)
curIdx += numBytes
expensiveWork(buf, numBytes)
if numBytes == 0:
  break
return result
```

Then I looked into continuations, which are lower-level primitives than coroutines to see how things would look like. Suffice to say they show more than great promise. Language theorists have proven that continuations of the delimited kind are able to represent any kind of monads.

A compelling argument for why delimited continuations are elegant is the paper Representing Monads. Two very short one liners are enough to make any monad native to your programming language. For example your language doesn't support exceptions? Define a maybe monad and use monadic reflection to make them feel like native exceptions. Your language doesn't have mutable state? Define a state monad and use monadic reflection to make it feel like native mutable state. Your language doesn't have coroutines and requires you to write callback manually as in Node.js? Define a monad similar to F#'s async workflows (=~ continuation monad), and now it feels like your language has native asynchronicity. The same goes for monadic parsers. In short continuations can even be used in Haskell to represent mutable state.

What does it mean? For Nim there is no point to implement mutable state or exceptions with continuations (use native C or C++ primitives), however we would be able to shape all our control flow story to any model. Furthermore just like monads in Haskell, they would be composable.

Concretely, we can reimplement:

closures
closure iterators
async/await
fork/join parallelism tasks (spawn/^)
coroutines
Streams
AsyncStreams
Transform any Channel into AsyncChannel (trySend, tryReceive that tells you if you would block and if yes register a continuation on an event notifier and suspend).
Backpressure management, handling rate limiting and DOS protection. Suspended continuations can be very lightweight if done at function entry, just a function call and the parameters. If your queues/channels/buffers are at max capacity you can:
- suspend the enqueueing instead of blocking the producer, and may choose to drop those continuations (= cancel) later if you can't catch up.
- suspend the producer altogether other use cases:
- suspending web queries when we detect rate limit.
- suspending console output when the terminal chokes.
- handle logging sinks from multithreaded sources without garbling/interleaving their output.
Lexers
Parsers
State machine protocols (Wifi, bluetooth handshakes, TLS authentication and session)
Actors
Communicating Sequential Processes
Generators and iterators
Functional lazy lists / infinite lists
rewrite strutils and sequtils in a way both composable and without intermediate allocation, sonwe don't have to explain to newcomers that complains that sequtils is slower than Haskell and strutils is slower than Python that you shouldn't use map+filter or toLowerAscii+repeat+split because the temporary buffers and alloc will kill your performance. And yes there is dup and sink to reuse the buffers but you still make multiple passes over the data.
Async cancellation (we can implement exceptions with continuations after all).
Saving/Restoring any workflow: for example a form on a webpage

with an unified suspend/resume internals that would compose cleanly.

I plan to produce a design document to shape up the syntax and showcase how it can be used for those various use-case.

Now the very interesting thing is that this can be the primitive and interface that anyone using those can use: want to change the standard library parser? if it has a suspend/resume anyone can use it.

Regarding async/await which is the main drive behind those. It will not disappear. The suspend/resume are scheduler agnotisc, async will become a very short syntactic sugar to "suspend this, register a file descriptor, store the continuation in a scheduler" and await "pop the matching continuation, run it". As a transition period I will show how to reimplement closure iterators on top of continuation for a smooth transition period. I expect though that this extra overhead will be discarded and async/await dispatchers will use continuation directly. For user code, that means that there won't be asyncdispatch vs Chronos anymore but a scheduler vs another (still with its Future/Timer/Transport abstraction). Note that you can implement transport and future as thin wrapper on top of continuations as well.

Also for those into performance, I am obviously looking to make continuations:

threadsafe
zero-alloc if they don't escape their scope

I have full confidence they are suitable for Weave and would even bring significant benefits: bounded memory usage and I might even be able to drop my very clever memory management scheme/memory pools thanks to new scheduling techniques they allow me (more at: https://github.com/disruptek/cps/discussions/40#discussioncomment-241263)

So my next steps:

Get a design document up to showcase the ergonomics
Wait for someone to fix the showstopper bugs in the compiler or try to find a workaround or use an old version of CPS (but untyped :/)
Write a simple CPS-aware threadpool
Look into small example to showcase it (iterators, interleaved multithreaded generators)
Implement closure iterators on top
Implement aync/await on CPS directly
Implement streams and async streams
Look into some relevant examples (dom's book?) and translate them.

disruptek commented 3 years ago

FWIW, I gave up on untyped CPS for my cloud-based continuation scheduler. To me, it's not worth pursuing -- I would rather put the time into typed, because it's clear that typed is possible (and quite a bit more exciting and ergonomic) if we can nail down the spec.

I scanned through the Kotlin design and it's not really as elegant as our implementation, but it does seem to offer most of the semantics. Just, y'know, on the JVM and without the ability to reach in and implement your own machinery. So if you like Kotlin's version, you'll love Nim's.

This RFC would be a good place to talk about syntax and composition with other language features like isolation, effects, expr pragmas, and so on.

jido commented 3 years ago

@mratsim you can also look at my work on dodo with regards to continuations

alaviss commented 3 years ago

What will happen to CPS now that @disruptek has been banned from the community?

mratsim commented 3 years ago

What will happen to CPS now that @disruptek has been banned from the community?

I don't plan dropping CPS work, it's just that there is so little time to do everything.

dumblob commented 3 years ago

@mratsim looking at the CPS implementation in C and realizing that the continuation structure seems bigger than what Weave uses now (at least for tasks). The difference seems rather non-negligible so I'm not sure whether using CPS as the underlying "assembly" for nearly everything related to parallelism & concurrency is the best idea.

mratsim commented 3 years ago

The continuation structure should just be (function pointer, environment capture), see https://github.com/disruptek/cps/blob/e8b168f8997924e651b198588e0630d95c061918/talk-talk/manual1_stack.nim#L93-L95

type
    C = object
      fn: proc(c: var C) {.nimcall.}
      storage: Env_foo_storage

And for coroutine there is an extra promise at least and maybe an extra bool for termination (I think we can avoid it, it does solve a very unintuitive behavior of current closure iterators).

dumblob commented 3 years ago

Yep - and I'm afraid of the size of Env_foo_storage (or jmp_buf). But maybe I misunderstood something in Weave.

mratsim commented 3 years ago

Yep - and I'm afraid of the size of Env_foo_storage (or jmp_buf). But maybe I misunderstood something in Weave.

In Weave this corresponds to TaskDataSize which is an arbitrary 144 bytes (https://github.com/mratsim/weave/blob/30baced/weave/datatypes/sync_types.nim#L19).

144 bytes was chosen so that most data can be intrusive to the task without heap allocation, it's 18 ints/pointers, there is even enough space to carry 4 32 bytes hashes.
It was also chosen so that task data + the framework metadata fit in 256 bytes (Weave metadata takes about 88 bytes) as 256 is a nice power of 2 and is already quite big (4 cache lines).

As a comparison, according to https://tel.archives-ouvertes.fr/tel-01549158/document, Akka uses 600 bytes per actor and Erlang 2500 and in Pony 240 bytes

Note that theoretically Tasks can be anything as the scheduler implementation relies on the following concept/interface: https://github.com/mratsim/weave/blob/30baced/weave/datatypes/prell_deques.nim#L10-L20

type
  StealableTask* = concept task, var mutTask, type T
    ## task is a ptr object and has a next/prev field
    ## for intrusive doubly-linked list based deque
    task is ptr
    task.prev is T
    task.next is T
    # A task has a parent field
    task.parent is T
    # task has a "fn" field with the proc to run
    task.fn is proc (param: pointer) {.nimcall.}

A fixed size was chosen because for efficiency and overhead reason, a memory pool trumps malloc/free.

Unless continuations are capturing large hashes or big integers on the stack, the environment should be way smaller than Weave's. There aren't many situations were you need to pass 18+ local variables across suspension points.

dumblob commented 3 years ago

Unless continuations are capturing large hashes or big integers on the stack, the environment should be way smaller than Weave's. There aren't many situations were you need to pass 18+ local variables across suspension points.

And this is exactly what I absolutely did not anticipate. I'd expect continuations to have the environment about the same size as Akka (or bigger) on modern CPU architectures. That's why I thought continuations can't be faster than Weave. But I'm glad it's the other way around :wink:.

shayanhabibi commented 3 years ago

This RFC seemed to show a lot of promise and benefits of CPS. I have recently started using it, and now understand that a rift in the community seems to have delayed it's integration into nim? At least that's what it seems to be on the surface.

AFAIK cps would be greatly improved with compiler support; I am just a lowly user so forgive my stupidity in asking, but I am really interested in trying to advocate for any progress in CPS after using it: What is progress/the timeline of advancing CPS integration?

It does seem really daunting at first, and I have made some stupid things on it as pointed out by others, but it is an absolute joy to use and I find it to be much easier to reason about the code that I produce with it. It's also pretty bloody fast. I truly feel like I am more in control using CPS. Mostly because I can so easily tie it together with other dispatchers etc.

Thanks.

nim-lang / RFCs

next steps for CPS #295

Fear of Threads

Macro Complexity

Poor Documentation

CPS

Performance

Composition

threads:on

color:on

leaks:on

gc:orc

asyncInside™

asyncOutside™

except:

What CPS `is`

What CPS `isnot`

git blame

Threads

Async/Await

CPS

Action Items

Enough of the talking, time to coordinate

nim-lang / RFCs

next steps for CPS #295

Fear of Threads

Macro Complexity

Poor Documentation

CPS

Performance

Composition

threads:on

color:on

leaks:on

gc:orc

asyncInside™

asyncOutside™

except:

What CPS is

What CPS isnot

git blame

Threads

Async/Await

CPS

Action Items

Enough of the talking, time to coordinate

What CPS `is`

What CPS `isnot`