vlang / v

Simple, fast, safe, compiled language for developing maintainable software. Compiles itself in <1s with zero library dependencies. Supports automatic C => V translation. https://vlang.io

Proper support for distributed computing, parallelism and concurrency #1868

Closed (dumblob closed this 2 years ago)

dumblob commented 5 years ago

There is one thing in V I do care a lot about - parallelism & concurrency for seamless local and distributed programming.

Note, I didn't want to write this to https://github.com/vlang/v/issues/561 as the scope I'm describing is broader.

I'm confident every new language must provide first-class support for both parallelism and concurrency: concurrency to let one physical CPU thread do cooperative scheduling, and parallelism to allow spawning the same go/v routine on several physical threads. This is essentially what Go does (when GOMAXPROCS is set to 2 or more and the underlying operating system supports threads), except that Go always spawns just one instance of a go routine regardless of the number of physical CPU cores (which is where V should improve - see below).

Why both concurrency and parallelism and not either of them?

In the world of big.LITTLE, GPU/FPGA offloading, CPUs with tens or hundreds of physical cores (plus hyperthreading on top), mobile devices trying to save as much energy as possible, NUMA supercomputers, mobile and close-vicinity networks versus optical-fiber networks, etc., it is inevitable to schedule computation dynamically at runtime within each application - not just operating-system-wide, which is too coarse and thus inefficient, and not just cooperatively, which would use only one physical CPU core per application and makes sense only in power-saving scenarios.

We have instruction-level parallelism covered (it's not yet present in V - see e.g. my rant about restrict - but it's a "solved problem" in today's compilers). The rest of parallelism is just about "how close can we get to pure dataflow programming", which turns out to be rather difficult.

Because instruction-level parallelism is solved, the best approach nowadays seems to be to:

  1. write as much serial code as can be highly optimized by instruction-level parallelism at once in one go/v routine (typically a tight loop, one tree of nested tight loops, a slow I/O operation, offloading to GPU/FPGA, or similar) - the outermost construct of this go/v routine will always be an endless loop yielding (calling the cooperative/parallel scheduler) every N iterations

  2. then schedule these go/v routines cooperatively on one physical CPU core, with a fixed-size queue (single-producer-single-consumer, "SPSC" - for performant implementations see the discussion below) between these go/v routines (this queue is called a channel in Go); the queue size might be between the size of the L1 cache and 2*L2 cache, as suggested in paragraph 3.3 of Analyzing Efficient Stream Processing on Modern Hardware.

  3. and if any go/v routine becomes a bottleneck during computation (i.e. the consumer is slower than the producer for some time - easily monitored by the cooperative/parallel scheduler via the metric "how full are the queues"), a new instance of the slow consumer go/v routine shall be spawned (with respect to reentrancy) in a different operating system thread (i.e. on a different physical CPU core), including all the plumbing work (e.g. spawning an additional go/v routine acting as a multiplexer that takes the producer's outgoing data and multiplexes it to the original as well as the new instance of the consumer, and spawning a demultiplexer go/v routine to join the outgoing data from both consumers) - all of this would be done by the cooperative/parallel scheduler (see the sketch after this list).

  4. and if the queues become almost empty for some time, remove the multiplexers and demultiplexers again.
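
To make step 3 concrete, here is a minimal, hedged sketch (in Go, since the proposal builds on Go's model): a monitor loop watches how full the bounded queue is and spawns an extra consumer instance once the stage becomes a bottleneck. For brevity the instances share one channel; the proposal above instead keeps strict SPSC queues and inserts the multiplexer/demultiplexer routines.

package main

import (
    "fmt"
    "time"
)

func producer(out chan<- int) {
    for i := 0; ; i++ {
        out <- i
    }
}

func consumer(id int, in <-chan int) {
    for v := range in {
        time.Sleep(time.Millisecond) // simulate a slow stage
        _ = v
    }
}

func main() {
    q := make(chan int, 1024) // bounded queue ("channel") between the stages
    go producer(q)
    instances := 1
    go consumer(instances, q)

    // The monitor: the metric "how full are the queues" drives elastic scaling.
    for tick := 0; tick < 100; tick++ {
        time.Sleep(10 * time.Millisecond)
        if len(q) > cap(q)*3/4 && instances < 8 {
            instances++
            go consumer(instances, q) // spawn another instance of the bottleneck stage
            fmt.Println("queue almost full -> spawned consumer", instances)
        }
    }
}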

The cooperative/parallel scheduler referenced above shall be a user-definable routine (if none is provided, a default built-in scheduler working roughly as described above is used). This scheduler shall run preemptively (in its own thread?) and receive as input arguments at least the following: thread handles, pointers to all go/v routines and the corresponding queues between them, a pointer to the built-in scheduler (in case the user just wants to wrap it), and a pointer to user-defined metadata (e.g. statistics allowing better scheduling judgement). This would allow for really complex scheduling on embedded systems as well as on NUMA systems or any other highly dynamic systems (e.g. a smartphone could leverage being briefly connected to a cluster of other computers or smartphones and offload some slow algorithm there). See e.g. MultiQueues.
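
Restated as code, such a scheduler hook could look roughly like the Go interface below. All names here are hypothetical - V exposes no such API; this merely collects the inputs listed above into one structure.

package sched

// Hypothetical placeholder types - opaque stand-ins for runtime objects.
type ThreadHandle uintptr // an OS thread handle
type Routine struct{}     // a go/v routine
type Queue struct{}       // a queue ("channel") between two routines

// Snapshot bundles what the runtime would hand to the scheduler on each call.
type Snapshot struct {
    Threads  []ThreadHandle // handles of all OS threads
    Routines []*Routine     // pointers to all go/v routines
    Queues   []*Queue       // the queues between them (fill levels etc.)
    Builtin  Scheduler      // the built-in scheduler, in case the user just wraps it
    UserData any            // user-defined metadata, e.g. scheduling statistics
}

// Scheduler is the user-definable routine; the runtime invokes it preemptively.
type Scheduler interface {
    Schedule(s Snapshot)
}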

This way one can write an application once and deploy it to your grandma's wrist watch as well as to a supercomputer (assuming the user-level scheduler is a "bit more" clever than outlined above - imagine complexity similar to an SQL query optimizer for a distributed DB). It's also a highly failure-tolerant system (imagine Erlang, which even supports an online update of the whole application in memory while serving all clients under full load without interruption!). Also, as you may have noticed, Go does not scale, first because there is no spawning of redundant goroutines (those which become bottlenecks) onto other physical CPU cores, and second because Go doesn't have a dynamic, user-influenceable scheduler (taking into account advanced statistics, power consumption, etc.).

There is one catch though. If implemented naively, such elasticity doesn't guarantee global ordering among channels/queues (ordering holds only inside one channel), and real-life scenarios in IT unfortunately rely on ordering more often than not. This has to be accounted for and will require user intervention (i.e. it must be syntactically explicit): either as knowledge of (a) where to insert "tagging" of items before they are spilled among different channels and (b) where the "tagging" is removed and used to reassemble the original ordering; or as assigning a strict priority to each channel; or via some other scheme.
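
A hedged illustration of the (a)+(b) tagging variant in Go (names like tagged are invented): items receive a sequence number before fan-out, and a reassembly stage releases them in the original order after fan-in.

package main

import (
    "fmt"
    "strings"
    "sync"
)

type tagged struct {
    seq int
    val string
}

func main() {
    in := make(chan tagged)
    mid := make(chan tagged)
    out := make(chan string)

    // Fan out to two parallel workers; global order is lost here.
    var wg sync.WaitGroup
    for w := 0; w < 2; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for t := range in {
                t.val = strings.ToUpper(t.val) // the actual parallel work
                mid <- t
            }
        }()
    }
    go func() { wg.Wait(); close(mid) }()

    // (b) Reassemble: release items in ascending seq order, buffering stragglers.
    go func() {
        next, pending := 0, map[int]string{}
        for t := range mid {
            pending[t.seq] = t.val
            for v, ok := pending[next]; ok; v, ok = pending[next] {
                out <- v
                delete(pending, next)
                next++
            }
        }
        close(out)
    }()

    words := []string{"global", "ordering", "is", "restored"}
    go func() {
        for i, w := range words {
            in <- tagged{seq: i, val: w} // (a) tag before the fan-out
        }
        close(in)
    }()
    for v := range out {
        fmt.Println(v) // always prints the words in their original order
    }
}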

See also lock-free and wait-free data structures (state of the art as of now) which might be handy for an efficient implementation of parallelism.

IMHO we could actually postpone the implementation of concurrency and stick with just parallelism (as it's easier to implement than the interleaving of concurrency and parallelism outlined above), implement the queue and a basic scheduler, and only then (maybe even after V 1.0) get back to concurrency. As of now we have parallelism without queues and without the scheduler, so we already have a good starting point.

Table of parallelism possibilities we have nowadays (just for reference):

| parallelism kind | suited for execution duration | latency overhead | suited for how frequent execution startup | internode bandwidth quality has influence | special parallel SW architecture required | requirements for execution & deployment |
|---|---|---|---|---|---|---|
| CPU non-JIT vectorization on a CPU core | very short to long | ~none | frequent | none (it's just 1 node) | no | binary for the particular CPU |
| CPU JIT vectorization on a CPU core | short to long | low | moderate to frequent | none (it's just 1 node) | no | app source code + VM binary for the particular CPU |
| CPU cores utilization (thread, process, ...) | long | low to moderate | moderate | lower to none (it's just 1 node) | yes (except: pure objects as actors or alike) | binary for the particular CPU |
| accelerators (GPU, FPGA, ...) | long | low to moderate | moderate | lower to none (it's just 1 node) | yes (except: array/matrix/tensor/...-based languages) | binary for the particular CPU and accelerator (GPU, FPGA, ...) |
| supercomputer with NUMA topology | long | moderate | sporadic to moderate | moderate (hypercube/... is really fast) | yes (except: pure objects as actors etc., array/matrix/tensor/...-based languages) | binary for the particular CPU and accelerator (GPU, FPGA, ...) |
| LAN cluster | long | moderate to high | sporadic to moderate | moderate to high (can be fast, but not always) | yes | binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...) |
| WAN cluster | long | high | sporadic | high (basically can't be that fast) | yes | binary for at least 2 CPUs and/or at least 2 accelerators (GPU, FPGA, ...) |

(in practice these might overlap and/or be used simultaneously)

P.S. The dynamic scaling across the different kinds of parallelism outlined in the table above is called "fine-grained distributed computing". And if anything in the above proposal sounds crazy to you, I can assure you the world doesn't sleep: there is at least one seamless (fully dynamic) solution offering first-class fine-grained distributed computing - Unison.


Other things to consider and/or track:

  1. As of March 2021 the fastest publicly known (and most comprehensive & general) parallelism & concurrency backend/library is Weave
  2. use PRNG (pseudo-random number generator) which copes well with dynamic threading (i.e. doesn't suffer from "same seed leads to same random numbers in all threads") - see e.g. splittable PRNGs such as JAX
  3. maybe infinite loop in rand https://github.com/vlang/v/commit/a7c84834f4ba4948b8102a05b603bf8d51484760#r39632493
  4. g_str_buf with parallel access:
  5. FPU stuff needs special attention from the threading (work stealing) scheduler - see https://github.com/mratsim/weave/issues/163
  6. under the hood, we might utilize some wait-free and lock-free algorithms (see e.g. https://github.com/pramalhe/ConcurrencyFreaks )
  7. reconsider deadlock/livelock/no_thread_running/... detection - see https://github.com/vlang/v/issues/10340
  8. incorporate a "sharded" RwMutex into the standard library and use it also for V's built-in "sharded" data structures (map etc.)
  9. memory models of multicore architectures (x86, ARM, POWER, ...) are still sometimes not fully formally defined, but SW memory models are even worse at that :open_mouth: https://research.swtch.com/mm

gslicer commented 5 years ago

Wow, you should write a full master thesis around this topic one day ;) For me I would prioritize this topic as follows:

Prio1: provide safe multi-threading/scheduler (no data races, locks, semaphores etc. -> ponylang.io)
Prio2: provide spawning and killing of new processes for SMP, where each can be multithreaded
Prio3: provide processes for non-CPU resources like MMX/SSE, GPU/CUDA, PhysX, RayTracing, you name it
Prio4: provide distributed computing (cluster nodes over networks) -> julialang.org (of course this is something one could code even himself having good standard libs for networking in V)

dumblob commented 5 years ago

Wow, you should write a full master thesis around this topic one day ;)

I already did many years ago (wow, the time flies) - but on a (very) different topic :wink:.

Prio1: provide safe multi-threading/scheduler (no data races, locks, semaphores etc. -> ponylang.io)

I'd say let's keep things simple and actually rather provide no locks, no semaphores, etc., to "force" programmers to stick with the built-in parallelism. Of course, there will always be the unsafe module providing raw locks etc., but that has nothing to do with parallelism in V.

Prio2: provide spawning and killing of new processes for SMP, where each can be multithreaded

There shall be just one construct - imagine go from golang, but nothing else. The point is to completely forget about SMP etc. - the programmer shall actually be strongly discouraged from assuming that the platform running her app is capable of SMP, and instead encouraged to think about fully dynamic behavior.

Prio3: provide processes for non-CPU resources like MMX/SSE, GPU/CUDA, PhysX, RayTracing, you name it

There is no need for this - if one wants to offload stuff, she can just use the unsafe or ffi module for interfacing with e.g. a CUDA library. In other words, instruction-level parallelism shall IMHO not have any explicit support in V (instruction-level parallelism is already supported in V through the same patterns as in C99+, to which V compiles - we should though "nudge" the compiler to do more vectorization and maybe even use smart auto-annotation SW). If utterly needed, explicit instruction-level parallelism could be added the same way as C11/C++17/non-standard intrinsics (__m128i, _mm_or_si128, _mm_cmpeq_epi8 and tens/hundreds of others).

Prio4: provide distributed computing (cluster nodes over networks) -> julialang.org (of course this is something one could code even himself having good standard libs for networking in V)

Actually, rather not. There shall be no need for this - that's the responsibility of the programmer, who will define her own cooperative/parallel scheduler (as I outlined in my first post). This might be a simple scheduler, but also - as you pointed out - something super complex, adding tens/hundreds of new nodes and many new multiplexers and demultiplexers to the huge graph of go/v routines with queues, scaling over heterogeneous clusters (for a moderately complex scheduler see the internals of Python Dask). So this can't be part of V.

gslicer commented 5 years ago

@dumblob I see the "go" keyword as a builtin for multi-threading only (lightweight processes) - V could however try to load-balance the threads automatically among available CPU cores?

Fully agree that higher layers of parallelism should be spun off into the standard libs (or even separate lib projects written in V).

dumblob commented 5 years ago

I see the "go" keyword as a builtin for multi-threading only (lightweight processes)

My proposal is about changing it to semantics similar to Go's - i.e. the go keyword would spawn a go/v routine (aka CSP thread, green thread, fiber, or you name it).

V could however try to load-balance the threads automatically among available CPU cores?

If you refer to operating system threads, then there is no need for V to load-balance them, as the operating system does it. If you refer to go/v routines, then sure, V will need a scheduler which shall be aware of the number of physical CPU cores available (disregarding whether they are tiny ones or fast ones, as e.g. in big.LITTLE), spawn this number of threads, and schedule go/v routines among them.

dumblob commented 4 years ago

As @aprell pointed out in https://github.com/nim-lang/RFCs/issues/160#issuecomment-529201833 , there is a nice and very current material describing the golang scheduler. It's a must-read for those trying to implement an efficient scheduler for the mix of parallelism & concurrency. Some downsides of the golang scheduler are the following:

  1. it treats all operating system threads uniformly, assuming each thread runs on one physical CPU core (or "hyperthread" on hyperthreading CPUs), but this is not true on big.LITTLE or under power management (see my comment https://github.com/nim-lang/RFCs/issues/160#issuecomment-529371631 ) - golang could though solve this in the future by moving running goroutines among threads (there are some tricky corners like thread-local storage, which would need to be atomically moved to the other thread as well, etc., but these are solvable)

  2. there is no concept of "reentrant" goroutines in golang, and thus it's impossible to spawn new instances (or kill running ones) and do the muxing/demuxing outlined in my initial post above to accommodate bottlenecks - this is not as quickly solved in golang as issue (1) above (as it would probably require extending golang), and I think this could be the major advantage of V compared to golang

Otherwise golang implements pretty closely what I described in my initial post above - so it would make sense to "port" all the parallelism & concurrency logic (which is not exactly trivial code) to V and start improving on top of it.

medvednikov commented 4 years ago

Thanks a lot @dumblob

This is the last unresolved thing in V for me, so all this info is very helpful.

I'll start working on this in early October, but if anyone is willing to do this earlier, they are more than welcome.

gslicer commented 4 years ago

so it would make sense to "port" all the parallelism & concurrency logic to V and start improving on top of it.

I hope you mean with: parallelism => preemptive & concurrency => cooperative multitasking.

For me these are two distinct use cases: in the first one I rely on the OS to "load-balance" my threads among the available (by chance hyper-threaded) CPU cores, and in the second case I want control over which parts of the code have higher priority to be executed (co-routines), even if I do not know a priori how many CPU cycles or how much wait time a particular instruction block will use.

I hope one day we can state about V as follows (taken from the Pony-project):

"It’s data-race free. V doesn’t have locks or atomic operations or anything like that. Instead, the type system ensures at compile time that your concurrent program can never have data races. So you can write highly concurrent code and never get it wrong.

It’s deadlock free. This one is easy, because V has no locks at all! So they definitely don’t deadlock, because they don’t exist."

That's all from my side for now.

dumblob commented 4 years ago

I hope you mean with: parallelism => preemptive & concurrency => cooperative multitasking.

Sure, I do.

For me these are two distinct use cases: in the first one I rely on the OS to "load-balance" my threads among the available (by chance hyper-threaded) CPU cores, and in the second case I want control over which parts of the code have higher priority to be executed (co-routines), even if I do not know a priori how many CPU cycles or how much wait time a particular instruction block will use.

I don't understand why these should be 2 different use cases. For me it's definitely just one use case - I want things to run purely in parallel (concurrency won't help me at all). But because parallelism is super expensive (memory consumption as well as context switching of operating system threads doesn't scale into the millions), I have to combine it with concurrency, leveraging all the extra knowledge I have from the programmer (unlike an operating system kernel, which doesn't know what the programmer intended and thus has to waste memory and context switches on fully featured operating system threads) to make go/v routines (i.e. concurrent threads) behave like truly parallel operating system threads - thus combining the preemptiveness of operating system threads with the memory and context-switching efficiency of go/v routines.

I really don't see any point in having concurrency if we have true parallelism with the same amount of resources and comfort.

Mind, that my proposal gives one full power over scheduling by exposing the scheduler to the programmer.

"It’s data-race free. V doesn’t have locks or atomic operations or anything like that. Instead, the type system ensures at compile time that your concurrent program can never have data races. So you can write highly concurrent code and never get it wrong.

I'm pretty sure this will not be possible (see the latest research). At least not with the current V semantics. Such guarantees can be achieved by constraining the programmer (see e.g. Rust, as already mentioned many times here in V's discussions) or by making the language so incapable that it becomes very impractical :cry:. What can be achieved in V, though, is to make races highly improbable (the syntax & semantics will guide the programmer into writing more or less race-free code) and on top of that provide e.g. a race analyzer as golang does (and run it automatically as part of compilation?). This is the difference between a guarantee and high probability, and it's enormous.

Besides, if one utterly needs plain concurrency, it's easy to do yourself (there are many implementations on GitHub, as it's really an easy high-school exercise). Even on embedded systems I always try first to use one big event loop, and if that's not enough, I resort directly to interrupts (which are costly but provide true preemptiveness) instead of using any coroutines, because they won't bring me anything more than an event loop and they're about as complex as true interrupts with preemptiveness.
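
For illustration, the "one big event loop" mentioned above amounts to something like this minimal Go sketch (a toy, not how V UI or any embedded runtime actually does it):

package main

import "fmt"

type event func()

func main() {
    // One big event loop: a single queue of callbacks drained in order
    // by a single thread - cooperative concurrency with no coroutines.
    queue := make(chan event, 64)
    queue <- func() { fmt.Println("tick") }
    queue <- func() { fmt.Println("tock") }
    close(queue)
    for ev := range queue {
        ev()
    }
}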

oleg-kachan commented 4 years ago

upvote for parallelism + SIMD vectorization as requested here https://github.com/vlang/v/issues/1079

dumblob commented 4 years ago

@medvednikov some implementation ideas:

  1. there is a good-looking attempt to improve on a single-producer single-consumer ring buffer by avoiding the extra empty slot in the buffer, and especially by avoiding any modulo (which is a very slow instruction) - see "Memory and Speed efficient ring-buffer design" in https://github.com/nim-lang/RFCs/issues/160#issuecomment-549199561 ; the (hopefully) final implementation of a lock-less unbounded Multiple Producer Single Consumer queue (ring buffer) is to be found in https://github.com/mratsim/weave/pull/21 . Note though that the current implementation of V channels (vlib/sync/channels.v) uses something akin to SPSC with subscribing at run time. There is a potential alternative to the SPSC ring buffer in the form of a "BipBuffer" (bipartite buffer) offering higher performance. A sketch of the no-modulo indexing idea follows after this list.

  2. follow the yet-incomplete design & implementation of a modern thread-safe object pool in https://github.com/nim-lang/RFCs/issues/160#issuecomment-549199561 (the objectives are really challenging, and if delivered without any side effects it'll be a leap forward)

  3. get inspired by the ideas as well as the actual implementation of mimalloc, the most stable and efficient (and probably fastest as of now) os-threads-friendly malloc() replacement, which doesn't sacrifice single-thread performance
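
As promised in item 1, here is a hedged Go sketch of the no-modulo indexing trick: with a power-of-two capacity, head and tail are free-running counters and the slot index is a cheap bitwise AND, so neither a modulo nor a sacrificial empty slot is needed. This single-goroutine version omits the atomics and memory ordering a real cross-thread SPSC queue requires, and it is not V's actual channel code.

package main

import "fmt"

type spsc struct {
    buf        []int
    mask       uint64
    head, tail uint64 // free-running counters; head-tail is the fill level
}

func newSPSC(capPow2 uint64) *spsc {
    return &spsc{buf: make([]int, capPow2), mask: capPow2 - 1}
}

func (q *spsc) push(v int) bool {
    if q.head-q.tail == uint64(len(q.buf)) {
        return false // full - every slot is usable, no empty slot wasted
    }
    q.buf[q.head&q.mask] = v // bitwise AND instead of the slow modulo
    q.head++
    return true
}

func (q *spsc) pop() (int, bool) {
    if q.head == q.tail {
        return 0, false // empty
    }
    v := q.buf[q.tail&q.mask]
    q.tail++
    return v, true
}

func main() {
    q := newSPSC(4)
    for i := 0; i < 5; i++ {
        fmt.Println("push", i, q.push(i)) // the 5th push fails: queue full
    }
    if v, ok := q.pop(); ok {
        fmt.Println("pop", v)
    }
}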

dumblob commented 4 years ago

@medvednikov a possible implementation of a highly efficient preemptive scheduling:

Note also that we can easily analyze at compile time the maximum stack size needed for each V parallel routine, so we don't need dynamic stacks (see also the cactus stack problem and an innovative solution https://github.com/mratsim/weave/pull/24 - worth reading!), but rather a fixed-size array of a different size (known at compile time) for each V parallel routine. This saves a lot of trouble and allows for the best possible performance (yeah, for indirect & direct recursion, for FFI, and for unsafe stuff we actually do need dynamic stacks, which means tricks like the jump used in Go, but that shouldn't be very common in performance-critical places, so we can afford some performance penalty in such cases).

dumblob commented 4 years ago

There seems to be an implementation effort with very similar goals (maybe except for the reentrancy, which I'd definitely like to see in V as mentioned above) and with an extensive survey of existing implementations (Go is mentioned as an interesting example as well).

https://github.com/mratsim/weave/issues/22

medvednikov commented 4 years ago

Thanks @dumblob I'll check these out.

cristian-ilies-vasile commented 4 years ago

@dumblob Could you please take a look at Chapel, a parallel programming language (https://chapel-lang.org/ https://github.com/chapel-lang/ ) designed by Cray engineers? Maybe you can borrow some design ideas from Chapel.

dumblob commented 4 years ago

@cristian-ilies-vasile I already did that. Chapel has a large set of very nice features (e.g. regions, multi-paradigm, etc.), but first, they don't seem to match V's goals (especially "there is only one way to do a thing"), and second, they're not generic enough (read: "Chapel is a pile of best practices & optimized principles for a - moderately large - selected finite number of use cases"). Due to this, Chapel programs do squeeze the maximum out of a given supercomputer, but can't be run efficiently enough on non-supercomputer hardware (which is the majority nowadays - see the table in the initial post above) or on the hardware of tomorrow (yeah, stuff like several layers of remote memories with different access speeds instead of just local and remote, etc.). Chapel is also quite a complex language in my experience (despite the authors saying the opposite :wink:).

I have to admit I couldn't find nearly anything in Chapel useful for V (the above proposal is generic enough to support quite efficiently nearly any of the Chapel abstractions, implemented later in specific V libraries - note especially the scheduler "callback").

If you have anything specific in mind, feel free to discuss it here though. I'll be more than happy to learn something or clarify or modify the proposal.

cristian-ilies-vasile commented 4 years ago

@dumblob Thank you for the nice words.

//1 - framework to eliminate bugs. On the concurrency front, my understanding is that more or less the Go green-threads scheduler model (threads monitor) shall be used, with improvements. However, concurrency is hard. According to the paper "Understanding Real-World Concurrency Bugs in Go" [1], it is easy to introduce concurrency bugs even in Go. It would be great if the compiler or a separate tool could check for such bugs and produce clever, human-readable error messages.
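
For reference, below is the kind of minimal data race that Go's race detector (go run -race) already flags at runtime; the paper above catalogs many subtler variants:

package main

import "fmt"

func main() {
    counter := 0
    done := make(chan bool)
    go func() {
        counter++ // unsynchronized write...
        done <- true
    }()
    counter++ // ...racing with this one
    <-done
    fmt.Println(counter) // result is nondeterministic
}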

//2 - Function as a Service at run time. Quote: "To run foo() concurrently, just call it with go foo()". On top of this functionality, can we have vm foo(), where the V compiler creates some sort of micro virtual machine/container, so that the OS and in turn the hypervisor can move foo() to a different machine and execute the function there? That microVM could be wrapped around the WASM binary format, so the processor of the remote server does not matter.

[1] https://songlh.github.io/paper/go-study.pdf Thank you.

Regards, Cristian.

gslicer commented 4 years ago

@cristian-ilies-vasile YES to //1 but NOOOO to //2 - V shall have no "runtime" or even a VM in my opinion - you are free to build your own VM using V though.

cristian-ilies-vasile commented 4 years ago

@gslicer I feel that you did not get it. It's not like Java and the JVM; my suggestion is related to one specific case: the coder could decide that a long-running function should automatically run on a different server/environment. The V ecosystem should provide tools and mechanisms for such an approach in the year 2020.

Think about C - it is still used after 50 years... so I presume V's lifetime will be the same. In that case, let's prepare the foundations to sustain that kind of weight.

Regards, Cristian.

dumblob commented 4 years ago

//1...

Well, my proposal diverges from Go in that the routines won't be cooperatively concurrent but preemptively parallel. This is more in line with the "standard threading" experience one gets on Windows or Unix (pthreads and alike) than with Go. So one can reuse the knowledge, tooling, etc., and more easily reason about the program than one can about Go programs.

My proposal goes further though - I envision built-in support (e.g. a dedicated syntax or even a new keyword) for "reentrancy" (a scheduler can spawn an arbitrary number of instances of a go routine at any time, and can also kill and remove an arbitrary number of them at any time). Second, the programmer shall get full power over scheduling (this might be achieved by many means, so I don't have any particular opinion on the implementation - imagine e.g. the most error-prone but widely known one: a callback).

//2

This sounds somewhat similar to the reentrancy & scheduler "callback" I'm proposing, but my proposal is about the primitives needed (the language syntax & semantics) rather than about any built-in "high-level" stuff like VMs. My goal is to offer the very minimum set of primitives needed to make it possible to efficiently implement something like you described in custom libraries, but not as part of V.

cristian-ilies-vasile commented 4 years ago

@dumblob Related to //1: could you please add a protection mechanism like Time to Live (TTL) when a go routine is called, to set a maximum lifespan for that green thread? If TTL is 0 (zero) it is ignored; for a strictly positive integer value, V's green-thread monitor/scheduler must return an error if that lifespan is exceeded.

dumblob commented 4 years ago

@cristian-ilies-vasile sounds like something easily doable in the app itself - in other words, I don't see any obstacles in my proposal that would prevent implementing such TTL functionality efficiently in your app without any cooperation from V's side. It also doesn't sound like default behavior of V's parallelism, so I currently don't see any reason to add explicit support directly to V.
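
For example, such app-level TTL needs nothing beyond existing primitives - a hedged Go sketch, with Go's context package standing in for whatever V would expose:

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

func worker(ctx context.Context, done chan<- error) {
    select {
    case <-time.After(2 * time.Second): // simulated long-running work
        done <- nil
    case <-ctx.Done():
        done <- ctx.Err() // lifespan exceeded
    }
}

func main() {
    ttl := 500 * time.Millisecond
    ctx, cancel := context.WithTimeout(context.Background(), ttl)
    defer cancel()
    done := make(chan error, 1)
    go worker(ctx, done)
    if err := <-done; errors.Is(err, context.DeadlineExceeded) {
        fmt.Println("worker exceeded its TTL")
    }
}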

Or am I missing something? Please elaborate :wink:.

cristian-ilies-vasile commented 4 years ago

@dumblob It's up to you; however, if we follow your school of thinking ("easily doable in the app itself"), then the green thread/fiber logic should be placed at the app level and not be part and parcel of the V run-time, because we would operate only with low-level primitives.

Indeed, one can implement Timing Wheels logic on their own, but wouldn't it be more helpful for coders to use a rock-solid language abstraction out of the box, with TTL included?

Regards, Cristian.

P.S. For the goroutines scheduler/monitor I feel that this blog [1] is worth reading; the data structures in question (Range-Encoded Bitmaps, Range-Encoded Bit-Slice Indexes) are implemented in Go [2]. [1] https://www.pilosa.com/blog/range-encoded-bitmaps/ [2] https://github.com/pilosa/pilosa

dumblob commented 4 years ago

@cristian-ilies-vasile thanks for the examples. I have to admit I'm a bit lost here. My proposal in this thread is about a hybrid approach (basically I'm reusing the old concept of user-level preemptive threads - i.e. green threads/fibers, but non-blocking - i.e. working hand in hand with the os-level scheduler and timers to wake them up regularly). So I'm not sure how your concern about placing green threads/fibers on top of my proposal relates to this discussion.

Regarding the Timing Wheel stuff, it's primarily about the underlying operating system interface (especially timer capabilities) and has nothing to do with this discussion about parallelism, in my opinion. Or maybe I'm missing something.

For goroutines scheduler/monitor I feel that this blog [1] is worth to read, that data structures (Range-Encoded Bitmaps, Range-Encoded Bit-Slice Indexes) are implemented in GO [2]

I'm not aware that Go uses these data structures in its goroutine implementation. Are you speaking about the official Go compiler, or about gccgo (GCC-based), or something else? How does one use range-encoded bitmaps and/or range-encoded bit-slice indexes for a goroutine implementation? The blog post you've linked is also 2 years older than the material describing the whole golang scheduler linked above.

It seems to me we're talking at cross purposes. Could you elaborate how TTL relates to general scheduling?

If you're talking just about parallel computation in LAN clusters and WAN clusters (as mentioned in the initial description above), then I could understand the need for some TTL to prevent routing issues, but that's all supported in my proposal through the "callback" to your own parallel scheduler, which will be aware of the specific needs of your app and will therefore support TTL handling etc.

Of course, the more that is at hand by default the better for programmers, but TTL handling is so extremely specific (it relates just to a subset of use cases from the last 2 rows of the table in the initial post in this thread) that it should be totally fine if it's part of network libraries etc., but not part of V (by V is usually meant V and its standard library, which is to be very small). Among the main differences between V and its standard library versus additional libs (e.g. network libs) are the pace of development, the guarantees of backwards/forwards compatibility, and the guarantees regarding multiplatform functionality (both operating systems as well as the hardware itself).

cristian-ilies-vasile commented 4 years ago

Job Queue Scheduling with Bit Vector Indexes (http://bitmagic.io/bm-optque.html)

dumblob commented 4 years ago

@cristian-ilies-vasile oh, job scheduling. That's a bit different topic - I'm not proposing any jobs here, but rather long/infinitely running go routines (imagine each go routine being an endless loop - analogous to a generic stream processor) having at least one input and zero or more outputs. This thread is then about how the "work agents" (as they're called in the paper you've linked; the stream processors, as I call them here) should be implemented, but not about how the work should be scheduled among them (in my proposal I'm deliberately leaving this up to the operating system by default, with the possibility to define your very own scheduler).
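
The "generic stream processor" shape described above, as a small illustrative Go sketch: a long-running routine that loops over one input and feeds one output.

package main

import "fmt"

// stage is a stream processor: an endless loop over its input queue.
func stage(in <-chan int, out chan<- int) {
    for v := range in { // runs for as long as the input stays open
        out <- v * v
    }
    close(out)
}

func main() {
    in, out := make(chan int, 8), make(chan int, 8)
    go stage(in, out)
    go func() {
        for i := 1; i <= 5; i++ {
            in <- i
        }
        close(in)
    }()
    for v := range out {
        fmt.Println(v) // 1 4 9 16 25
    }
}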

There could definitely be a library for scheduling, as I mentioned above, if e.g. the operating system scheduler isn't sufficient, but that's currently not the target of my proposal. Feel free to make a separate proposal for scheduling algorithms in this issue tracker :wink:.

cristian-ilies-vasile commented 4 years ago

If anyone wants to read the design document for coroutines in Kotlin you can find it here: https://github.com/Kotlin/KEEP/blob/master/proposals/coroutines.md

wanton7 commented 4 years ago

I've been using https://github.com/dotnet/orleans, which implements the virtual actor model from Microsoft, and I think it's a really awesome concept. It's used in the Halo games' backend services. Actors (grains in Orleans) support things like reentrancy and one-way calls. You can mark methods or a whole grain reentrant - for example, you can mark state-reading methods as reentrant so they can read grain state while the grain is saving something to the database. One-way works so that the return from the method is not awaited and the called method returns immediately.

This is all possible because of a custom scheduler and async & await. I think https://github.com/AsynkronIT/protoactor-go is trying to implement something similar, but I'm not sure how well that will fit Go's concurrency model.

Whatever concurrency model is chosen for V, I hope it's something that can implement the virtual actor model.

Omnuchka commented 4 years ago

Whatever concurrency model is chosen for V, I hope it's something that can implement the virtual actor model.

Coroutines are more flexible than the actor-based model; you can always use coroutines like actors by dedicating one coroutine to one resource - no need to fetter yourself.

dumblob commented 4 years ago

Actors, coroutines, etc. of your choice - that all can be implemented/simulated using the "primitives" I outlined in my proposal.

Btw. async/await or alike (i.e. generally the asynchronous model) is by far not the best solution - it's just fashionable nowadays, but it doesn't solve the fundamental problems any better than other models (actually rather vice versa, as there are things which can't be solved asynchronously, and in addition it gets very error-prone and verbose if applied everywhere).

elimisteve commented 4 years ago

@dumblob Would writing V programs that run on https://github.com/bastion-rs/bastion be good enough? (Very interesting project!) I'm curious how you're thinking about what must be a language feature and what could be part of an underlying platform/distributed runtime/etc.

dumblob commented 4 years ago

Would writing V programs that run on https://github.com/bastion-rs/bastion be good enough?

I didn't read the Rust sources (just skimmed through some examples), but it all sounds like yes, it would be good enough (though I'd need to review performance). On the other hand, we can definitely stay smaller in V - the Bastion runtime is not just "mutually compatible & performant tools/language features to easily build such a system", but rather an "existing ecosystem which you can readily deploy and use".

what must be a language feature

What shouldn't be a language feature:

What might be a language feature (i.e. I'm not against having it built in V, but strictly speaking it's not necessary and could violate the V mantra of do one thing exactly one way):

lygstate commented 4 years ago

How about async/await as it comes in Rust? https://blog.rust-lang.org/2019/11/07/Async-await-stable.html

dumblob commented 4 years ago

@lygstate please see my comment https://github.com/vlang/v/issues/1868#issuecomment-586968033 above

lygstate commented 4 years ago

@dumblob Still not getting the idea - async/await may not be the best for performance, but it's easy to understand and easy to code.

dumblob commented 4 years ago

but it's easy to understand and easy to code

@lygstate I'm afraid I can't share this view - have you read the linked article about "everything async"? That's where this coding style leads :cry:.

Another issue is that async/await has nothing to do with parallelism (async/await is just a concurrency abstraction), but parallelism is the inevitable base building block for using more than one physical CPU core. See the initial post in this thread, which explicitly says that both are needed, with emphasis on parallelism.

Second, you're free to create async/await easily in V itself in a highly performant and syntactically pleasant manner even without any built-in support (btw. there is already async/await available - see e.g. how V UI is built around an async/await scheduler).

lygstate commented 4 years ago

@dumblob Yeap, I am talking about single-threaded async/await, not about parallelism. Is there anything better than async/await for single-threaded coroutines?

dumblob commented 4 years ago

Is there anything better than async/await for single-threaded coroutines?

Of course there is - writing your own abstraction of an async function which uses an explicit, fully dynamic stack rather than an implicit fixed-size one (Go has two fixed sizes, but that still can't compete with explicit management). This is then more performant and doesn't waste a single byte of memory, though manual "stack" management is slightly more verbose and the programmer might sometimes feel like a living garbage collector :wink:. This is actually what V UI kind of does.
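
A hedged Go sketch of that explicit style (all names invented): the suspendable computation becomes a state machine whose frame holds exactly the locals that must survive a suspension, instead of a whole implicit coroutine stack.

package main

import "fmt"

// frame is the explicitly managed "stack": just the resume point and the
// locals that must live across suspension points.
type frame struct {
    state int // 0 = before the awaited operation, 1 = after it
    i     int // the loop counter
}

// step resumes the computation once and returns true when it has finished.
// Each case stands for the code between two suspension points.
func step(f *frame) bool {
    switch f.state {
    case 0: // a real driver would park here awaiting a timer or I/O readiness
        f.state = 1
        return false
    case 1: // the awaited operation completed; advance the loop
        fmt.Println("finished iteration", f.i)
        f.i++
        if f.i < 10 {
            f.state = 0
            return false
        }
        return true
    }
    return true
}

func main() {
    f := &frame{}
    for !step(f) { // an event loop would resume f only when it is ready
    }
}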

If you want to see it more explicitly, take a look e.g. at simulation C++ libraries like https://www.fit.vutbr.cz/~peringer/SIMLIB/ (class Store, which uses a priority Queue as its "stack"). Simulation libraries frequently implement something like a "functional object" with its own stack in quite performant ways.

lygstate commented 4 years ago

Is there any demo of that? For example, how to do something like this:

for (int i = 0; i < 10; i += 1)
{
    await sleep(1000);
    await doIoWork();
}

lygstate commented 4 years ago

@dumblob well, performance is important, but in production, productivity and correctness are more important - we can sacrifice 20% of performance for productivity and correctness.

wanton7 commented 4 years ago

@lygstate You should move this to the official V language chat on Discord, because every subscriber to this issue gets a notification when you post here. These posts start to feel to me like beating a dead horse. There won't be async and await in V; its concurrency model has already been decided by the V language author to be similar to Go's: https://github.com/vlang/v/blob/master/doc/docs.md#concurrency .

I'm a C# programmer by profession, and it's the language that mainstreamed async & await; it leads to duplicate implementations in system libraries - for example, in .NET (C#) libraries a lot of functions have a second version with an Async suffix, like System.IO.File.ReadAllLines and System.IO.File.ReadAllLinesAsync. When using Go's concurrency model, from the programmer's perspective you always write synchronous code and the runtime handles concurrency underneath, so you don't have this code duplication. Async and await are also like a virus, infecting your code from the top down.

cristian-ilies-vasile commented 4 years ago

Entry link for Discord chat channels for V development & more is https://discord.gg/vlang

cristian-ilies-vasile commented 4 years ago

@dumblob What do you think about starting a basic implementation?

medvednikov commented 4 years ago

Yes, I think it'd be good to start with a basic implementation to have a fully working one by 0.3.

krolaw commented 4 years ago

I had written some (undoubtedly outdated by now) V code to provide Go-like channels and select functionality, put on hiatus until generic structs arrive. If we can have generic structs, Go-like inter-thread communication can follow, and hopefully Rust-level safety after that.

dumblob commented 4 years ago

@cristian-ilies-vasile thanks for the ping (I'm too busy to respond quickly, so please bear with me in these times)

@ylluminate @cristian-ilies-vasile @lygstate I saw your comments starting with https://github.com/vlang/v/issues/3814#issuecomment-633721380 and I'm quite puzzled.

Everything mentioned there is purely non-preemptive, though per the requirements most of us seemed to agree on, preemptiveness is one of the major goals. Writing something non-preemptive would be a huge step back, even compared to the current state of V.

That's why I also linked the preemptive implementation of extremely lightweight "user-space threads" above (see also https://github.com/krinkinmu/userspace-threads/issues/1 ).

What do you all think?

Just to clarify - I'm not against having "safe points" concurrency on top of a preemptive backend implementation, but I think we first need to implement the preemptive backend and only then improve e.g. the scenario when a task is too short to consume its given time slice: instead of interrupting - which is costly - just switch to the next cooperative task, and so on until the time slice is used up (this would basically amount to cooperative scheduling with real-time guarantees).

lygstate commented 4 years ago

@dumblob I mentioned purely non-preemptive because it also has to work in bare-metal and Web environments. I mean we need an infrastructure that works in both preemptive and non-preemptive environments - the same API set for both targets.

cristian-ilies-vasile commented 4 years ago

@dumblob Hello, thanks for the comment. We shall stick with the initial goal - implement the preemptive backend and then improve - as summarized in V Concurrency Considerations https://github.com/vlang/v/issues/3814 . lygstate just put up a lot of links to concurrency resources.

So, we are in sync with this task.

One more thing: do you prefer to communicate via a different channel on top of GitHub comments? Thank you.

cristian-ilies-vasile commented 4 years ago

@krolaw I did send a ping on discord yesterday on this matter.

dumblob commented 4 years ago

@cristian-ilies-vasile thanks for the clarification. As for communication, I'll probably stay just here (no time to hang out anywhere else). I think you'll update these GitHub issues anyway if there is anything important to discuss/mention, to make it visible to everybody and not just Discord users.

cristian-ilies-vasile commented 4 years ago

@dumblob Hello, there are 4 people involved in this task on the hard side, i.e. writing code: you, @helto4real, @krolaw and @UweKrueger. In order to make progress I feel that we need a proper understanding of:

//1 start drafting a technical design based on the available ideas and agreements. Doing so, we can split the main task into small tasks and see who can work on what and how fast.

//2 Another aspect is related to major milestones: what do you think is mandatory for version 0.3 (short-term vision), and how can it evolve smoothly up to 1.0, the stable version?

//3 We need to make this concurrency interoperable at the syntactic and semantic level with the V language -> this could require implementing new keywords/concepts, which in turn will trigger new PRs. Keep in mind that only 3-4 people are working on the scanner - parser - checker - C code generator flow. We also need to clarify what reference capabilities V will have (if any) and how light green threads should use these capabilities.

//4 open points, for example: "Exactly - that's the reason why I'm curious how we would implement the long-running 'background' tasks using the nursery concept in V, as there are currently no means in V guaranteeing the existence of a proper defer. And having no guarantees totally defeats the purpose of the whole nursery concept" :wink:.

Thank you.