vlang / v

Simple, fast, safe, compiled language for developing maintainable software. Compiles itself in <1s with zero library dependencies. Supports automatic C => V translation. https://vlang.io
MIT License

V Concurrency Considerations #3814

Closed: ylluminate closed this issue 2 years ago

ylluminate commented 4 years ago

Transferred from a document posted by @cristian-ilies-vasile:

V concurrency high level design

After careful consideration of the proposed concurrency models and the suggestions expressed on the Discord v-chat channel, the high-level design is based on the Go language model (message passing via channels), which in turn is a variation of communicating sequential processes (CSP).

I read papers on the actor model, but it seems that coders do not use all the primitives provided by the language and resort to threads and queues (see the Why Do Scala Developers Mix the Actor Model paper).

Almost all high-level requirements are taken from Proper support for distributed computing, parallelism and concurrency, published on GitHub.

Because there are many words used more or less interchangeably (coroutine, goroutine, fiber, green thread, etc.), I will use the term green thread.
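
For illustration, a minimal sketch of the channel-based (CSP-style) model, assuming only the chan type and the go keyword as used in V at the time of this discussion, not any of the runtime machinery discussed below:

fn producer(ch chan int) {
    for i in 0 .. 5 {
        ch <- i // pass results over the channel instead of sharing variables
    }
}

fn main() {
    ch := chan int{cap: 5} // buffered channel, no shared mutable state
    go producer(ch)        // runs concurrently as a (green) thread
    for _ in 0 .. 5 {
        println(<-ch)      // receive blocks until a value is available
    }
}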

Key points

Open points

dumblob commented 4 years ago

@helto4real wow, thanks for such effort!

I think for TCC I'd look at MUSL stdatomic because it provides fallback functions (using locks, of course) if the compiler doesn't support atomics. MUSL does use Linux futexes, but it might give you a notion of how to implement it for Windows.

For -freestanding (bare metal) this might be yet another challenge - but for now I'd focus on these major platforms.

helto4real commented 4 years ago

@helto4real wow, thanks for such effort!

I think for TCC I'd look at MUSL stdatomic because it provides fallback functions (using locks, of course) if the compiler doesn't support atomics. MUSL does use Linux futexes, but it might give you a notion of how to implement it for Windows.

For -freestanding (bare metal) this might be yet another challenge - but for now I'd focus on these major platforms.

Thanks, that is a good tip! Will do!

cristian-ilies-vasile commented 4 years ago

If one goes to Oracle Labs (https://labs.oracle.com/), there is a search field at the top of the page. Searching for concurrency, locks or threads will bring back a couple of nice papers.

cristian-ilies-vasile commented 4 years ago

For hard-to-find C bugs (undefined behaviour, use after free, memory leaks, etc.), there is a way based on Oracle's new AST interpreter (Truffle), runtime compiler (Graal) and virtual machine called GraalVM. There is also an LLVM IR interpreter named Sulong, so it is perfectly reasonable to execute low-level/unsafe languages (C, C++) on GraalVM.

The simplified workflow is: .v source code -> |V compiler| -> .c source code -> |Clang compiler + linker| -> LLVM IR -> |Sulong| -> |Truffle| -> |Graal compiler| -> machine code

Papers:
Sulong and Thanks For All the Bugs / http://ssw.jku.at/General/Staff/ManuelRigger/ASPLOS18.pdf
LLVM IR in GraalVM: Multi-Level, Polyglot Debugging with Sulong / https://llvm.org/devmtg/2019-04/slides/TechTalk-Kreindl-LLVM_IR_in_GraalVM.pdf
Sulong: An experience report of using the "other end" of LLVM in GraalVM / https://llvm.org/devmtg/2019-04/slides/TechTalk-Schatz-Sulong_an_experience_report.pdf

cristian-ilies-vasile commented 4 years ago

A short list of C/C++ source code static analyzers, just in case:
http://frama-c.com/ (the software can run the Mthread plug-in, but this plug-in is covered by a proprietary licence)
http://cppcheck.sourceforge.net/
http://svf-tools.github.io/SVF/
https://cpachecker.sosy-lab.org/index.php
https://phasar.org/phasar/
https://github.com/nasa-sw-vnv/ikos

cristian-ilies-vasile commented 4 years ago

@dumblob Hello. If you have time, please take a look at this presentation on threading on the Mill CPU - a talk by Ivan Godard, December 5, 2017, at Facebook: http://millcomputing.com/blog/wp-content/uploads/2017/12/threading.02.pptx

NOTE: you need Microsoft PowerPoint or a compatible viewer such as LibreOffice. Thank you.

ylluminate commented 4 years ago

Interesting links from @cristian-ilies-vasile:

@fantassin noted that perhaps some ideas could be drawn from Intel's Threading Building Blocks:

Some discussion: https://discordapp.com/channels/592103645835821068/592106336838352923/727539427144106015

cristian-ilies-vasile commented 4 years ago

https://discord.com/channels/592103645835821068/592106336838352923/727539427144106015

I want to propose the block keyword. The usage would be block variable_name or block(variable_name), similar to the existing cast syntax, e.g. f64(x_point). From that point on, the compiler knows that a formerly mut or atomic variable has become read-only, so it is safe to have n pointers to that object, passed down to threads which in turn just consume the data.

This is a way to "cast" the mut capability to a read-only / const one:

mut age := 20
println(age)
age = 21
println(age)
block age // or block(age) - from now on age is immutable

age = 22 // error raised by the compiler: "age is immutable, cannot assign value 22 to age"

cristian-ilies-vasile commented 4 years ago

Optimistic concurrency with OPTIK - abstract:

We introduce OPTIK, a new practical design pattern for designing and implementing fast and scalable concurrent data structures. OPTIK relies on the commonly-used technique of version numbers for detecting conflicting concurrent operations. We show how to implement the OPTIK pattern using the novel concept of OPTIK locks. These locks enable the use of version numbers for implementing very efficient optimistic concurrent data structures.

https://dl.acm.org/doi/pdf/10.1145/3016078.2851146 https://dcl.epfl.ch/site/optik
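
A rough, hypothetical illustration of the pattern (not taken from the paper), expressed with V's shared/lock syntax as an assumption about how it could look in V: read the version, do the work optimistically outside the lock, and commit only if the version is still unchanged.

struct Counter {
mut:
    version u64
    value   int
}

// optik_add retries until it can commit its update against an unchanged version.
fn optik_add(shared c Counter, delta int) {
    for {
        mut seen := u64(0)
        mut old := 0
        rlock c {
            seen = c.version
            old = c.value
        }
        new_val := old + delta // "optimistic" work done outside the lock
        lock c {
            if c.version == seen {
                c.value = new_val
                c.version++
                return
            }
        }
        // the version changed underneath us: another writer won, retry
    }
}

fn main() {
    shared c := Counter{}
    t := go optik_add(shared c, 1)
    t.wait()
    rlock c {
        println(c.value)
    }
}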

helto4real commented 4 years ago

This is a very nice talk explaining goroutines and scheduling in Go. It also explains why lock-free was not chosen, and how they solved the problems around syscalls and coroutines.

https://www.youtube.com/watch?v=-K11rY57K7k

cristian-ilies-vasile commented 4 years ago

and the presentation https://assets.ctfassets.net/oxjq45e8ilak/48lwQdnyDJr2O64KUsUB5V/5d8343da0119045c4b26eb65a83e786f/100545_516729073_DMITRII_VIUKOV_Go_scheduler_Implementing_language_with_lightweight_concurrency.pdf

cristian-ilies-vasile commented 4 years ago

I did some research and found an interesting approach for preventing locks from being hammered excessively by threads, which improves scalability.

Presentation / Generic Concurrency Restriction https://labs.oracle.com/pls/apex/f?p=LABS:40150:0::::P40150_PUBLICATION_ID:5413

Paper / Avoiding Scalability Collapse by Restricting Concurrency https://arxiv.org/abs/1905.10818

In this paper, we introduce GCR (generic concurrency restriction), a mechanism that aims to avoid the scalability collapse. GCR, designed as a generic, lock-agnostic wrapper, intercepts lock acquisition calls, and decides when threads would be allowed to proceed with the acquisition of the underlying lock. Furthermore, we present GCR-NUMA, a non-uniform memory access (NUMA)-aware extension of GCR, that strives to ensure that threads allowed to acquire the lock are those that run on the same socket.

cristian-ilies-vasile commented 4 years ago

@dumblob Hello. For an elastic computing / autoscaling capability we need to compute statistics for at least each thread and channel.

I found this paper: Circllhist -- A Log-Linear Histogram Data Structure for IT Infrastructure Monitoring https://arxiv.org/abs/2001.06561

The circllhist histogram is a fast and memory efficient data structure for summarizing large numbers of latency measurements. It is particularly suited for applications in IT infrastructure monitoring, and provides nano-second data insertion, full mergeability, accurate approximation of quantiles with a-priori bounds on the relative error. Open-source implementations are available for C/lua/python/Go/Java/JavaScript.

Might this circllhist data-structure help keep track of latency over time for each thread? Thank you.

dumblob commented 4 years ago

Circllhist

Wow, that's a cool structure, I like it (I've been looking for something like that for I/O scheduling for block storage offering rather linear than parallel access).

I don't think, though, that it fits anything we're seeking here. It's still quite bulky (https://github.com/circonus-labs/libcircllhist/blob/18f1b97603757d5e65baefa65b90d6ec182e8a30/src/circllhist.c#L152-L155 ) for our purposes, it won't scale well enough with the number of go routines (my goal would be tens of millions of go routines on a modern 4-core big.LITTLE ARM SoC for smartphones), and 99% of the information it provides would probably stay completely unused during scheduling.

If we'll ever need anything "advanced" regarding "history", then the first thing I'd look into would be some kind of Kalman filter.

But I'm pretty certain we won't need anything else than what we already have to achieve satisfactory performance in V 1.0 - i.e. time measurement (given by scheduler invocations) and the current filling of channels. Imagine a heuristic like a simple "flood" algorithm, but iteratively deepening from the most stalled places, where one scheduler invocation would do the next flood iteration only if the situation didn't get much better, and would restart the flood if the change seemed to worsen the overall state.

Overall it seems the hundreds of links posted here directly or indirectly converge, more or less, to the links & ideas in the main thread - which is generally a good sign if it doesn't cause too much duplication (e.g. the Vyukov presentation, resources about green threads, etc.).

Hereby let me encourage everybody to carefully look at all links in the main thread before posting any new links here to avoid (sort of detrimental) thoughts like "the other thread is old, let's focus only on this new stuff here because there is not much time".

cristian-ilies-vasile commented 4 years ago

@dumblob On V's Discord channels I have always promoted both tickets, and people have been invited to subscribe and express ideas or concerns on both, so do not worry :)

Going back to the many links: this is a deliberate action on my side, trying not to pollute the original entry and to add the more or less valuable resources here instead.

cristian-ilies-vasile commented 4 years ago

Bounding data races in space and time – part I https://blog.acolyer.org/2018/08/09/bounding-data-races-in-space-and-time-part-i/

Bounding data races in space and time – part II https://blog.acolyer.org/2018/08/10/bounding-data-races-in-space-and-time-part-ii/

cristian-ilies-vasile commented 4 years ago

I did some research regarding Nim's multithreading engine named Weave (https://github.com/mratsim/weave): Embracing Explicit Communication in Work-Stealing Runtime Systems - designing an ultra low overhead multithreading runtime for Nim.

Weave is not designed for keeping latency extremely low; it is designed for throughput, which is not automatically a bad thing. Quote: Picasso is not optimized for real-time or latency: it can be used in latency-critical applications like games, however there is no fairness nor a system like a priority queue to ensure that a job finishes with priority. The goal is to process the total work as fast as possible; fast processing of individual pieces of work is a side-effect.

I know that V fans are working in the gaming industry, so the main coders should decide what type of concurrency is best suited for the language and its application domains. We can pursue Weave's goals, or, if low latency is paramount, a fully preemptive solution should be designed.

However, Weave is freakishly fast. On matrix multiplication, the traditional benchmark used to rank the world's top 500 supercomputers, Weave's speedup on an 18-core CPU is 17.5x, while the state-of-the-art Intel implementation using OpenMP achieves a 15.5x-16x speedup.

dumblob commented 4 years ago

Posting in a hurry.

Weave is not designed for keeping latency extremely low; it is designed for throughput, which is not automatically a bad thing. Quote: Picasso is not optimized for real-time or latency: it can be used in latency-critical applications like games, however there is no fairness nor a system like a priority queue to ensure that a job finishes with priority. The goal is to process the total work as fast as possible; fast processing of individual pieces of work is a side-effect.

This is out of date - Weave actually has an optional centralized queue to bring in a degree of fairness that is more than acceptable for games and similar use cases. @mratsim, could this be rephrased on the Picasso/Weave sites/docs/wikis?

a fully preemptive solution should be designed.

You know me already :wink: - I'm definitely a proponent of a fully preemptive solution to make V flawlessly interoperable with anything in the world (all those C/C++/FFI/... libraries).

There are "tags" (attributes like [inline], etc.) in V which could later be used as a programmer's covenant with the compiler that the given routine doesn't need to be preempted; if all parallel routines are tagged like that, then a non-preemptive scheduler can be used instead of the default preemptive one. But that's a very specific case and I'm confident it shouldn't be part of V 1.0 (even the tiniest embedded chips have a counter with an interrupt and thus allow full preemptiveness, which should be cheap with the proposal in the last paragraph of https://github.com/vlang/v/issues/1868#issuecomment-634516673 ).
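
To make the idea concrete, a small sketch: [inline] below uses V's attribute syntax as written at the time of this discussion, while the commented-out [no_preempt] tag is purely hypothetical and does not exist in V.

// Existing attribute syntax:
[inline]
fn scale(x int) int {
    return x * 2
}

// A covenant attribute as described above might one day look like:
// [no_preempt]
// fn pure_compute(data []int) int { ... }
// (hypothetical - no such tag is implemented)

fn main() {
    println(scale(21))
}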

cristian-ilies-vasile commented 4 years ago

Maybe we can borrow some of Weave's design decisions (channel "plumbing", work stealing) in the preemptive runtime. Also, splitting the potential go fn() workload into 2 main partitions - a) CPU-bound, i.e. sensitive to throughput, and b) IO-bound with 3 levels of priority (low, medium & high), i.e. sensitive to latency - is not a weird idea anymore. We can then address the specific issues of each in the best manner.
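
A hypothetical sketch of that classification in V - none of these names exist in the language or its runtime; they only illustrate how a spawned workload could be tagged:

// WorkKind and Priority are illustrative only.
enum WorkKind {
    cpu_bound // throughput-sensitive
    io_bound  // latency-sensitive
}

enum Priority {
    low
    medium
    high
}

struct TaskSpec {
    kind     WorkKind
    priority Priority
}

fn main() {
    spec := TaskSpec{
        kind: .io_bound
        priority: .high
    }
    println(spec)
}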

cristian-ilies-vasile commented 4 years ago

For Those About To Rock: the concurrency discussions topic on Discord has been launched! https://discord.com/channels/592103645835821068/743149495818387476

shelby3 commented 3 years ago

I don’t do Discord so I have no access to what is being discussed there.

  • a green thread must be reentrant

I presume this is some attempt to achieve data race safety for data that is referenced by more than one green thread? [I found out it is some idea for duplication of a thread presented by @dumblob.]

This is quite onerous in terms of what a type system will need to prove to fulfill this guarantee. Why can't each green thread behave as if it is a sequential green thread, and then the data race safety issue can be handled separately for shared data, such as by making it immutable or by implementing some reference capabilities with ownership for the shared data, as in the Pony language?

  • Green threads communicate using message passing (channels) instead of shared variables

No shared memory access between threads? If yes, then that will be very limiting.

With massively multicore arriving now, and the inevitable cost of sharing memory accesses across a communication bridge between clusters of cores, there will be a need to limit sharing of memory to within a cluster of cores. There may still be a use case for sharing memory within a cluster?

If not sharing memory then what is the reason for the reentrant stipulation?

mux/demux green threads on real threads how to?

Golang presumably has a huge investment in this tuning. There are issues such as starvation pressure versus latency. This is very tricky to get right. Should you not be trying to lift much of the implementation from Go instead of reinventing the wheel with your presumably much more meager resources? Google is presumably running Go on thousands of servers, so it has a significant incentive to invest in the runtime.

Here’s some links to a collection of the research, thoughts and notes I collected on M:N concurrency:

https://github.com/keean/zenscript/issues/17#issuecomment-416825734

https://github.com/keean/zenscript/issues/17#issuecomment-280900711

https://www.reddit.com/r/scala/comments/ir8ygb/what_is_missing_in_scala_ecosystem/g57axaz/?context=6

  • how to catch programs bugs (Heisenbugs) before the code is shipped to prod systems?

Are you asking what the data race safety model should be?

If you are referring to bugs due to multithreaded contention, don't go there. It's an insoluble hornet's nest. Opt for a lockfree model. There will always be lock contention at some higher-level semantic, but then that is not your concern.

Almost all high level requirements are taken from Proper support for distributed computing, parallelism and concurrency published on github.

I need to read this next. I will post what I have for this comment first.

It also explains why lock-free was not chosen.

I will need to watch that, thanks.

Edit: I found the reference at the ~11:30 mark in the video. Okay, so it's apples-to-oranges in terms of what I am thinking should be lock-free. Dmitry is referring to the scheduler for the M:N user-threads (aka green threads). I am referring to the API that the Vlang programmer interfaces with. I am saying that Vlang programmers should be prevented or strongly discouraged from using low-level thread synchronization primitives, and instead should employ other lockfree means for (avoiding) contention over shared resources. I elaborated in the other thread on this topic.

I also found this:

https://stackoverflow.com/questions/13102874/is-gos-buffered-channel-lockless

Regarding structured concurrency, I would suggest reading https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ first, which explains reasons to use structured concurrency instead of go-style concurrency.

I will need to read that, thanks. Just pointing out what I have not yet factored into my reasoning above.

dumblob commented 3 years ago

@shelby3 thanks for reaching out to us!

No rush here, V parallelism is work in progress so read slowly and carefully both threads (this one and https://github.com/vlang/v/issues/1868 ) and all referenced resources & links recursively. Then think of it for a few days, come back, clarify or adjust your concerns expressed above and we'll be more than happy to discuss open questions (if there are any :wink:) and accept PRs.

There should be enough time (look at the frequency with which both threads progress and correlate with commits in master), so don't hurry, we'll wait. Thanks.

shelby3 commented 3 years ago

Regarding structured concurrency, I would suggest reading https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ first, which explains reasons to use structured concurrency instead of go-style concurrency.

I will need to read that, thanks. Just pointing out what I have not yet factored into my reasoning above.

Now I remember I already incorporated that into my analysis in 2018:

https://gitlab.com/shelby/Zer0/-/issues/11#spawning-goroutines-posited-to-be-as-harmful-as-goto

https://github.com/keean/zenscript/issues/17#issuecomment-359338947

Structured concurrency is applicable when you want to explicitly express short-lived, concurrent operations which are considered children that should complete before the algorithm in the parent thread continues, because in this case there are data race issues to contend with, as the forked paths may return in any timing order.

Whereas for goroutines which are long-lived entities, no longer owned by or subservient to the thread which spawned them, structured concurrency does not apply and only appropriate data race safety restrictions on shared data apply.

IOW, I suggest we need both go and the nursery concept.
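
A minimal nursery-like sketch in V, assuming only thread handles returned by go and their .wait() method - the "nursery" here is simply the scope that refuses to continue before all of its children have finished:

fn child(id int, out chan string) {
    out <- 'child ${id} done'
}

fn main() {
    out := chan string{cap: 3}
    mut children := []thread{}
    for i in 0 .. 3 {
        children << go child(i, out)
    }
    // the "nursery" boundary: the parent cannot proceed before all children finish
    children.wait()
    for _ in 0 .. 3 {
        println(<-out)
    }
}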

cristian-ilies-vasile commented 3 years ago

I read an interesting, down-to-earth paper from 2020 on how to identify rare and hard-to-detect errors & bugs in multi-threading code (locks, unlocks, mutexes, semaphores, atomics, etc.). It is very fast and effective at catching bugs; see page 4 of the paper.
Sthread: In-Vivo Model Checking of Multithreaded Programs / https://hal.inria.fr/hal-02449080/document
The point is to use the SimGrid software under Linux (pthreads and Windows... not a nice match), which is not stated very obviously in the paper. SimGrid / https://simgrid.org/
Another paper related to the first one is a presentation, see slide 47 and onwards / http://www.ccs.neu.edu/home/gene/dmtcp-msr-19.pdf

cristian-ilies-vasile commented 3 years ago

On V's concurrency front we have started working on a proof of concept based on the Weave project.

shelby3 commented 3 years ago

@cristian-ilies-vasile finding and fixing needle-in-a-haystack thread synchronization heisenbugs is not equivalent to a paradigm which can never create those bugs. The latter would be provably 100% free from these bugs at compile-time. I urge you: please do not repeat the design mistake of encouraging the use of thread synchronization primitives. I urge you to only allow access to these in unsafe{} code. Safe code should be provably safe at compile-time. As I explained herein and in the companion thread #1868, there are shared data paradigms which are provably safe at compile-time, such as immutable shared data. The Vlang compiler could enforce it by, for example, only allowing threads to share references to immutable shared data.

Just as you would not allow pointer arithmetic in safe code, you should not allow in "safe code" a paradigm famous for creating insidious heisenbugs, because it is well known to not be safe at all and to always proliferate such bugs. I believe Eric Raymond (re-)coined or first used (in 2014) the term "defect attractor" to describe features that invite the creation of defects in software.

I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits. It’s impossible to thoroughly reason about thread synchronization because it is a dynamic synchronization problem and Turing-complete programs can’t be proven to halt (the Halting problem). Thus these unseeable heisenbugs are not equivalent in genre to other kinds of bugs which can be in many cases reasoned about and prevented (or at least discovered) by human brains and eyeballs or some formal methods, without even running the program.

Additionally, as you presumably know, thread synchronization defeats massively multicore (which has already arrived for servers) due to Amdahl's law (especially in conjunction with amplified contention due to increasing numbers of contending threads). Really, you want to embrace immutability; it is the only viable direction going into the future. And Golang does not have any capability for expressing immutability at this time, so this can be a significant advantage for Vlang.

Another reason to favor “future-proof” immutability is what I wrote in the companion thread:

I had posited in 2018 that immutable shared data will scale better to the inevitable future massively multicore shift away from universal hardware cache coherency, because the cache invalidation can be unified with the message between threads to update their copy.

I have contemplated other benefits to immutability.
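
A small V sketch of that direction, relying only on documented behaviour (bindings and function parameters are immutable unless declared mut), so a thread given plain non-mut data can only read it:

fn worker(data []int, out chan int) {
    // `data` is not mut: the worker can only read the shared values
    mut sum := 0
    for v in data {
        sum += v
    }
    out <- sum
}

fn main() {
    data := [1, 2, 3, 4] // no `mut`: the binding is immutable
    out := chan int{cap: 1}
    go worker(data, out)
    println(<-out)
}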

P.S. I have not read about Weave yet.

EDIT: Eric Raymond wrote the following, which supports my warning:

While the ability to manage lots of concurrency fairly elegantly does pull Go in the direction of what goes on on a modern multi-core processor, CSP is a quite abstract and simplified interface to the reality of buzzing threads and lots of test-and-set instructions that has to be going on underneath. Naked, that reality is unmanageable by mere human brains – at least that’s what the defect statistics from the conventional mutex-and-mailbox approach to doing so say to me.

elimisteve commented 3 years ago

I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits.

@shelby3 How, specifically?

krolaw commented 3 years ago

I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits.

@shelby3 How, specifically?

Consider a variable holding a security-relevant value being left in an indeterminate state by a data race. Targeted parallel calls cause races, causing the variable to become undetermined, causing access to be granted when it shouldn't be. The attack may need to be sustained, depending on the likelihood of success. Memory is often reused in strange ways; the variable may not even be part of the race directly. Undefined behaviour is undefined, including leaving the vault unlocked.

cristian-ilies-vasile commented 3 years ago

The first 250 simple tasks successfully ran on the R U S H proof of concept (inspired by Andreas Prell's PhD thesis).

import strconv

fn x_sum(a int, b int, output_ch chan string) { // original function
    x := a + b
    output_ch <- ''
    output_ch <- strconv.v_sprintf('x_sum= %04d', x)
    output_ch <- ''
}
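
A hypothetical driver for the task above (not part of the PoC sources): spawn it and drain the three messages it sends back over the channel.

fn main() {
    output_ch := chan string{cap: 3}
    go x_sum(2, 3, output_ch)
    for _ in 0 .. 3 {
        println(<-output_ch)
    }
}
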
dumblob commented 3 years ago

@cristian-ilies-vasile could you put all the sources to one gist to make it easier to comment on and discuss it (with the nice side effect of not polluting this thread)? Thanks :wink:.

dumblob commented 3 years ago

A recent note on the above-discussed topic of the difference between IO-bound computation and CPU-bound workloads by the author of Weave: https://nim-lang.org/blog/2021/02/26/multithreading-flavors.html

cristian-ilies-vasile commented 3 years ago

@dumblob Thank you for the link. I had parked the Weave implementation due to some personal reasons. However, in 2021 the bottleneck is memory bandwidth, not storage/disk/SSD - see these fresh posts below. So I still think that a Weave adjusted for IO could be a better approach to concurrency or no concurrency at all at the language level, and use new platforms like Microsoft dapr (dapr.io) or lunatic (https://lunatic.solutions/)

o Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation / https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation

The new wave of thinking: One thread-per-core architecture
o Seastar is the first framework to bring together a set of extreme architectural innovations / http://seastar.io/
o The Impact of Thread-Per-Core Architecture on Application Tail Latency / https://helda.helsinki.fi//bitstream/handle/10138/313642/tpc_ancs19.pdf?sequence=1
o Glommio - a Thread-per-Core Crate for Rust & Linux / https://www.datadoghq.com/blog/engineering/introducing-glommio/
o What is io_uring? / https://unixism.net/loti/what_is_io_uring.html

dumblob commented 3 years ago

or no concurrency at all at the language level, and use new platforms like Microsoft dapr (dapr.io) or lunatic (https://lunatic.solutions/)

These platforms (and many others) need to be programmed, though, in some language which needs to do all the heavy lifting :wink:.

The new wave of thinking: One thread-per-core architecture

Hm, how is this different from what we've discussed so far (i.e. one os-thread per physical core and then millions of fibers/tasks per such os-thread using work stealing, with some hacks providing preemptiveness to avoid starvation and some level of fairness)?

cristian-ilies-vasile commented 3 years ago

It is a little bit different.

That means once a task is put on a CPU, you cannot suspend it; you just wait for it to finish. For a lot of tasks the load on each resource is fairly distributed due to the work-stealing algorithm.

dumblob commented 3 years ago

That means once a task is put on a CPU, you cannot suspend it; you just wait for it to finish. For a lot of tasks the load on each resource is fairly distributed due to the work-stealing algorithm.

If it's really like this, then this has a serious issue with starvation - imagine writing a few tasks, each (accidentally) with an infinite loop. It's also "not fair" and thus no real-time guarantees (required by games, audio/video calls, etc.) can be made.

Weave has some countermeasures (it has some "good enough" fairness and, if that's not enough, it provides a shared input queue; regarding starvation I think Weave doesn't do anything, but there is an easy solution with longjmp to provide real-time guarantees on top), so it's not an issue.

But if you're talking about something else than Weave, then this starvation and unfairness is a show stopper for a default concurrency backend in V.

elimisteve commented 3 years ago

@dumblob Is it clear how much overhead Weave requires in order to do more sophisticated scheduling?

dumblob commented 3 years ago

Is it clear how much overhead Weave requires in order to do more sophisticated scheduling?

Yes - just see the benchmarks in the Weave repo. It's actually basically fully amortized (not kidding!). So the only question is always: which Weave parameters should I choose for this particular set of tasks in order to trade (tail) latency for throughput and vice versa. The Weave defaults, though, are pretty much usable even for games, so this parameter choosing shall happen only in extreme scenarios (e.g. someone writes her own kernel or an audio/video signal processing server like JACK, etc.).

cristian-ilies-vasile commented 3 years ago

@dumblob My RUSH project is 80% like the original design and implementation done by Andreas Prell in his PhD thesis, so it is a cousin of Weave.

I put here some non-audited figures, just to have a perspective. Weave uses the word worker and I use the word robot for the same piece of code. The figures are taken from different runs on different machines, so do not try to correlate them.

A - load on each robot

| robot | num_tasks_exec | robot_run_function_duration (micro sec) | delta tasks % | delta timing % |
| --- | --- | --- | --- | --- |
| 1 | 39,024 | 687,698 | -2.44% | -0.61% |
| 2 | 42,060 | 750,044 | 5.15% | 8.40% |
| 3 | 37,339 | 635,525 | -6.65% | -8.15% |
| 4 | 41,577 | 694,485 | 3.94% | 0.37% |
| total | 160,000 | 2,767,752 | | |
| avg | 40,000 | 691,938 | | |

B - tasks distribution spread by robot's state. If you do not count the last 3 entries (because the robot was sleeping), the overall overhead is less than 5%.

| task | duration (micro secs) |
| --- | --- |
| robot_fsm_crunch_tasks_duration | 302944 |
| robot_fsm_move_tasks_on_deque_duration | 8768 |
| robot_fsm_1st_read_reqs_duration | 946 |
| send_notif_to_chief_robot_long_duration | 635 |
| send_req_to_peer_robot_duration | 98 |
| robot_fsm_1st_reqs_and_tasks_deq0_duration | 64 |
| robot_fsm_1st_fwd_reqs_duration | 11 |
| robot_fsm_ask_for_task_lr_duration | 744 |
| robot_fsm_wait_and_sleep_duration_reqs | 14293 |
| robot_fsm_wait_and_sleep_duration_tasks | 16668 |


dumblob commented 3 years ago

On the way to an ultra low overhead equal-priority hard real time preemptive scheduler

Motivated (and surprised) by https://twitter.com/v_language/status/1388154640768851969 I've decided to finally research and write down my ideas I have in my head for about a year.

TL;DR It turns out V could have zero overhead full preemption on Linux and Windows ~Subsystem for Linux~ and bare metal (and maybe also some BSD) platforms.

As described in https://github.com/vlang/v/issues/1868#issue-489433130, V could (should!) leverage combining cooperative scheduling of "go routines" among a dynamic pool (1:1 to the number of currently powered-on CPU cores) of system-level threads ("os threads") with os-triggered preemption (bare metal has a counter/clock interrupt). This os-triggered preemption (an "interrupt" - e.g. a signal) is, though, costly when it is not necessary (and with cooperative scheduling the majority of them are not necessary). It is the costlier the more real-time responses are needed, which is typical for low-latency scenarios like audio, video, games, etc. - all coincidentally domains which also utterly seek high-performance languages to leverage the HW up to the last bit and watt. And that's the domain where V shines or wants to shine.

Now, when ~WSL supports seamless Linux interaction (incl. games with high fps) through WSLg~ Microsoft announced support for eBPF directly in Windows (https://github.com/Microsoft/ebpf-for-windows ), it becomes viable to leverage eBPF in the Linux/Windows/BSD kernel (I'd recommend first reading some eBPF programs written with BCC - e.g. https://github.com/iovisor/bcc/blob/master/examples/tracing/sync_timing.py ).

The idea is that user space can write to kernel space (thanks to eBPF) without any interrupt, context switch, locking, or any other restriction - basically just a write at the pointer address. The eBPF app (to whose kernel-space memory the user-space app can arbitrarily write) can then be run each time a kernel timer fires (it doesn't matter who set up this timer - some kernel driver, some other user-space app, or whatever - as we just care about being woken up "frequently enough"; in the case of NO_HZ there might be a need to set some in-kernel periodic timer according to the user-space app's needs, though).

If there is a free-running counter in the eBPF app's kernel space, incremented only by a given user-space app thread, and the eBPF app maintains a "cached" copy of this counter from the last time the eBPF app ran, together with a monotonic timestamp of the cached value, then by comparing this timestamp it can decide whether the user-space app thread has already gone too long without incrementing the counter, and can organize an interrupt of that thread. The user-space app thread increments the counter each time cooperative scheduling happens on that particular (virtual) CPU core. Note that no slow atomic operations (reads, increments, etc.) are needed here, as we don't care about the edge cases - in the worst case it will just fire the interrupt negligibly more frequently.
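
To illustrate the per-tick decision (not eBPF code - just a V-level sketch with hypothetical names, showing the logic the eBPF program would apply to the shared counter):

struct ThreadWatch {
mut:
    counter        u64 // bumped by the user-space thread at every cooperative yield
    cached_counter u64 // value the watchdog saw on its previous timer tick
    cached_at_ns   u64 // monotonic timestamp of that cached value
}

// should_preempt refreshes the cache if the thread yielded since the last tick,
// otherwise it asks for a signal once the thread exceeded its budget without yielding.
fn should_preempt(mut w ThreadWatch, now_ns u64, budget_ns u64) bool {
    if w.counter != w.cached_counter {
        w.cached_counter = w.counter
        w.cached_at_ns = now_ns
        return false
    }
    return now_ns - w.cached_at_ns > budget_ns
}

fn main() {
    mut w := ThreadWatch{}
    println(should_preempt(mut w, 5_000_000, 10_000_000))  // false: still within the 10ms budget
    println(should_preempt(mut w, 20_000_000, 10_000_000)) // true: no yield observed for 20ms
}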

Of course, for thousands of CPU cores there might arise the need to somehow increase efficiency of the eBPF app - maybe by grouping the timer interrupts (though hrtimers in Linux should support tens and hundreds of thousands of timers without performance degradation), but that's just an implementation detail.

Why should this be possible? Because eBPF nowadays supports:

  1. maps (which is the term designating the shared memory address space between kernel space and user space)
  2. attaching e.g. to perf:perf_hrtimer/timer:hrtimer_expire_entry (for periodic running of the eBPF program; for multiple timers per PID struct perf_event_attr could be used to tell them apart)
  3. sending a signal
  4. running in non-root user space (eBPF was originally designed to be run only as root in user space) - this seems to be nowadays the default
  5. running on FreeBSD (though probably with some restrictions potentially hindering this usage)
  6. running on ~WSL (thus on Windows though indirectly - but that doesn't matter anyway to those wanting to squeeze the maximum performance out their HW)~ Windows natively

Implementation note: on *nix systems sending an interrupt signal to the given thread seems to be a bit more work than expected (first the signal is sent to the process, and the signal handler will then most of the time need to "distribute" the signal to the appropriate thread: signal handlers are per-process but signal masks are per-thread).

Note, this is a generic solution and I'm certain readers from other projects (incl. Go lang, D lang, Nim, Python, ... and other languages and frameworks) might jump on this and try to implement it as low hanging fruit (hey, an attribution would be nice guys :wink:).

@mintsuki @UweKrueger @cristian-ilies-vasile thoughts?

Additional notes

  1. The described scheduler is the simplest naive form and thus doesn't support e.g. priority groups, which might (or might not - depending on many other factors) be somewhat important when "blindly" using it as the os-level scheduler in in-V-written operating systems (i.e. big bare metal apps).
  2. This brings the burden of eBPF compiler, but no burden on distribution as eBPF apps shall be platform-agnostic if not using unstable trace points (trace points are basically hooks into various parts of Linux/BSD/Windows kernel).
  3. There'll be some work needed for the scheduler to instruct the eBPF app to accommodate at run time after e.g. some initially unavailable CPU cores appeared (either not powered on, or simply physically added, or virtually added by the underlying hypervisor). This shouldn't be needed for CPU core removal/power_down (incl. re-addition of the same CPU core) as there won't be much overhead with checking these "superfluous" stale counters in the eBPF app.
dumblob commented 3 years ago

Interesting - just 3 days after I published the above sketch, Microsoft announced support for eBPF directly in Windows - see https://github.com/Microsoft/ebpf-for-windows . And if they'll support some timer hooks (and I'm 100% sure they will), then I think there is no excuse not to write a scheduler leveraging that sooner or later :wink:.