tc39 / proposal-ecmascript-sharedmem

Shared memory and atomics for ECMAScript
Mozilla Public License 2.0

Why not synchronic #12

Closed jfbastien closed 8 years ago

jfbastien commented 8 years ago

C++ is considering a new primitive called synchronic which is similar to futex but has a few different tradeoffs, would work better on some hardware, and would be portable. There's a sample implementation with data, and @ogiroux is working on a new paper with an updated API.

It would be good to understand why synchronic isn't used in JavaScript's shared memory proposal:

[image]

lars-t-hansen commented 8 years ago

There are some notes in the DISCUSSION.md document regarding "why futex and not synchronic", but more are in the original proposal. I'll move those over to DISCUSSION.md and try to make sure everything is there, and then we can discuss that.

(@jfbastien, there appears to be no "demonstration below", only a png.)

lars-t-hansen commented 8 years ago

[This is the discussion thread on the original spec that gave rise to the present bug.]

jyasskin / 4:35 PM Mar 2:

+ogiroux@gmail.com has another API at http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4195.pdf that's based on Futex-y primitives but may be easier to work with. It can be implemented as a library on top of what your proposal already has, so it's not essential, just something to be aware of.

Lars Hansen / 4:54 PM Mar 2:

Thanks, very helpful.

Hans Boehm / 9:00 PM Mar 2:

I second Jeffrey's recommendation. Especially since futexes don't seem to be universally supported?

Lars Hansen / 7:30 PM Mar 3:

Futexes supply the API, but even on Linux we were not planning to use the native futexes directly.

(I like the "synchronic" API much better than futexes and will investigate it as a replacement.)

JF Bastien / 5:43 PM Mar 4:

+1 on synchronic. On feasibility: PNaCl exposes a restricted futex on Linux/Windows/OSX, and it works, but I agree that it should be an implementation detail of synchronic.

JF Bastien / 11:19 PM Mar 4:

For details on NaCl's futex, see: https://chromium.googlesource.com/native_client/src/native_client/+/master/src/untrusted/irt/irt.h

Lars Hansen / 10:50 AM Aug 21:

Synchronic works poorly with asm.js and has not been included in the new spec. There's a section on synchronic below that addresses this in more detail.

Olivier Giroux / 4:14 AM Aug 22:

The exact API for synchronic has been in flux lately. What I'd like to focus on, if you guys are interested in a discussion about it, are the semantics. I would like to make a case for weaker semantics than futex guarantees, primarily for performance reasons.

The only essential difference between a synchronic and a futex is that the former is allowed to just poll for a while and skip the OS involvement if the condition was met during polling. This difference means that a synchronic can sometimes be satisfied in <100ns, when the minimum cost for futex is more like ~50us. A good-QOI implementation of synchronic will rapidly fall back to a lower-energy mechanism to avoid wasting resources, which could involve both yielding and eventually the OS queue as a last resort.

When you code against futex directly, you're on the hook to mix polling, yielding and futex waits to cover the range of temporal response. This is a sort of dark art. I'm sure that if you apply yourselves you can all master it, but do all your users need to?
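In JS terms, the hand-rolled mix Olivier describes ends up looking something like this (a sketch of my own, using the modern `Atomics.wait` name for what this draft called `futexWait`; `SPIN_COUNT` is an arbitrary guess, which is exactly the dark-art part):

```javascript
// A hand-rolled "poll, then block" wait of the kind users must write
// against raw futexes. SPIN_COUNT is a guess; good values are workload-
// and platform-dependent.
const SPIN_COUNT = 1000;

function waitForChange(i32, index, expected, timeoutMs) {
  // Fast path: poll in user space, hoping to catch the update quickly
  // and skip the expensive OS-level wait entirely.
  for (let i = 0; i < SPIN_COUNT; i++) {
    if (Atomics.load(i32, index) !== expected) return 'ok';
  }
  // Slow path: block in the host. Atomics.wait returns 'not-equal' if
  // the cell no longer holds `expected`, 'timed-out' on timeout.
  return Atomics.wait(i32, index, expected, timeoutMs);
}
```

A tuned version would also yield between the spin and the blocking wait; JS has no portable way to express that, which becomes relevant later in this thread.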

I've been developing the synchronic library to try to prevent that. The API isn't the most essential thing about it, it's the idea of putting a gasket on the complexity that most users don't need. Secondarily, I'm putting a gasket on the portability problems too - Mac OS X and iOS aren't so well equipped as Linux and Windows in this department.

I would love more feedback.

Cheers, Olivier

lars-t-hansen commented 8 years ago

(Visibility: @ogiroux @jfbastien)

I really like synchronic (the spec and the code I've seen so far, anyway - Olivier's original paper and the code that was on Google Code). For programming in plain JS, in contrast with asm.js, I would want to stay away from futexes and use synchronic cells as the primitives in most cases.

So far, though, futexes have seemed like the better common primitive API for asm.js and plain JS. There are a couple of reasons.

Futexes are simple - no cross-worker storage management is needed beyond agreeing on an address, and no procedural initialization is needed either, so the protocols involved are simple. In contrast, a synchronic requires at least a couple of cells - arguably the size depends on the type of datum held in the synchronic and is variable - and may require some initialization, so the protocol is a bit more complicated. One of the uses for futexes and synchronic will be to implement pthreads for translated C/C++, with a potentially unbounded number of locks, so the complexity of these protocols is not beside the point. I don't want to overstate my case here, though; I'm not saying it wouldn't work.

Also, I think we'll need plain atomics regardless of whether we have synchronic or not, so synchronic does not have secondary benefits in API size reduction or whatever.

Finally, when programming in plain JS we probably want synchronics to be objects, while in asm.js they can't be (no GC, for starters). This isn't a huge consideration, but having "two kinds of synchronics" isn't appealing, at least not yet.

At the same time, futexes are hard to use; I've implemented a synchronic API for JS on top of futexes and I've been experimenting with spin loops for performance, and of course the performance improves when I guess right, but the guessing is hard. It would be lovely to abstract that away. So I've been wondering whether it would be possible and desirable to keep the futex API with its simplicity of one location of known size and no initialization protocol, but to incorporate the fast-wait idea from synchronic into it. This probably would be similar to Olivier's expectUpdate primitive, I'm not sure yet.
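The kind of synchronic-on-futex shim described here might look roughly like this (a minimal sketch of my own, not Lars's actual code, using the final `Atomics.wait`/`Atomics.notify` names for the draft's futexWait/futexWake; the spin count is a guess, which is exactly the tuning problem being described):

```javascript
// Minimal synchronic-style cell layered on futex primitives: store()
// publishes a value and wakes waiters; expectUpdate() waits until the
// cell no longer holds `old`, spinning briefly before blocking.
class SynchronicInt32 {
  constructor(i32, index) {
    this.i32 = i32;
    this.index = index;
  }
  load() {
    return Atomics.load(this.i32, this.index);
  }
  store(v) {
    Atomics.store(this.i32, this.index, v);
    // Unconditional wake; a tuned implementation would track waiters.
    Atomics.notify(this.i32, this.index);
  }
  expectUpdate(old, timeoutMs = Infinity) {
    for (let i = 0; i < 1000; i++) {          // guessed spin count
      if (Atomics.load(this.i32, this.index) !== old) return;
    }
    while (Atomics.load(this.i32, this.index) === old) {
      Atomics.wait(this.i32, this.index, old, timeoutMs); // block until woken
      if (timeoutMs !== Infinity) return;     // give up after one timed wait
    }
  }
}
```

Note the "couple of cells plus initialization" issue from above does not show up in this toy version only because it holds a single int32 and needs no setup protocol.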

jfbastien commented 8 years ago

Synchronic was pretty heavily revised by @ogiroux in P0126R0, and it looks like it's going forward with a few wording changes. I think the new interfaces will address some of @lars-t-hansen's concerns.

lars-t-hansen commented 8 years ago

The new API in that document is very compact compared with the old one (I'm surprised at some of the omissions actually) but I don't understand how it addresses any of my concerns, exactly (see my Aug 31 remark above).

Anyway I just want to note that I'd like to look into a synchronic API for JS in earnest after the January TC39 meeting, to see if we can find something that's workable for JS and asm.js. I've been using synchronics (implemented on top of futexes) in various sample and demo programs and in addition to the promise of better performance they are a lot less error-prone and cumbersome than futexes.

A good goal would be to hammer out a strawman API and a rough implementation by the end of February.

lars-t-hansen commented 8 years ago

@ogiroux / @jfbastien, I have some questions about the revised synchronic proposal referenced above.

The proposal states, "Invocations of the expect functions that are unblocked by the invocation of a notify function may re-evaluate the user-provided predicate and block again." Am I to take that literally -- the proviso only applies to the predicate form -- or is it an oversight in the proposal, and it applies also to the value form?

Is there an updated document for synchronic or is the 2015-09-24 version the latest?

Is there a prototype implementation of this proposal somewhere?

lars-t-hansen commented 8 years ago

Ah, looks like @ogiroux's github repo has current code.

lars-t-hansen commented 8 years ago

OK, so putting the current code and the current proposal together with Olivier's last comment above, I now understand better. The current synchronic proposal really is "only" a better (more performant) futex, it is not the flexible atomic cell with reliable signaling from the older proposal.

Since expect_update / expect_update_for are a drop-in replacement for futexWait, and notify with the special case of a do-nothing user function is a drop-in replacement for futexWake, it's probably the case that these synchronics could replace the futexes in the current JS proposal without much risk. Given the promise of better performance they probably should, if we can get the API worked out.

We'd lose the direct ability to wake some number k of waiting threads (1 < k < all), and futexWakeOrRequeue is probably lost too. I expect neither is critical.

jfbastien commented 8 years ago

Latest paper: http://open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0126r1.pdf

taisel commented 8 years ago

I'm fine with SHM being delayed in moving to production if it means continued API changes for performance and efficiency reasons. My own concern is ports to JS from C++, and I'm interested in how much care these API changes will require when porting.

lars-t-hansen commented 8 years ago

@jfbastien, @ogiroux, @binji

A proposal for synchronics in JS is up. The intent is to replace futexes with these.

Comments and feedback on the proposal would be most welcome (notably see the Discussion section, where there are some items that are really questions and open issues); this bug is a good place for feedback. If all I hear is crickets I will contact some of you directly next week.

The proposal text is in this repo on the branch "synchronic", in tc39/synchronic.html.

binji commented 8 years ago

I like the proposal. Have you explored porting the emscripten pthread support using synchronics instead of futexes? I'm not too concerned about code that uses futexes currently; it sounds like you didn't find much.

OTOH, I am curious if there are many examples of implementations of various synchronization primitives using synchronics. There are many with futexes. Though maybe this is a good thing, as it will discourage people from implementing their own primitives. :)

lars-t-hansen commented 8 years ago

I like the proposal. Have you explored porting the emscripten pthread support using synchronics instead of futexes?

I have looked at that some. It's a mixed bag (this is from looking at the MUSL implementation that we use for Emscripten): Some places the library uses futex-in-a-loop to implement a synchronic, sometimes even with a preceding spin-wait. Other places it's more complicated.

I'm starting to believe that synchronics are complementary to futexes, or that futexWait is a variation on expectUpdate that checks its condition going in but not after that, but I'm not quite there yet.

OTOH, I am curious if there are many examples of implementations of various synchronization primitives using synchronics. There are many with futexes. Though maybe this is a good thing, as it will discourage people from implementing their own primitives. :)

I spent some time today building a lock and condition variable implementation on top of synchronics. Effectively - to get the desirable semantics - I ended up reimplementing in JS some of the logic from the futex implementation, notably in order to be able to wake just one waiter. That's not necessarily a problem, since synchronics got the job done, the code may well be faster than the futex-based implementation, and synchronics can be used for much faster custom synchronization. It just adds to the burden for the application programmer, if that's all we provide.

ogiroux commented 8 years ago

In my github there are example implementations of various locks and barriers. Easing the emulation of condition variables was not really a goal and I agree that you end up rolling your own futex in order to do that.

Just having support for the wake-one semantic carries a cost to performance even if you never use it. I don’t have a good sense of how often that is really necessary in practice. My sense is that most people will not write that code you wrote to emulate it, because they don’t actually need it - for example, you should not feel like you need it for locks and barriers.

Then, if I had a choice of two features to have, I’d prefer {synchronic + condition variable} over {synchronic + futex}.

lars-t-hansen commented 8 years ago

FWIW: I am reasonably sure - having sketched out some code - that futexes as currently specified in the shared memory spec can be polyfilled with pretty good fidelity and (important) not very high complexity on top of synchronics. I want to finish that code and test it and benchmark it before I conclude, but I'm inclined to believe that futexes don't need to stay in the spec if we have synchronics.

lars-t-hansen commented 8 years ago

Plausible futex polyfill: https://github.com/lars-t-hansen/parlib-simple/blob/master/bleeding-edge/futex.js

lars-t-hansen commented 8 years ago

The feedback has been positive and I've satisfied myself that we can build what we need on top of synchronics, so I'll merge the synchronics spec into the main spec and present the new API at the next TC39 meeting. For the time being I will flag futexes in the spec as obsolete but leave them in for reference; after all, they're implemented in Firefox and Chrome.

jfbastien commented 8 years ago

Awesome! We now expect synchronic to be in the C++ concurrency TS2.

lars-t-hansen commented 8 years ago

@pizlonator @bterlson @littledan: Any reactions to this proposal? Going once, going twice...

To summarize:

  • We integrate the current synchronic draft into the spec.
  • We mark futexes as obsolete now and remove them from the draft once the synchronic spec has received some more attention (in a month or so).

pizlonator commented 8 years ago

I am ambivalent to this. :-)

-Filip

On Mar 16, 2016, at 3:37 AM, Lars T Hansen notifications@github.com wrote:

@pizlonator @bterlson @littledan: Any reactions to this proposal? Going once, going twice...

To summarize:

We integrate the current synchronic draft into the spec. We mark futexes as obsolete now and remove them from the draft once the synchronic spec has received some more attention (in a month or so).

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub

littledan commented 8 years ago

@lars-t-hansen No particular objection, though I don't personally feel all that qualified to weigh the tradeoffs between these two primitives. The synchronic text looks fine at a first skim.

With such a major change still in progress ("in a month or so"), sounds like SharedArrayBuffer isn't quite ready for Stage 3 review yet, or am I misunderstanding?

jfbastien commented 8 years ago

Agreed that this is likely not ready for stage 3. Maybe we need a separate issue to track what's needed to get to stage 3?

lars-t-hansen commented 8 years ago

@littledan, I was hoping we could do a round of reviews before the March meeting (as time allows, for the people involved, since the feedback really is desired on my part) but there would have to be another round in the early-May timeframe, before the Munich meeting, to determine if the spec is baked.

@jfbastien, the stage 3 issues are all labeled as such in the issue overview and none will be closed until they are resolved. Do you need something else for that? The natural thing here would be to put the synchronics in now, then let them stew for a while and file new issues against them, all labeled as Stage 3.

pizlonator commented 8 years ago

I think that I’d like to take some time to understand this Synchronic vs. Futex trade-off better.

With futexes, I know how to implement:

I know how to implement these things because Linux has done it and we did it in WebKit on top of our own userlevel futex-like abstraction.

With synchronic, I’m less confident. I’d like to become more confident before locking in my opinion.

Do y'all believe that the move to synchronic already has a lot of consensus? Can you share some thoughts about why synchronic is better?

-Filip

On Mar 16, 2016, at 11:05 AM, littledan notifications@github.com wrote:

@lars-t-hansen No particular objection, though I don't personally feel all that qualified to weigh the tradeoffs between these two primitives. The synchronic text looks fine at a first skim.

With such a major change still in progress ("in a month or so"), sounds like SharedArrayBuffer isn't quite ready for Stage 3 review yet, or am I misunderstanding?


lars-t-hansen commented 8 years ago

@pizlonator, I wouldn't say there's "a lot of consensus", but reception has been generally positive among those who are watching this spec closely.

The issue with futexes (see paper linked from @jfbastien's comment above as well as earlier versions of that paper, higher up) is that they are fairly expensive, so to get good performance you must wrap machinery around them. That machinery is not directly available in JS and might anyway be platform-dependent (spinloops, micro-waits). It's not appealing to have to code it in JS since it's platform-dependent. Synchronics, being able to wait on a value change instead of a signal, bake that in, and can thus perform better than futexes in most code not written by absolute experts, and the platform dependencies are hidden in the implementation.

I did some experiments trying to make futexes faster by using synchronic-like mechanisms behind the scenes but performance did not come close to matching even fairly naive synchronics, and I decided not to pursue that further.

I've polyfilled futexes on top of synchronics (see a comment from me above) and I'm pretty happy with how that turned out. This has pretty much convinced me that it's not scary to drop futexes. I'd be especially interested in hearing your opinions here.

pizlonator commented 8 years ago

On Mar 16, 2016, at 12:01 PM, Lars T Hansen notifications@github.com wrote:

@pizlonator, I wouldn't say there's "a lot of consensus", but reception has been generally positive among those who are watching this spec closely.

The issue with futexes (see paper linked from @jfbastien's comment above, https://github.com/tc39/ecmascript_sharedmem/issues/12#issuecomment-185944586, as well as earlier versions of that paper, higher up) is that they are fairly expensive, so to get good performance you must wrap machinery around them. That machinery is not directly available in JS and might anyway be platform-dependent (spinloops, micro-waits). It's not appealing to have to code it in JS since it's platform-dependent. Synchronics, being able to wait on a value change instead of a signal, bake that in, and can thus perform better than futexes in most code not written by absolute experts, and the platform dependencies are hidden in the implementation.

I don’t see any data in the thing that @jfbastien linked.

In particular he links to this: https://github.com/ogiroux/synchronic/

The title of the link is "sample implementation with data".

If you click on the link, I think you will agree that it’s just a bunch of files and no data.

Can you guys be more specific about what data you have?

I did some experiments trying to make futexes faster by using synchronic-like mechanisms behind the scenes but performance did not come close to matching even fairly naive synchronics, and I decided not to pursue that further.

I agree that futexes are meant to be on the slow path of some lock algorithm. It’s easy to write an efficient lock with futexes. I suggest looking at Rusty Russel’s examples. Unless you need things like priority inheritance, priority ceiling, or deadlock detection, you can just use one of the simple futex locks.
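One of the simple futex locks being referred to is the classic three-state design from the futex literature. A JS transliteration of that lock (my own sketch, using the final `Atomics.wait`/`Atomics.notify` names for the draft's futexWait/futexWake) looks like:

```javascript
// Classic three-state futex mutex: 0 = unlocked, 1 = locked with no
// waiters, 2 = locked with possible waiters. The unlock fast path is a
// single atomic exchange; the wake call happens only when some thread
// has advertised that it is waiting.
const UNLOCKED = 0, LOCKED = 1, CONTENDED = 2;

function lock(i32, i) {
  if (Atomics.compareExchange(i32, i, UNLOCKED, LOCKED) === UNLOCKED) return;
  // Slow path: mark the lock contended, then sleep until woken.
  do {
    if (Atomics.load(i32, i) === CONTENDED ||
        Atomics.compareExchange(i32, i, LOCKED, CONTENDED) !== UNLOCKED) {
      Atomics.wait(i32, i, CONTENDED);  // blocks only while the word is 2
    }
  } while (Atomics.compareExchange(i32, i, UNLOCKED, CONTENDED) !== UNLOCKED);
}

function unlock(i32, i) {
  // Fast path: if nobody advertised waiting, skip the wake entirely.
  if (Atomics.exchange(i32, i, UNLOCKED) === CONTENDED) {
    Atomics.notify(i32, i, 1);          // someone may be sleeping: wake one
  }
}
```

The point of the design is that the uncontended acquire and release each cost exactly one atomic read-modify-write, which is the property the rest of this discussion hinges on.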

I don’t think we’ll want those advanced features like priority inheritance, but it’s interesting that there exists a story for how to do those things on top of futexes. Does synchronic even have a story for any of those advanced features?

I've polyfilled futexes on top of synchronics (see a comment from me above) and I'm pretty happy with how that turned out. This has pretty much convinced me that it's not scary to drop futexes. I'd be especially interested in hearing your opinions here.

I’m glad I asked because now I’m starting to think that the decision to drop futexes and use synchronic was not data-driven and based on some false presumptions, like the claim that futexes require platform-dependent machinery. In my experience, they don’t. I don’t see any data to suggest that there is any platform-dependence or that this platform-dependence goes away with synchronic.

Feel free to convince me otherwise with data. :-)

-Filip


lars-t-hansen commented 8 years ago

This is the latest paper for the C++ proposal and contains some motivation and a spec: http://open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0126r1.pdf. There is more background in an earlier paper, though the API is pretty different: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4195.pdf. Both of those are linked from comments above.

The decision to explore synchronics is absolutely data-driven. An experiment using futexes in a naive way (in Firefox) to exchange values in a ping-pong fashion between threads showed futexes to be much faster than postMessage but still slow. I then experimented with speeding up their use, which amounted to (since it was just an experiment) a spinloop waiting for a value change before going into the futex, this showed significant speedup (these numbers are from memory, but roughly increasing message count 20x from 300K/s to 6M/s). Clearly this was still portable JS code, but the claim is that appropriate spin counts are platform specific and other mechanisms to get the best performance, such as relaxing within the spinloop, are simply not available in JS at all. Experiments bear this out: Native synchronics in Firefox (not in any current branch) are faster still; adding a PAUSE instruction to the spinloop in the native implementation significantly improved performance (10-20%, from 9M/s to 11M/s) over the JS implementation.

I don't know if this is the kind of data that you find interesting, but if not I'd appreciate some suggestions for other types of experiments.

Synchronics are probably a hair easier to use correctly than futexes though it's probably a wash. (Though, for some kinds of codes where they fit well they are probably significantly easier to use correctly.) But the motivation really has been to get better platform-independent performance.

pizlonator commented 8 years ago

On Mar 16, 2016, at 12:34 PM, Lars T Hansen notifications@github.com wrote:

This is the latest paper for the C++ proposal and contains some motivation and a spec: http://open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0126r1.pdf. There is more background in an earlier paper, though the API is pretty different: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4195.pdf. Both of those are linked from comments above.

Neither of these papers has relevant data. The second paper compares to std::mutex, not futexes. I’m appalled by the low quality of scientific methodology used in that paper:

  • It ignores the fact that the fast path performance of uncontended mutex acquire is the single most important metric.
  • It does not measure CPU time wastage under contention for long critical sections.
  • It does not specifically measure performance under microcontention.
  • There are no error bars anywhere in sight.
  • They only consider one platform.

So, this data is not relevant to this discussion at hand.

Also, these papers appear to disagree on whether you need to call notify() or if calling store() is sufficient. Can you clarify this?

The decision to explore synchronics is absolutely data-driven. An experiment using futexes in a naive way (in Firefox) to exchange values in a ping-pong fashion between threads showed futexes to be much faster than postMessage but still slow. I then experimented with speeding up their use, which amounted to (since it was just an experiment) a spinloop waiting for a value change before going into the futex, this showed significant speedup (these numbers are from memory, but roughly increasing message count 20x from 300K/s to 6M/s). Clearly this was still portable JS code, but the claim is that appropriate spin counts are platform specific and other mechanisms to get the best performance, such as relaxing within the spinloop, are simply not available in JS at all. Experiments bear this out: Native synchronics in Firefox (not in any current branch) are faster still; adding a PAUSE instruction to the spinloop in the native implementation significantly improved performance (10-20%, from 9M/s to 11M/s) over the JS implementation.

Futexes are not intended to be used as the fast path. They are for the slow path. The goal is to use them to implement some kind of mutex based on one of Rusty’s algorithms.

It appears that synchronics also require you to wrap them in some kind of mutex that follows some kind of implementation idiom. The second paper you link appears to show that to use synchronic to implement a mutex, you must write your own spin loop. The same is true for implementing mutexes on top of futexes. Therefore, I don’t see a significant difference in ease-of-use.

I reject the claim that spin counts must be tuned on a per-platform basis because there is no data to back up this claim. Back in Jikes RVM we found that although there is some room for per-benchmark and per-platform tuning of the spin count, there is a large plateau of “good” spin counts between 10 and 1000. (FWIW I usually choose 40 whenever I implement a new lock, like the ones in WebKit.)

Finally, I just don’t really know what to make of your benchmark. Surely whether we pick synchronics or futexes the performance will only be good if they are used as intended. Nobody is promising a miracle where a badly written program performs magically just because we chose synchronics. I think it would be better to consider the canonical Rusty-style mutex written using the futex Atomics in your draft, and then examine how this performs versus a canonical synchronic-style mutex. When we do this, we should do a proper benchmark that considers how people use mutexes rather than inventing something new. This means testing:

  • Uncontended acquire in the 1-thread case.
  • Uncontended acquire in the many-thread case.
  • Acquire with low probability of contention.
  • Microcontention.
  • Contention for long critical sections (along with CPU wastage data).
  • etc.

I have high confidence that it’s possible to make a good mutex on top of futexes. I have zero confidence that the same is true for synchronics, partly because the data I’ve seen on synchronics is not credible.

I don't know if this is the kind of data that you find interesting, but if not I'd appreciate some suggestions for other types of experiments.

Synchronics are probably a hair easier to use correctly than futexes though it's probably a wash. (Though, for some kinds of codes where they fit well they are probably significantly easier to use correctly.) But the motivation really has been to get better platform-independent performance.

Do you have data on the platform-independence? There is no such data in the papers you linked.

This still leaves me feeling that the switch away from futexes is not data-driven since the data collected so far is either not relevant or just plain bad.

That said, my objection to synchronics is soft to begin with. I’m just super surprised that you guys are making such a decision without having good data.

-Filip


pizlonator commented 8 years ago

I just looked at the synchronic spec some more as well as the available data, and I now believe that synchronic is the wrong approach and it will be harmful to web developers.

The most important performance metric for mutex implementations is the speed of uncontended acquire and release.

The one performance study (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4195.pdf) shows that at best, synchronic is as fast as std::mutex for uncontended lock acquisition, but no faster. That’s not good. It’s easy to write a mutex using futexes that is much faster than std::mutex. std::mutex has many costs because of ancient POSIX rules. If a freshly written mutex on top of synchronic also has those costs, then something horrible has gone wrong.

Based on reading the spec, I know what has gone wrong: synchronic::update()/store() must consult a side-table to check if there are any waiters, or at least notify the runtime of the possibility that this update may have unblocked a waiter. This is inefficient. Futexes allow the fast path of unlock() to CAS as a way of checking if there are waiters, and then avoid doing any checks in any side-tables if there aren’t waiters. Rusty’s preferred way of doing this seems to be that the futex client uses an int32 to count the number of waiters or to have a wait bit, which tells you if there may be at least one waiter. My preferred approach is to have a wait bit (see http://trac.webkit.org/browser/trunk/Source/WTF/wtf/Lock.h#L99). It’s also fine to use futexes in a manner that allows thundering herd: basically, if the wait bit is set, then you wake everyone on unlock. I have data that shows that this is fine in practice, and it’s a lot easier to implement.
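The wait-bit scheme can be sketched as follows (my own minimal JS rendering, not WebKit's code; the real WTF::Lock parks threads in a side table, which is omitted here, so this version uses the thundering-herd wake just described):

```javascript
// Wait-bit lock word: bit 0 = locked, bit 1 = some thread may be
// parked. The uncontended unlock is a single CAS and never consults
// any side table.
const IS_LOCKED = 1, HAS_WAITERS = 2;

function tryLock(i32, i) {
  // Succeeds only if the word was completely clear.
  return Atomics.compareExchange(i32, i, 0, IS_LOCKED) === 0;
}

function lockSlow(i32, i) {
  for (;;) {
    const w = Atomics.load(i32, i);
    if (!(w & IS_LOCKED)) {
      // Free: take it, preserving any wait bit already set.
      if (Atomics.compareExchange(i32, i, w, w | IS_LOCKED) === w) return;
    } else {
      // Held: advertise a waiter, then sleep on the word.
      if (Atomics.compareExchange(i32, i, w, w | HAS_WAITERS) === w) {
        Atomics.wait(i32, i, w | HAS_WAITERS);
      }
    }
  }
}

function unlock(i32, i) {
  // Fast path: nobody advertised waiting, so skip the wake entirely.
  if (Atomics.compareExchange(i32, i, IS_LOCKED, 0) === IS_LOCKED) return;
  // Thundering-herd slow path: clear the word and wake everyone; woken
  // threads loop in lockSlow and re-contend.
  Atomics.store(i32, i, 0);
  Atomics.notify(i32, i);
}
```

The contrast with the synchronic spec, as argued above, is that nothing on the unlock fast path here ever has to consider the possibility of waiters unless the wait bit says so.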

There is ample empirical evidence to suggest that there are many efficient lock implementations on top of futexes. We know why this is the case: futexes empower the client to avoid doing expensive operations on either lock() or unlock() if there is nobody waiting, while ensuring swift notification if there is someone waiting.

I don’t see any empirical evidence to suggest that synchronics can do this. Instead, I see a spec that seems to prevent you from implementing an efficient lock, and empirical evidence to suggest that synchronics perform badly (i.e. as slow as std::mutex).

Based on this, I strongly object to synchronics. Let’s use futexes instead.

-Filip

On Mar 16, 2016, at 1:02 PM, Filip Pizlo pizlo@mac.com wrote:

On Mar 16, 2016, at 12:34 PM, Lars T Hansen <notifications@github.com> wrote:

This is the latest paper for the C++ proposal and contains some motivation and a spec: http://open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0126r1.pdf. There is more background in an earlier paper, though the API is pretty different: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4195.pdf. Both of those are linked from comments above.

Neither of these papers have relevant data. The second paper compares to std::mutex, not futexes. I’m appalled by the low quality of scientific methodology used in that paper:

  • It ignores the fact that the fast path performance of uncontended mutex acquire is the single most important metric.
  • It does not measure CPU time wastage under contention for long critical sections.
  • It does not specifically measure performance under microcontention.
  • There are no error bars anywhere in sight.
  • They only consider one platform.

So, this data is not relevant to this discussion at hand.

Also, these papers appear to disagree on whether you need to call notify() or if calling store() is sufficient. Can you clarify this?

The decision to explore synchronics is absolutely data-driven. An experiment using futexes in a naive way (in Firefox) to exchange values in a ping-pong fashion between threads showed futexes to be much faster than postMessage but still slow. I then experimented with speeding up their use, which amounted to (since it was just an experiment) a spinloop waiting for a value change before going into the futex, this showed significant speedup (these numbers are from memory, but roughly increasing message count 20x from 300K/s to 6M/s). Clearly this was still portable JS code, but the claim is that appropriate spin counts are platform specific and other mechanisms to get the best performance, such as relaxing within the spinloop, are simply not available in JS at all. Experiments bear this out: Native synchronics in Firefox (not in any current branch) are faster still; adding a PAUSE instruction to the spinloop in the native implementation significantly improved performance (10-20%, from 9M/s to 11M/s) over the JS implementation.

Futexes are not intended to be used as the fast path. They are for the slow path. The goal is to use them to implement some kind of mutex based on one of Rusty’s algorithms.

It appears that synchronics also require you to wrap them in some kind of mutex that follows some kind of implementation idiom. The second paper you link appears to show that to use synchronic to implement a mutex, you must write your own spin loop. The same is true for implementing mutexes on top of futexes. Therefore, I don’t see a significant difference in ease-of-use.

I reject the claim that spin counts must be tuned on a per-platform basis because there is no data to back up this claim. Back in Jikes RVM we found that although there is some room for per-benchmark and per-platform tuning of the spin count, there is a large plateau of “good” spin counts between 10 and 1000. (FWIW I usually choose 40 whenever I implement a new lock, like the ones in WebKit.)

Finally, I just don’t really know what to make of your benchmark. Surely whether we pick synchronics or futexes, the performance will only be good if they are used as intended. Nobody is promising a miracle where a badly written program magically performs well just because we chose synchronics. I think it would be better to consider the canonical Rusty-style mutex written using the futex Atomics in your draft, and then examine how this performs versus a canonical synchronic-style mutex. When we do this, we should do a proper benchmark that considers how people use mutexes rather than inventing something new. This means testing:

  • Uncontended acquire in the 1-thread case.
  • Uncontended acquire in the many-thread case.
  • Acquire with low probability of contention.
  • Microcontention.
  • Contention for long critical sections (along with CPU wastage data).
  • etc.
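As a sketch of the first bullet only, an uncontended-acquire micro-benchmark could look like this. The function name is invented and the "lock" is a bare compare-exchange pair rather than a full mutex; a real benchmark would follow the full list above.

```javascript
// Micro-benchmark sketch: uncontended acquire/release in the 1-thread case,
// implemented directly as a compare-exchange pair on a shared Int32Array.
function benchUncontended(iterations) {
  const lock = new Int32Array(new SharedArrayBuffer(4));
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    // Acquire: 0 -> 1 must always succeed, since there is no contention.
    if (Atomics.compareExchange(lock, 0, 0, 1) !== 0) throw new Error("contended?");
    // Release: back to 0.
    Atomics.store(lock, 0, 0);
  }
  return Date.now() - start; // elapsed milliseconds
}
```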

I have high confidence that it’s possible to make a good mutex on top of futexes. I have zero confidence that the same is true for synchronics, partly because the data I’ve seen on synchronics is not credible.

I don't know if this is the kind of data that you find interesting, but if not I'd appreciate some suggestions for other types of experiments.

Synchronics are probably a hair easier to use correctly than futexes, though it's probably a wash. (For some kinds of code where they fit well, they are probably significantly easier to use correctly.) But the motivation really has been to get better platform-independent performance.

Do you have data on the platform-independence? There is no such data in the papers you linked.

This still leaves me feeling that the switch away from futexes is not data-driven since the data collected so far is either not relevant or just plain bad.

That said, my objection to synchronics is soft to begin with. I’m just super surprised that you guys are making such a decision without having good data.

-Filip

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/tc39/ecmascript_sharedmem/issues/12#issuecomment-197505976

lars-t-hansen commented 8 years ago

Based on this, I strongly object to synchronics. Let’s use futexes instead.

Noted. Will get back to this issue next week, won't do anything radical with the spec in the mean time.

ogiroux commented 8 years ago

I have just finished updating everything about the C++ version of synchronic. That is still hosted here: https://github.com/ogiroux/synchronic.

The latest C++ paper (post-JAX version) is the file p0126r2.pdf (https://github.com/ogiroux/synchronic/blob/master/p0126r2.pdf). The wording is significantly improved over the last version.

The implementation is completely new. It has reasonable tuning for {Linux, OS X, Win7-, Win8+, and generic C++} but could still benefit from tweaking the magic numbers for each case.

Data is now provided in the form of the file data.pdf (https://github.com/ogiroux/synchronic/blob/master/data.pdf) (graphs) and *.csv files (numbers). I tried to measure the specific things that Pizlo asked for, in this thread. I did not put error bars on the graphs but (1) in the test code (https://github.com/ogiroux/synchronic/blob/master/test.cpp) I controlled for a large number of noise sources, and (2) anecdotally most results seemed to stay within 5% run-to-run.

Finally, I would like to re-iterate that the primary goal of synchronic is to help engineers write portable code to wait for a memory location to meet a condition. If there is a "well-known" set of best-practices around Futex, then synchronic aims to encapsulate that knowledge. On the premise that I didn't just get lost in the weeds, a quick scan through the include/synchronic header shows that there are a lot of best-practices and platform-specific concerns to be encapsulated.

Please feel free to reach out and ask me for more information.

pizlonator commented 8 years ago

On Mar 22, 2016, at 1:21 PM, ogiroux notifications@github.com wrote:

I have just finished updating everything about the C++ version of synchronic. That is still hosted here: https://github.com/ogiroux/synchronic.

The latest C++ paper (post-JAX version) is the file p0126r2.pdf (https://github.com/ogiroux/synchronic/blob/master/p0126r2.pdf). The wording is significantly improved over the last version.

The implementation is completely new. It has reasonable tuning for {Linux, OS X, Win7-, Win8+, and generic C++} but could still benefit from tweaking the magic numbers for each case.

Data is now provided in the form of the file data.pdf (https://github.com/ogiroux/synchronic/blob/master/data.pdf) (graphs) and *.csv files (numbers). I tried to measure the specific things that Pizlo asked for, in this thread. I did not put error bars on the graphs but (1) in the test code (https://github.com/ogiroux/synchronic/blob/master/test.cpp) I controlled for a large number of noise sources, and (2) anecdotally most results seemed to stay within 5% run-to-run.

Can you explain what this is actually measuring?

I see bar charts. But what is 100%? 100% of what?

I will reply in greater detail in a bit, and that reply will focus on some analytical issues.

-Filip

Finally, I would like to re-iterate that the primary goal of synchronic is to help engineers write portable code to wait for a memory location to meet a condition. If there is a "well-known" set of best-practices around Futex, then synchronic aims to encapsulate that knowledge. On the premise that I didn't just get lost in the weeds, a quick scan through the include/synchronic header (https://github.com/ogiroux/synchronic/blob/master/include/synchronic) shows that there are a lot of best-practices and platform-specific concerns to be encapsulated.

Please feel free to reach out and ask me for more information.


ogiroux commented 8 years ago

What is measured by the test program is the system throughput, in isometric steps per second, under each condition and a matching control. What that program outputs is the overhead of a step of the primitive, in seconds, relative to the control. What is plotted in the PDF is the ratio of the overhead versus that of std::mutex, hence std::mutex is always 100%.

pizlonator commented 8 years ago

I believe that Synchronic is the wrong choice here. We need a synchronization primitive that enables the implementation of parallel algorithms. Synchronic doesn’t allow parallel algorithms. This is evident from both the spec and the existing implementations. We must not allow shared array buffers to preclude parallel execution.

Using a term of art, Synchronics don’t allow for data-access parallelism. The main feature of synchronics is that the “notify” operation is cheap which enables a simple programming model. But the cheapness of “update” is due entirely to the use of a single lock inside the synchronic object. The proposal for Synchronics in JS (https://axis-of-eval.org/shmem/synchronic-formatted.html) does not allow for the user to create more than one synchronic object (or one per shared array buffer), so all synchronic operations will be bottlenecked on a single global lock (or per-shared-array-buffer lock, which is just as bad). I don’t see how to allow for many synchronic objects that would be practical for asm.js. So, I don’t think we want synchronic.

Here’s the issue in greater detail. Synchronics and Futexes both have similar abstractions:

The key difference between Synchronics and Futexes is this:

It turns out that in order to achieve the ease-of-use, Synchronics preclude the implementation from using the trick that Futex implementations use. If you read the semantics of wait/wake, in both cases they are modeled “as if” there was a single global lock. If futexes really did use a global lock then this would kill parallelism: every time any thread took a lock slow path, it would serialize with all other threads taking lock slow paths. You hit Amdahl’s wall a lot sooner when you do this, so of course futexes do something different: they use a concurrent hashmap.

But using a concurrent hashmap means that “wake” can’t just do a quick test on some value in memory to determine if there is anyone waiting, the way synchronic does it: it needs to do a concurrent hashtable look-up first. In both of the implementations that I know about (Linux and WebKit), that concurrent hashmap lookup is quite heavy because it’s optimizing for reducing false sharing rather than for being called repeatedly in a loop. This is why futex “wake” is slow. It’s meant to be a scalable slow path, not a common fast path.

It’s the slowness of futex wake that acted as a forcing function on the programming model: it has to be the client’s responsibility to avoid calling wake, therefore it has to be the client’s responsibility to store bookkeeping about the potential number of waiters in the futex, and so therefore it’s the client’s responsibility to manage the spin loop that guards the call to “wait”.
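A minimal sketch of the contrast: a synchronic-style notify can be a plain load in the common no-waiters case, because the waiters flag lives right next to the value. The field layout and names here are hypothetical, not from either spec.

```javascript
// Hypothetical synchronic layout: the value at base+VALUE, a waiter
// count at base+WAITERS, side by side in the same Int32Array.
const VALUE = 0, WAITERS = 1;

function synchronicStore(i32a, base, v) {
  Atomics.store(i32a, base + VALUE, v);
  // Fast path: nobody is parked, so skip the expensive wake entirely.
  if (Atomics.load(i32a, base + WAITERS) === 0) return "fast";
  // Slow path: a real implementation would do a futex-style wake here,
  // which on Linux/WebKit involves a concurrent hash-table lookup.
  Atomics.notify(i32a, base + VALUE);
  return "slow";
}
```

A futex-style wake cannot make this single-load check, because the bookkeeping about waiters lives in the implementation's hash table keyed by address, not adjacent to the value.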

In other words, the primary optimization of futexes is not practically implementable in synchronics because synchronics encourage high-frequency calls into wake. If you tried to implement synchronics with a concurrent hashmap instead of a single global lock, then they cease to be practical: clients would have to avoid calling the slow path at all costs, just like they do now with futexes. Except that futexes make it easier to avoid calling wake, since they decouple storing values into futexes from waking. This implies that synchronics will always be implemented with a single lock per synchronic object, which translates to a single global lock for asm.js programs or any ES parallel code written in terms of a single shared array buffer.

Let me put it in simple terms: if you implement pthread_mutex_lock/unlock on top of a single Synchronic object that is shared by all users of an array buffer, then every call to unlock will contend with every other call to unlock, even if they are unlocking different mutexes. That’s totally bogus and we don’t want that. I’m surprised nobody on this thread has noticed this huge problem until now!

So: if we use synchronics as the fundamental notification primitive in this shared memory proposal, then we lose parallelism. Anytime someone calls “unlock”, the per-shared-array-buffer synchronic will do some kind of atomic RMW operation on its “are there threads waiting” field, which means that all unlock calls within that shared array buffer will be serialized. That’s not cool.

I think that this also explains why the synchronic spec doesn’t have the kind of perf data that you’d expect from a locking paper - synchronics were never meant as a replacement for lower-level primitives when implementing production-quality locks. On the other hand, clients of ES will want a production-quality lock, so regardless of whether synchronics are provided as an alternative higher-level primitive we must also ensure that ES has a high-performance lower-level notification primitive that things like pthread_mutex could use.

By the way, it’s possible to implement a futex-like primitive in userland. Here’s WebKit’s userland implementation: http://trac.webkit.org/browser/trunk/Source/WTF/wtf/ParkingLot.h and http://trac.webkit.org/browser/trunk/Source/WTF/wtf/ParkingLot.cpp

I have some sympathy for the argument that futexes are hard to use. I don’t believe that futexes are the right high-level synchronization primitive for users to use directly. But the most compelling short-term use case of shared array buffers is compiling C code that uses pthreads to asm.js. This means that the most common user of shared array buffer’s synchronization primitives will be pthread_mutex_lock/unlock. Those implementations will already know how to use futexes and they will definitely expect that locking and unlocking different mutexes on different threads will proceed in parallel. We don’t want to break this data-access parallelism, which is why we don’t want Synchronic.

I propose that we ignore the performance numbers of Synchronic as they are irrelevant. If we come forward with a proposal to enable concurrency in ES and that proposal precludes parallel programming then we will have a bad time.

-Filip

On Mar 17, 2016, at 4:15 AM, Lars T Hansen notifications@github.com wrote:

Based on this, I strongly object to synchronics. Let’s use futexes instead.

Noted. Will get back to this issue next week, won't do anything radical with the spec in the mean time.


pizlonator commented 8 years ago

OK, that’s what I expected. See my last message. I believe that the problem here runs much deeper than these perf numbers.

-Filip

On Mar 22, 2016, at 1:39 PM, ogiroux notifications@github.com wrote:

What is measured by the test program is the system throughput, in isometric steps per second, under each condition and a matching control. What that program outputs is the overhead of a step of the primitive, in seconds, relative to the control. What is plotted in the PDF is the ratio of the overhead versus that of std::mutex, hence std::mutex is always 100%.


ogiroux commented 8 years ago

The object specified in the C++ paper is intended to be used at a fine granularity (in limit, one instance per synchronization object) and not as a singleton covering for all of memory. Using one synchronic object with many atomic objects, while explicitly supported, results in the false sharing issue that you point out. There seems to be a large mismatch of assumptions on that.

If the key use-case for the TC39 version (with which I am less familiar) requires using a single instance covering for all of memory, then both the implementation and the interface would need to change to address this. That much seems clear. I think your assessment of the key issue is correct.

Assuming that the TC39 motivation for using a single instance is primarily to polyfill Futex, then the implementation and interface work to address that specific case seems tractable. This being said, if there is scarcely a second use in sight then you should probably just adopt Futex and save the effort.

pizlonator commented 8 years ago

@ogiroux The issue is that shared array buffers are the only objects that will be shared in the initial proposal. Arranging to share a shared array buffer will require a bit of an ordeal for the client. Most likely clients will arrange to only share one shared array buffer, and then this object will encapsulate all shared state.

@lars-t-hansen had a proposal for Synchronic in shared array buffer that links one synchronic object to one shared array buffer, which means that in the envisioned use case of shared array buffers, you'd have false sharing.

So, it's not about the use-case being Futex. The problem is that Synchronic will always impede parallelism so long as the user gets only one. If you did try to make Synchronic more amenable to being shared while preserving data-access parallelism then "notify" will become more expensive, so the client will have to avoid calling it, which means that the client will need to write their own wait loop, which means that the API would be no more intuitive than Futexes.

This situation would be very different if Synchronic objects were also shared. But that seems hard to enable. It's already a big deal that this spec introduces one shared object. Introducing more of them seems like a lot of complexity. Probably the complexity of writing software that shares multiple shared array buffers and/or multiple synchronic objects in order to get parallelism is higher than the complexity of writing software that shares just one shared array buffer and needs to create its own shim around futexes.

Also, I think that if the spec gives futexes as a primitive, then it should also contain informative text that shows an example of a mutex implemented using futexes and suggests that users should use it and that implementers should ensure that this mutex runs fast.
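For concreteness, the kind of informative example suggested above could be a sketch of the canonical Rusty-style three-state mutex (0 = unlocked, 1 = locked, 2 = locked with waiters) already mentioned in this thread. The `Atomics.wait`/`Atomics.notify` spellings stand in for the draft's futexWait/futexWake; treat this as an illustration, not spec text.

```javascript
// Rusty-style futex mutex on one element of a shared Int32Array.
// States: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
function lock(i32a, i) {
  let c = Atomics.compareExchange(i32a, i, 0, 1);
  if (c !== 0) {
    do {
      // Mark the lock as contended, then sleep only while it stays held.
      if (c === 2 || Atomics.compareExchange(i32a, i, 1, 2) !== 0) {
        Atomics.wait(i32a, i, 2);
      }
      // Re-acquire in the contended state, since others may still wait.
    } while ((c = Atomics.compareExchange(i32a, i, 0, 2)) !== 0);
  }
}

function unlock(i32a, i) {
  if (Atomics.sub(i32a, i, 1) !== 1) {
    // State was 2: there may be waiters. Fully release, then wake one.
    Atomics.store(i32a, i, 0);
    Atomics.notify(i32a, i, 1);
  }
}
```

The uncontended path is a single compare-exchange on lock and a single atomic subtract on unlock; wait and notify are only reached under contention, which is exactly the futex usage discipline discussed above.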

ogiroux commented 8 years ago

It seems the key difference is that user code is allowed to use storage in the shared array buffer for its acceleration data structure, but synchronic is not. This allows a user to roll his own synchronic, effectively, but prevents TC39 from providing one. Have you envisioned that any objects provided by TC39 might want to be allocated space (by the user or system, whichever) into the shared array buffer?

pizlonator commented 8 years ago

@ogiroux I think that the long-term solution is to make everything in ES have some sharing semantics. Then, ES would most likely provide a native Mutex object. This would somewhat obviate the need to have either synchronic or futex, especially if VMs strive to implement their mutexes to be faster than std::mutex.

(As an aside, something that is a major concern to me about the Synchronic work in general is that you're comparing against the std::mutex strawman. System mutexes are really slow, and people who want performance inevitably implement their own. See for example WTF::Lock in WebKit, which is a lot faster than std::mutex in almost every use case, and certainly all of the really important ones like uncontended critical section.)

The specific idea of having the spec allow some ES objects to be allocated inside a shared array buffer is unlikely to be practical. ES currently doesn't make any promises about the shape of objects or even where they reside - they could be in registers, in RAM, on a disk, or on Mars. The spec doesn't promise. This is crucial to ensuring the security properties of ES, which forbid the user from mucking with the VM's bits. For this reason, we're unlikely to allow native ES objects to be placed in array buffers.

ogiroux commented 8 years ago

On the strawman nature of std::mutex : I think I'm fine with that because during this work I have already caused some shame to one vendor, who happened to improve their std::mutex tremendously in the next release. Everyone got a win out of that. Would be nice if that happened on OS X too. :^)

On the allocation idea : I didn't mean to say that the synchronic object needed to live in the shared array, but I wanted to point out that the user's one special ability was the ability to use words in that memory. It might make for an ugly API but as a thought experiment, imagine if calls of synchronic member functions were provided with locations in that memory to use for acceleration. Then the overarching goals could still be met within the envelope -- but until someone has a clever idea the API would be ugly.

pizlonator commented 8 years ago

@ogiroux I think that such an API would be harder to use than the futex API.

jfbastien commented 8 years ago

IIUC @pizlonator your main objection is that the spec currently limits 1 synchronic per SAB?

A developer can of course create multiple SABs, but that's silly. What about SAB+int?

pizlonator commented 8 years ago

On Mar 22, 2016, at 4:22 PM, JF Bastien notifications@github.com wrote:

IIUC @pizlonator your main objection is that the spec currently limits 1 synchronic per SAB?

Even if this objection is addressed, we still have the issue that synchronics lead to slower mutexes than futexes do.

A developer can of course create multiple SABs, but that's silly. What about SAB+int?

How is that better or different than futexes?

Are you proposing that every client will preallocate a pool of synchronics and then hash mutex addresses to select some synchronic? That sounds like what futexes do for you already, except than futexes do it a lot better.

If you have N synchronics then you have to choose between some ugly solutions:

1) every call to lock/unlock hashes the lock address to compute a synchronic index. This slows down uncontended critical sections, probably by a lot more than we want.

2) use synchronics like you use futexes: only access the synchronic on the slow path of your own locking algorithm. This eliminates any advantages of synchronic since now the client must write their own wait loop.

Both approaches require the client to manage their own concurrent hashing of synchronics, which is something that futexes do for you. The first approach is hella slow and the second approach requires the client to implement their own futex-style concurrent hashing.
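The first option might look like the following sketch. `POOL_SIZE` and the hash function are hypothetical; the point is that this hashing step sits on the uncontended fast path of every lock/unlock.

```javascript
// Sketch of option 1: a fixed pool of POOL_SIZE synchronic slots, with
// every lock address hashed to a slot index on each lock/unlock call.
const POOL_SIZE = 64;

function synchronicIndexFor(lockAddr) {
  // Multiplicative hash (Knuth's constant); any mixing function would do.
  // Math.imul keeps the multiply in 32-bit integer arithmetic.
  return (Math.imul(lockAddr, 2654435761) >>> 0) % POOL_SIZE;
}
```

Distinct locks can still collide on the same slot, so this scheme trades the kernel's concurrent hash map for a user-managed, lossy one, which is the objection raised above.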

I still believe that this is the wrong approach. We should stick with futexes.


ogiroux commented 8 years ago

I share the doubt that the strawman approach just outlined is a good one.

One would probably either allocate exactly as many instances of synchronic state as needed (however that is done with the api) or would treat it the same as Futex by e.g. considering notify() to be expensive. It follows that if you can only have one copy of synchronic state and that is not the exact number you need, then you would treat it as being expensive.

There is already a way to tell synchronic that you think it's expensive in the C++ paper p0126r2 (note: this is the one part that is handled slightly differently in the implementation, which I think is actually further to the benefit of this type of use). When you use this option, it basically becomes a pass-through to Futex on platforms that have it. On platforms that don't have Futex, it has the freedom to desugar into something that still costs less than a Futex polyfill, at least in part because "notify one" is always only a hint for synchronic. On your platform it could literally use your parking lot.

Still, it's better if you can have enough state so that you can leave it on full-automatic. It's too bad that is difficult to do with the TC39 proposal.

taisel commented 8 years ago

I think the underlying issue here is that synchronic is only one tool, and not always the best tool to use. That being said, there are numerous cases where it is the best tool, but the older, more familiar futex and mutex tools have their own use cases as well.

The question should then be: what is there against having more than one "tool" available? I see the primary way to avoid misuse being properly made tutorials on these APIs, rather than limiting the number of tools in the tool shed.

lars-t-hansen commented 8 years ago

The issue that the SharedArrayBuffer is the only shared object type was known from the start, see my earliest remarks on this bug and the reason why I was resistant to synchronics at the time. Futexes were chosen at the outset of the SAB project explicitly because there is no object management to worry about other than picking a word in memory. (The performance issues described above were not a consideration - or at least not to my knowledge - only the fit to the problem.) Futexes were also an obvious good fit for implementing pthreads, as Filip also points out.

The JS proposal for synchronics was developed to stay within the constraint of being able to share only SharedArrayBuffer objects, and not allowing the internal state of the synchronic to be visible in the SharedArrayBuffer. As Filip noted above, this forces an internal representation where data are managed behind the program's back by the implementation of synchronics. For the small thread counts that we're currently seeing this might not have been a terrible problem but it's clearly true that it would not scale well.

I think it's time to put synchronics aside for now and move forward with the futexes we have - they are probably the better fit for the constraints we have.

And I apologize for not making this excursion properly data-driven. One data point is hardly sufficient.

(The discussion opens up a couple of issues around the TC39 futexes too, eg whether the fairness baked into them will get in the way of parallelism and whether it will now be useful, as very early versions of the spec had it, to expose relax/yield/pause type primitives to aid fast locks. But I will open separate issues for those.)

taisel commented 8 years ago

Aww, the API was flawed on a low level? No possibility of providing their benefit over futexes in another proposal, or will that be rolled into futexes as well? I'm still looking forward to platform specific spins being hidden behind an API.

Edit: I agree about leaving futexes in, but I'm wondering what's blocking attempts to cover all optimal threading cases in further API additions if synchronics in the current proposal is unfit.

ogiroux commented 8 years ago

I don't disagree with the outcome here given the context. I would like to end on a summary of why the outcome is right.

For efficient waiting, the combination of user + platform code needs to use some byte(s) of shared memory to decide if the slow path or the fast path should be taken. Also, the association between that state and the synchronization variable has to be O(1), which usually means they are side-by-side. As things stand with SAB, only the user can use the SAB's memory, and in particular emplace "objects" there.

The consequence is that only the user can implement waiting prior to the slow-path-taken decision. A user could write synchronic on his own, with state in the SAB, but one can't be provided for him.

@taisel: if my conclusion is correct, then it casts a shadow over any prospective encapsulation of waiting that helps programmers use futex well.

@lars-t-hansen: I don't see how you could avoid relax/yield/sleep. You are set on a path where you'll need them. Wouldn't you say?

jfbastien commented 8 years ago

Right, synchronic would need to be an actual JS object, not magically tied to a SAB, to be efficient.

@lars-t-hansen IIUC that wasn't considered because of asm.js? If we ignore asm.js and look at SAB from a pure JS perspective, would it make sense to have synchronic then?

pizlonator commented 8 years ago

@jfbastien @lars-t-hansen I think it comes down to what is easier to reduce to what. You can implement synchronic on top of futex, or you can implement futex on top of synchronic. But if you implement futex on top of synchronic (or, rather, a bunch of synchronics) then unless you do some tricks you will have bad performance. Probably you'll have to implement your own concurrent hash map. On the other hand implementing synchronic on top of futex is less difficult.

From that standpoint, even if asm.js wasn't the primary use case, we would still want futex over synchronic. Futex is the better "kernel" feature, since it's easier to reduce other primitives to it. Synchronic is something you add when you already have other stuff.

(Thinking bigger, I think that in future languages any memory location that is shared should support the following primitives in addition to read/write:

This would mirror the way it works in C if you have a futex library lying around.)