tc39 / proposal-atomics-microwait

Micro waits in JS
http://tc39.es/proposal-atomics-microwait/
Other
49 stars 1 forks source link

Hint parameter: keep or remove? #9

Closed syg closed 2 months ago

syg commented 4 months ago

In the July TC39, concerns were raised about the hint parameter being confusing.

Motivation

The high-level goal of the hint parameter is a best effort attempt to keep the amount of the time a spin-wait loop would wait the same in the interpreter as it is in the JITs. In the interpreter, JS call overhead is very high, while in the JITs, an Atomics.pause call can be inlined into a single pause or yield instruction.

The intention is that the following loop snippet would execute roughly the same number of pause instructions no matter what tier it's executing in.

for (let i = 0; i < maxSpinCount; i++) {
  if (TryLock()) return;
  Atomics.pause(i);
}

In the interpreter, Atomics.pause(i) would execute only a single pause instruction. Once the loop is optimized by the JITs and the Atomics.pause call inlined into a single pause, the optimizing compiler may choose to use i to execute multiple pause instructions, possibly with a backoff strategy. The backoff strategy is OS and architecture dependent.

In a lower-level language like C++, user code would express any such backoff algorithms directly with inline assembly. But given JS is much higher level with much more variable performance, this is not expressible at the performace granularity that you want.

Option 1: Keep the implementation-defined hint

This option keeps the built-in as proposed.

The parameter iterationNumber is a monotonically increasing integer that hints the VM "how many times has this code paused in this spin-wait loop". The VM can then choose to interpret that integer however it wants: it may decide later iterations should waiter longer; it may decide later iterations should wait shorter; it may ignore it entirely.

There is concern that this is ripe for confusion. What if someone writes a spin-wait loop with a decreasing loop counter, for example?

Option 1a: Specify that the parameter has a monotonic increasing relationship with wait time

To avoid confusion but to keep the hint, we can nail down the expectation: larger hint integers mean longer pause times, with a static implementation-defined static cap.

I want to warn though that even if we have such language that makes the expectation clear, since this is only timing and not observable behavior, technically any interpretation of the hint parameter would still be compliant and allowed as an as-if implementation, since there's no way to test for non-compliance.

Option 2: Remove the hint

This option changes the built-in to Atomics.pause() without any parameters.

The goal above would need to be met by internal heuristics in engines. While not impossible, it will be more prone to performance cliffs. However, the advantage is that user code would not misuse the hint parameter.

Champion's preference

~My preference is Option 1, but I am warming up to Option 2, given the confusion delegates have expressed as a proxy for developer confusion.~

Honestly I'm okay with any of the options. Mild preference for Option 1 or 1a?

cc @waldemarhorwat @msaboff @kmiller68 @rbuckton

rbuckton commented 4 months ago

To clarify how this works in .NET, Thread.SpinWait(iterations) is the closest parallel to Atomics.pause(iterationCount). Thread.SpinWait() essentially performs a tight loop of roughly n iterations that issue platform/architecture-specific yield or pause instructions to the CPU per each iteration. This is generally very loosely specified as the actual implementation details are dependent on whether the system supports multiple cores, and whether the CPU architecture supports yield, pause, Zihintpause, rep; nop, dbar 0, or some other equivalent.

In .NET, any backoff calculations are handled by a separate SpinWait struct that maintains is own internal spin counter and can periodically perform either yield or Thread.Sleep() operations based on how many times you spin, and calls Thread.SpinWait() under the covers.

The simplest choice is to match something like Thread.SpinWait(iterations), where a larger iterations value indicates a longer wait, and leave any backoff calculation to user code. This aligns with option 1a, above, as well as #11.

msaboff commented 4 months ago

I have a few concerns with the proposed Atomics.pause([interationNumber]) method and its parameter.

First off the interationNumber parameter would likely have different semantics in each implementation as well as in different tiers or on different CPUs for the same implementation.

Then there is the Atomics.pause() method itself. (Sorry, I must not have been awake / attentive during the last meeting when it was proposed). It implies that the method would cause either X86 pause instructions or the somewhat equivalent ARM 'yield" instruction. Although they are designed for similar uses, these instructions are not quite the same. IIUC, the pause instruction was initially designed to delay hyper thread switching and later they added nop semantics with a defined and long execution time. On ARM, yield is a hint that higher level software might use to switch threads. In WebKit, we call thread_switch() or sched_yield() instead of executing pause or yield instructions in various spin lock contended slow paths. Also, with pause, yield or the various flavors of thread switching, we are telling the CPU / OS to at least momentarily switch away from the current, i.e. main thread.

Again, I have the concern that the use of this method might become more of a detriment than an asset. Given the confusion during the plenary meeting today, how can we expect that a JS dev will be able to effectively use this method such that is a benefit across multiple implementations?

If I had to pick among the presented options, my first choice is option 2, then much lower in my mind option 1a. I am totally against option 1 as it is counter intuitive and provides no normative behavior as to how to use the parameter both in an implementation or as a JS dev. Again, WebKit uses something like option 2, with various backoff strategies for contended locks. After yielding to another thread, we check something before yielding again. I'm not sure what to do with Atomics.pause([interationNumber]) if it would map to yield interationNumber times, i.e. switch away from this thread N times and then continue execution.

For the method itself, do the use cases your thinking of use methods beside Atomics.compareExchange() and one of the Atomics.wait() variants with the pause() method? If so, that would be good to know. If not, why not have higher level methods like Atomics.spinlock() and/or Atomics.lock()?

syg commented 4 months ago

IIUC, the pause instruction was initially designed to delay hyper thread switching and later they added nop semantics with a defined and long execution time. On ARM, yield is a hint that higher level software might use to switch threads. In WebKit, we call thread_switch() or sched_yield() instead of executing pause or yield instructions in various spin lock contended slow paths. Also, with pause, yield or the various flavors of thread switching, we are telling the CPU / OS to at least momentarily switch away from the current, i.e. main thread.

The ARM docs reads to me as exactly analogous to pause. AFAIU hinting the CPU is different than hinting the OS. This API is about hinting the CPU that the current thread is hammering on some particular piece of memory to try to get a lock, so the the CPU release its hardware locks on some memory units without giving up occupancy of the core and actually switch the executing thread. Hinting the OS is more about actually switching the executing thread.

In WebKit, we call thread_switch() or sched_yield() instead of executing pause or yield instructions in various spin lock contended slow paths.

Interesting! Then I suppose JSC on Apple M-chips would do nothing for Atomics.pause() if it has no benefit. Chrome's PartitionAlloc's SpinningMutex does both the CPU yield (this proposal) and sched_yield. First it pauses the CPU while trying to acquire. If it can't acquire lock after a bounded number of pauses, then it yields the thread. If it can't acquire the lock after a bounded number of thread yields, it puts the thread to sleep.

I'm not sure what to do with Atomics.pause([interationNumber]) if it would map to yield interationNumber times, i.e. switch away from this thread N times and then continue execution.

As above, I think if JSC sees no performance benefit in a yield instruction on Apple Silicon, the thing to do would be to do nothing for Atomics.pause, not map it to a thread yield, since that's not the intended use. That said I'm also not against exposing an Atomics.yield.

For the method itself, do the use cases your thinking of use methods beside Atomics.compareExchange() and one of the Atomics.wait() variants with the pause() method? If so, that would be good to know. If not, why not have higher level methods like Atomics.spinlock() and/or Atomics.lock()?

Because there are other kinds of spinwait loops, mainly, and Atomics is already about low-level building blocks. React has also spoken up about wanting this method -- and I don't think they're writing a traditional futex-like spinlock.

Wouldn't a Atomics.spinlock() need to encompass all the different backoff algorithms, different kinds of mutex implementations, etc. That seems like a strictly higher effort API to design for something the main user (Emscripten) wants to do manually, which seems like a poor fit.

If I had to pick among the presented options, my first choice is option 2, then much lower in my mind option 1a.

My read of your concern is the potential for confusion between the CPU pause and the OS yield. WDYT about expanding the scope here slightly to Atomics.pause([N]) and Atomics.yield()? There's in fact #6 already that's asking if we should have both.

syg commented 4 months ago

Ah, and the main question:

Again, I have the concern that the use of this method might become more of a detriment than an asset. Given the confusion during the plenary meeting today, how can we expect that a JS dev will be able to effectively use this method such that is a benefit across multiple implementations?

I expect the JS dev here to be an expert, just like the intended audience of the rest of the Atomics methods. I would give a higher likelihood to JS devs using this correctly than using the futex API (Atomics.wait) correctly, actually.

msaboff commented 4 months ago

If I had to pick among the presented options, my first choice is option 2, then much lower in my mind option 1a.

My read of your concern is the potential for confusion between the CPU pause and the OS yield. WDYT about expanding the scope here slightly to Atomics.pause([N]) and Atomics.yield()? There's in fact #6 already that's asking if we should have both.

I don't think expanding the number of methods improve things. Given that you suggest JSC turned Atomics.pause([N]) into a nop and I assume then you'd suggest Atomics.yield() be a call to a thread yield OS call for JSC, what would V8 or other implementations do with Atomics.yield()? What would we recommend to JS devs? Seems like this confuses the situation even further.

Ah, and the main question:

Again, I have the concern that the use of this method might become more of a detriment than an asset. Given the confusion during the plenary meeting today, how can we expect that a JS dev will be able to effectively use this method such that is a benefit across multiple implementations?

I expect the JS dev here to be an expert, just like the intended audience of the rest of the Atomics methods. I would give a higher likelihood to JS devs using this correctly than using the futex API (Atomics.wait) correctly, actually.

Expert JS devs could benefit from this method, but the confusion in plenary yesterday certainly suggests that a significant population of JS devs would get it wrong.

syg commented 4 months ago

I don't think expanding the number of methods improve things. Given that you suggest JSC turned Atomics.pause([N]) into a nop and I assume then you'd suggest Atomics.yield() be a call to a thread yield OS call for JSC, what would V8 or other implementations do with Atomics.yield()? What would we recommend to JS devs? Seems like this confuses the situation even further.

I expect all VMs to implement Atomics.yield as a thread yield. What we recommend to JS devs is the same recommendation that sched_yield carries: it's a hint to the OS and may do nothing depending on the scheduler. I don't think this is confusing for people working at that level. I agree it is confusing for someone unfamiliar with the space, but that's not the audience, right? Same for the futex API.

Expert JS devs could benefit from this method, but the confusion in plenary yesterday certainly suggests that a significant population of JS devs would get it wrong.

The nature of the confusion in plenary was about the relationship between the parameter and the wait time, which both option 1a and option 2 solve. It does mean option 1 is off the table.

rbuckton commented 4 months ago

IIUC, the pause instruction was initially designed to delay hyper thread switching and later they added nop semantics with a defined and long execution time. On ARM, yield is a hint that higher level software might use to switch threads. In WebKit, we call thread_switch() or sched_yield() instead of executing pause or yield instructions in various spin lock contended slow paths. Also, with pause, yield or the various flavors of thread switching, we are telling the CPU / OS to at least momentarily switch away from the current, i.e. main thread.

The reason you busy-wait in a spin-lock is to avoid switching away from the current thread when you're attempting to resolve contention quickly, as context switches are slow. The worst-case busy-wait is something like for(let i = 0; i < iterations; i++);, which is probably fine if you have many cores and no hyper-threading, but that causes starvation on that core. The point of an API method is that it can perform the same function as the for-based busy wait but also send hints to the CPU that can allow it to schedule work more efficiently when hyper-threading is enabled. Ideally, you spin using Atomics.pause(iterations) rather than for to be a better neighbor in HT, and let the implementation determine how best to be a friendly neighbor based on the underlying architecture.