halide/Halide: a language for fast, portable data-parallel computation (https://halide-lang.org)

Replace spinlocks with mutexes in runtime modules #7485

slomp opened this issue 1 year ago (status: Open)

slomp commented 1 year ago

Title says it all.

- d3d12: https://github.com/halide/Halide/pull/7489
- metal: https://github.com/halide/Halide/pull/7532
- cuda:
- opencl:
- webgpu:
- gpu_device_selection:
- tracing:

steven-johnson commented 1 year ago

Adding some more information about why would be helpful here -- do we have data showing it would be a clear improvement? If so, why did we use spinlocks in the first place?

abadams commented 1 year ago

I believe when the spinlocks were used we didn't have good fine-grained lock primitives in the runtime and were dependent on heavier-weight OS locks. halide_mutex now adaptively spins, so it should be fine to replace spinlocks with it.

slomp commented 1 year ago

Spinlocks in user space are generally a bad idea, unless you are in a "real-time" priority scenario and can more or less ensure you won't be preempted, which is never really the case in time-sharing operating systems. Also, the spinlock implementation in scoped_spin_lock.h is super naive: it does not account for branch misprediction, fairness, excessive cacheline write contention, and so on. As Andrew pointed out, it should be used as a last resort.
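
To make that concrete, here is a minimal sketch (purely illustrative; not the code in scoped_spin_lock.h) contrasting a naive test-and-set loop with a test-and-test-and-set variant that spins on a plain load and issues a CPU pause hint, which at least reduces cacheline write contention while waiting:

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_pause
#endif

// Naive spinlock: every failed attempt is an atomic read-modify-write,
// so waiters keep bouncing the cache line between cores.
struct NaiveSpinLock {
    std::atomic<bool> locked{false};
    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            // busy-wait, hammering the cache line
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Test-and-test-and-set: spin on a plain load and only attempt the
// atomic exchange when the lock looks free; pause while waiting.
struct TTASSpinLock {
    std::atomic<bool> locked{false};
    void lock() {
        for (;;) {
            if (!locked.exchange(true, std::memory_order_acquire)) return;
            while (locked.load(std::memory_order_relaxed)) {
#if defined(__x86_64__) || defined(_M_X64)
                _mm_pause();  // hint that this is a spin-wait loop
#endif
            }
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};
```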

There's even a recent post/rant from Linus Torvalds about spinlocks in userspace: https://www.realworldtech.com/forum/?threadid=189711&curpostid=189723

zvookin commented 1 year ago

Spin locks are used to implement mutexes in fast_sync, so for that case, we need to replace them with futexes, where futexes are available. Probably also worth looking at the recent switch_to stuff in the Abseil mutex implementation for Linux.

The main reason for spinlocks originally was that mutexes required allocation and initialization, and were effectively a heavier dependency. Our mutex implementation is a small, fixed-size value that is zero-initialized. We should be able to make our mutex implementation superior to spinlocks in all cases that are under the same OS domain. (A spinlock, or other polled mechanism, may still be the only option at the ISA level between, say, a CPU and a GPU, or a DSP and a GPU, where shared memory is the only abstraction available.)
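
For reference, the shape being described is roughly the following (a sketch; the actual declaration lives in HalideRuntime.h and may differ in detail):

```cpp
#include <cstdint>

// A mutex that needs no allocation and no explicit initialization:
// a zero-filled value is a valid, unlocked mutex.
struct halide_mutex_sketch {
    uintptr_t _private[1];
};

// Static storage is zero-initialized, so this is ready to lock as-is.
static halide_mutex_sketch state_lock = {{0}};
```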

slomp commented 1 year ago

@zvookin Would it be possible to use the halide_mutex that scoped_mutex_lock.h uses? https://github.com/halide/Halide/blob/d03140864ad4b7816b89f450a90f5c0e1effad57/src/runtime/scoped_mutex_lock.h#L12

@abadams mentioned to me that it implements an adaptive mutex (it spins just for a bit, then defers to a mutex). CRITICAL_SECTION on Windows and futex on Linux are both adaptive mutexes (mutices?) already, AFAIK.
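
A minimal sketch of that adaptive pattern, with std::mutex standing in for the OS primitive (futex / CRITICAL_SECTION) and a made-up spin count:

```cpp
#include <mutex>
#include <thread>

// Illustrative adaptive acquire: spin briefly in user space, then block.
inline void adaptive_lock(std::mutex &m) {
    constexpr int kSpins = 40;  // arbitrary count for illustration
    for (int i = 0; i < kSpins; i++) {
        if (m.try_lock()) return;   // uncontended: acquired without blocking
        std::this_thread::yield();  // give up the timeslice between attempts
    }
    m.lock();  // contended: let the OS park this thread until the lock is free
}
```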

zvookin commented 1 year ago

Yes, halide_mutex is what I'm referring to as a thing that spins and should use something else, at least on some OSes, but obviously can't use itself :-). I think everything else can use halide_mutex. (And the reason is that it is a single pointer in size and zero-initialized. That's most of the reason we have fast_sync.)

I'm told that on modern mutex implementations for Linux, there is no spinning on the futex. If a locking operation is contended, the code immediately enters a path the OS is aware of. (I doubt it goes straight to a system call of course.)

I hope to revisit the fast_sync implementation at some point before the heat death of the universe, but would also be happy to collaborate with someone else on improving it. There is also the long-standing parallel execution issue. Ideally I'd like Halide to have best-in-class support for all of this. There are possibilities at the language level as well, e.g. a scheduling directive giving an explicit amount of parallelism to apply to a given loop.

slomp commented 1 year ago

Yeah, I am also not entirely convinced that spinning, even if just for a bit, is a good idea.

My understanding is that the main benefit of an adaptive mutex comes from keeping the "lock" in user space so as to avoid a context switch just to acquire it when it is not being held. In the uncontended scenario, the lock can be acquired in user space without any OS intervention. When there's contention, well, it's probably best to just let the OS scheduler know that we're waiting on a particular object and let the scheduler help us with that.

There's really no good general way to come up with an educated guess as to how many spins would be needed on average for an arbitrary case, so, when in doubt, why not 0 spins? Or is there a good heuristic that has been tested in all sorts of workloads and systems that I am unaware of?

Anyway, I think we should only offer halide_mutex, and if a particular target OS does not support a mutex, we can implement it under the hood as a spinlock. It can get nasty, as there's no real portable way to implement a "good" spinlock (something that is fair, does not throttle the cache line ownership, gives the CPU a chance to "breathe" a bit, and does not incur a massive branch misprediction when the lock is finally acquired).

abadams commented 1 year ago

It's easy enough to change the current implementation and rebenchmark. E.g. if I remove spinning from our mutex, the mullapudi2016_histogram test gets a full 4x slower (!) for both the manual and automatic schedule. I picked that benchmark because it uses fine-grained parallelism. Looking at the less extreme local laplacian app, removing spinning increases runtime from 26ms to 30ms for the manual schedule and from 33ms to 81ms for the automatic schedule. This is all on Ubuntu.

The big problem with our mutex is that it does a full sched_yield in the spin loop, which is great if you're the only one using the system and are trying to produce impressive benchmarking numbers, but a massive waste of CPU cycles in production scenarios. Doing something like replacing the sched_yield with the x86 pause instruction totally craters performance, giving similar slowdowns to the above.
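
For concreteness, the two spin-wait bodies being compared look roughly like this (a sketch, not the literal runtime code; sched_yield is POSIX-only here):

```cpp
#include <sched.h>      // sched_yield (POSIX)
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_pause
#endif

// Variant A: give up the timeslice on every failed attempt. Looks cheap
// when the benchmark owns the machine, but each call goes through the
// scheduler and burns CPU that other work could use in production.
inline void spin_wait_yield() {
    sched_yield();
}

// Variant B: stay on-core and just hint the CPU that this is a spin-wait.
// In the measurements above this performed far worse than yielding.
inline void spin_wait_pause() {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_pause();
#endif
}
```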

abadams commented 1 year ago

Also for reference, I believe the current system is based on this: https://webkit.org/blog/6161/locking-in-webkit/

slomp commented 1 year ago

Yup, I did browse that WebKit blog post, but the microbenchmarks seemed a bit contrived and artificial.

> if I remove spinning from our mutex the mullapudi2016_histogram test gets a full 4x slower (!) for both the manual and automatic schedule.

Ok, spinning it is, then! Did you try different spin count values too? One parallel thread per core (no oversubscription)? No other processes competing for the CPU while running the histogram program?

> The big problem with our mutex is that it does a full sched_yield in the spin loop.

That's a good point; Linus even mentions this in the forum message I referenced:

> adding random "sched_yield()" calls while you're spinning on the spinlock will not really help. It will easily result in scheduling storms while people are yielding to all the wrong processes.

Spinning is hard. Coming up with a good benchmark for it is even harder.

abadams commented 1 year ago

Yeah one thread per core and no other work happening on the machine. I have not tried different spin count values. Currently it's 40.

It's frustrating to me that the correct thing to do is apparently a lot slower on a meaningful whole-algorithm benchmark (local laplacian), than the thing Linus says is better. Reality isn't matching theory and I'm not sure why.

slomp commented 1 year ago

Linus is probably basing his claims on more grounded real-world scenarios reported in the wild, where there are other "busy" processes competing for a slice of the scheduler.

I suppose doing a bit of spinning is fine for us, since in the end we do end up doing the responsible thing and calling the OS for help. Any process using Halide is sort of expected to be CPU-intensive at times, and users will probably not have many other competing processes running alongside.

Let's just change the spin count from 40 to 42 ;)

In all seriousness, I wonder how spinning with pause for a bit, then a few more times with yield, before falling back to the mutex, would perform.
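
A sketch of that staged idea (the spin counts are placeholders, and std::mutex stands in for the OS-backed mutex):

```cpp
#include <mutex>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_pause
#endif

// Hypothetical three-stage acquire: pause-spin, then yield-spin, then block.
inline void staged_lock(std::mutex &m) {
    for (int i = 0; i < 16; i++) {  // stage 1: stay on-core, cheap per attempt
        if (m.try_lock()) return;
#if defined(__x86_64__) || defined(_M_X64)
        _mm_pause();
#endif
    }
    for (int i = 0; i < 4; i++) {   // stage 2: give other threads a turn
        if (m.try_lock()) return;
        std::this_thread::yield();
    }
    m.lock();                       // stage 3: let the OS park us
}
```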

slomp commented 1 year ago

For what it's worth, I found this article on spinlocks from the perspective of real-time audio synthesis: https://timur.audio/using-locks-in-real-time-audio-processing-safely

The article describes an implementation of a spinlock with exponential back-off that is allegedly suitable for "real-time" processing. The context is well defined and explained: one real-time-priority thread for the audio synthesis and one normal-priority thread for parameter changes and such, running on a system without any other processes contending for the scheduler and cores; only the normal-priority thread ever spins on the lock, while the real-time thread only tries to lock and moves on to do something else when it fails. This is quite different from our case here with something like mullapudi2016_histogram, but maybe there are some interesting tidbits there to take home.
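
For illustration, the exponential back-off idea looks roughly like this (a sketch of the general technique, not the article's exact code):

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // _mm_pause
#endif

// Spinlock with exponential back-off: each failed round doubles the number
// of pause hints before retrying, easing contention on the lock's cache line.
struct BackoffSpinLock {
    std::atomic<bool> locked{false};

    bool try_lock() {
        return !locked.exchange(true, std::memory_order_acquire);
    }
    void lock() {
        int backoff = 1;
        while (!try_lock()) {
            for (int i = 0; i < backoff; i++) {
#if defined(__x86_64__) || defined(_M_X64)
                _mm_pause();
#endif
            }
            if (backoff < 1024) backoff *= 2;  // cap the back-off
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```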

abadams commented 11 months ago

I'm trying to triage outstanding correctness bugs, so I'm removing the bug tag because this doesn't crash or produce incorrect output. Let's use the "performance" label for what we might call "performance bugs".