Use process wide memory barriers for GC

vargaz commented 5 years ago

Right now, our gc related code, i.e. allocators/handles are full of memory barriers in order to support preempt mode. Instead of these, we could use compiler memory barriers (volatile), and have the GC thread do a process wide memory barrier: https://lore.kernel.org/patchwork/patch/188708/ https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

This will speed up and simplify the implementation of that code.

lewurm commented 5 years ago

In addition to that it would be nice to get rid of in_critical_region in SgenClientThreadInfo that is a set (+memory barrier) and unset (+memory barrier) in the allocator for each allocation.

set: https://github.com/mono/mono/blob/fd82c6036ec3720d568050c1e538f9729cef984e/mono/metadata/sgen-mono-ilgen.c#L348-L354 unset on slowpath before calling into runtime code: https://github.com/mono/mono/blob/fd82c6036ec3720d568050c1e538f9729cef984e/mono/metadata/sgen-mono-ilgen.c#L425-L431

unset on fastpath merged with publish of allocated object. We probably can't get rid of that completely: https://github.com/mono/mono/blob/fd82c6036ec3720d568050c1e538f9729cef984e/mono/metadata/sgen-mono-ilgen.c#L502-L514

I was thinking that in_critical_region is not necessary anymore with hybrid/coop suspend, but I'm not 100% sure on that yet.

filipnavara commented 5 years ago

Note that the membarrier syscall may not be available on older systems. This is likely to affect some of the ARM hardware that you are running. Additionally, it is not available on Android devices until Android 10 (https://github.com/aosp-mirror/platform_bionic/commit/6ba6694ec24052de9fdfa8a51c148aebe68a9fd0#diff-dc7191156f6b327ad0d4bae8f0e1a803) and it is explicitly blocked by SELinux on older versions even if the kernel supports it. liburcu has some alternatives for systems where membarrier is problematic but the performance penalty could be too high.

BrzVlad commented 5 years ago

I'm pretty sure we already make the assumption that when a thread is suspended, the suspending mechanism makes sure to have all the memory writes flushed. Otherwise card table marking would have never worked.

I don't see any memory barriers when storing in_critical_region (MONO_MEMORY_BARRIER_NONE!?). I agree that when using coop it is of no use. The only memory barrier that I'm aware of inside the allocator is the one used to publish the object contents, before returning it, which is probably needed only because outside code might be juggling with these references in a lock free fashion. I think this was the memory model that we used to provide, if you juggled with objects between threads in a lock free fashion we wouldn't internally crash due to invalid vtable, but that was it. Maybe we were doing this with objects also inside the runtime.

Can mono_handle_new run concurrently with a GC ? I'm guessing not and that those membars are not needed. However, if the thread enters GC Safe state, and the GC can just kick in at any time, there should be a memory barrier to flush all writes to the handle stack.

lambdageek commented 5 years ago

Can mono_handle_new run concurrently with a GC ?

(With preemptive suspend, I assume)

The code was meant to be resilient to being suspended at any point during mono_handle_new, so that when the GC is scanning the handle stack it either sees valid object references or handles with a NULL, never junk integers. But I assumed that if we're scanning the handle stack the thread is suspended at that point (ie we scan during a STW).

Although now that we switched to scanning the handle stack conservatively, it's less important that there are non-junk values there.

lambdageek commented 5 years ago

However, if the thread enters GC Safe state, and the GC can just kick in at any time, there should be a memory barrier to flush all writes to the handle stack.

They shouldn't be calling mono_handle_new from a GC Safe state. They need to switch to unsafe first (which will safepoint).

BrzVlad commented 5 years ago

Well, I assume only registered mutator threads should ever push to the handle stack, and all mutator threads are suspended during GC. So this means it can't run concurrently with a GC. And if the thread is suspended, based on what I said above, the memory of the suspended thread should have been flushed, regardless of those memory barriers.

lambdageek commented 5 years ago

Ok, that makes sense. In that case the barriers in mono_handle_new are not needed.

the suspending mechanism makes sure to have all the memory writes flushed.

So on mach and win32 we assume that suspend_thread and SuspendThread/GetThreadContext do that? And if we're using signals for suspend then signal delivery ought to do it? or should we do something explicitly in the signal handler?

BrzVlad commented 5 years ago

When using signals, we synchronize with some semaphores so we are ok. On mach and win32 we probably just assume, I don't think there is explicit documentation there, but it makes sense to be that way and we would have run into trouble long ago if it weren't the case.

vargaz commented 5 years ago

The code still needs at least compiler memory barriers. Also, I think the current barriers in the allocators etc. were added because of failures, I don't remember whenever the failures were due to compiler or cpu reordering.

BrzVlad commented 5 years ago

We could replace membars from handles code with compiler barriers, just to be sure.

For allocators, it's risky to remove membarriers altogether because I'm sure there are places in the runtime where we might pass around such objects without synchronisation. Allocators can't be interrupted by stw so it had to be cpu reordering.

mono / mono

Use process wide memory barriers for GC #17465