smol-rs / concurrent-queue

Concurrent multi-producer multi-consumer queue
Apache License 2.0
254 stars 22 forks source link

Fix fence on non-x86 arch and miri #16

Closed taiki-e closed 2 years ago

taiki-e commented 2 years ago

The problem seems to be that the original author of this code confused fence in the x86 hardware memory model with atomic fence in the C++ memory model. (On x86, lock cmpxchg; mov (load from memory) is fine. See also https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. On C++ memory model and many architectures, fence for load should be load; fence)

Fixes https://github.com/bevyengine/bevy/issues/5164 FYI @cbeuw

taiki-e commented 2 years ago

At least crossbeam and event-listener also have the same issue, but fixing them is probably more complex...

sbarral commented 2 years ago

I feel a bit uncomfortable with this commit.

Admittedly, I don't know what exactly is the role of the fence here. This fence does not exist in Dmitry Vyukov's original implementation of the queue, so I guess it was added as part of the modifications that ensure that this queue is linearisable (unlike the original queue).

That being said, if the cross-platform solution is indeed to place the load before the fence (this, I do not know) then I am pretty sure that the intel specialization that uses a lock operation instead of an mfence should also keep the load before.

I did look at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html but could not see where it states that lock + mov (in this order) is equivalent to mov + mfence. In fact, the latest GCC does use the lock optimization and definitely preserves the order, i.e. mov + lock (see this godbolt: https://godbolt.org/z/o3rYdTvYv).

RalfJung commented 2 years ago

That being said, if the cross-platform solution is indeed to place the load before the fence (this, I do not know) then I am pretty sure that the intel specialization that uses a lock operation instead of an mfence should also keep the load before.

I would usually expect that to be the case -- a relaxed load followed by an acquire-or-stronger fence can induce a synchronization edge. But I don't know the context for this particular code.

Does something break, or perf go down badly, if the fence is moved after the load?