smarty-prototypes / go-disruptor

A port of the LMAX Disruptor to the Go language.
Apache License 2.0
1.37k stars 212 forks source link

Eliminate race condition in the AMD64 cursor #5

Closed kellabyte closed 5 years ago

kellabyte commented 6 years ago

The cursor implementation in cursor_amd64.go does not contain any atomic operations that you see in cursor_386.go or cursor_ARM.go. It's relying on atomic instruction guarantees that AMD64 CPU's provide. The problem is the Go compiler can apply optimizations that could make relying on this unsafe.

With optimizations we have no guarantee that the value will be stored when we call Cursor.Store() or even that Cursor.Read() is always reading the latest value. Cursor.Read() may load it once in a register and never load memory again. The code might work now, but future improvements to the gc may cause this code to break

The atomic Store/Load operations guarantee the memory write / read operations are linearizable, allowing the value to be observed by concurrent goroutines.

Proposal

I think cursor_AMD64.go should include atomic operations. If I make the cursor_AMD64.go the same as the others I see what I consider acceptable performance change for something that we can rely on being safe. I think for safety purposes atomic operations should be used.

I'm would be happy to provide a PR with the changes.

Benchmarks

Benchmarks were run 10 times and compared with benchstat.

go test -bench=. -count 10 >before.log
go test -bench=. -count 10 >after.log
benchstat before.log after.log
name                                     old time/op    new time/op    delta
SharedWriterReserveOneContendedWrite-8     66.3ns ± 2%    65.8ns ± 0%    -0.76%  (p=0.014 n=9+8)
SharedWriterReserveManyContendedWrite-8    3.45ns ± 1%    3.39ns ± 1%    -1.65%  (p=0.000 n=10+10)
SharedWriterReserveOne-8                   23.8ns ± 1%    22.5ns ± 1%    -5.50%  (p=0.000 n=10+10)
SharedWriterReserveMany-8                  2.85ns ± 1%    2.69ns ± 2%    -5.54%  (p=0.000 n=10+10)
WriterAwaitOne-8                           6.35ns ± 3%   14.43ns ± 1%  +127.32%  (p=0.000 n=10+10)
WriterAwaitMany-8                          0.89ns ± 0%    1.48ns ± 1%   +66.71%  (p=0.000 n=8+8)
WriterReserveOneMultipleReaders-8          12.6ns ± 0%    16.0ns ± 1%   +26.09%  (p=0.000 n=10+10)
WriterReserveManyMultipleReaders-8         6.71ns ±25%    8.63ns ±12%   +28.71%  (p=0.000 n=10+10)
WriterReserveOne-8                         10.5ns ± 5%    15.4ns ± 1%   +47.58%  (p=0.000 n=10+9)
WriterReserveMany-8                        1.02ns ± 1%    1.50ns ± 3%   +47.98%  (p=0.000 n=10+10)

name                                     old alloc/op   new alloc/op   delta
SharedWriterReserveOneContendedWrite-8      0.00B          0.00B           ~     (all equal)
SharedWriterReserveManyContendedWrite-8     0.00B          0.00B           ~     (all equal)
SharedWriterReserveOne-8                    0.00B          0.00B           ~     (all equal)
SharedWriterReserveMany-8                   0.00B          0.00B           ~     (all equal)
WriterAwaitOne-8                            0.00B          0.00B           ~     (all equal)
WriterAwaitMany-8                           0.00B          0.00B           ~     (all equal)
WriterReserveOneMultipleReaders-8           0.00B          0.00B           ~     (all equal)
WriterReserveManyMultipleReaders-8          0.00B          0.00B           ~     (all equal)
WriterReserveOne-8                          0.00B          0.00B           ~     (all equal)
WriterReserveMany-8                         0.00B          0.00B           ~     (all equal)
dallbee commented 6 years ago

It's worse than that: there's similar assumptions in other locations in this code base. go test -bench=. -race reveals data races in cursor, reader, shared_disruptor, and shared_writer.

I started trying to fix them all, but it turned out to be a bit of a rabbit hole. I'm concerned that this pattern may lose most of its advantages due to the only guarantee the go compiler makes being the happens-before relationship within a goroutine.

I may try a smaller proof of concept which involves only the spsc case and see if I can recreate the memory barrier without resorting to CAS operations and satisfy the race detector.

lthibault commented 5 years ago

What's the word on this?

joliver commented 5 years ago

We are not actively pursuing this project. We welcome pull requests.

lthibault commented 5 years ago

@joliver Noted, thanks :)

Nevertheless, @dallbee, I'm unable to replicate the race condition you mention. Has this been fixed in the meantime or am I just not getting "lucky" with Go's race detector?

dallbee commented 5 years ago

@lthibault I can no longer reproduce the behavior on go 1.11 against Linux 4.19.8-arch1-1-ARCH x86_64.

joliver commented 5 years ago

@kellabyte @dallbee I am interested in resurrecting this project and I'm comfortable with the pattern. The pain point that I'm running into is the "happens-before" guarantees between go-routines without using assembly-style memory fences. I'd love to figure out how to better work with lfence and rfence so that I don't create weird concurrency bugs.

joliver commented 5 years ago

@kellabyte I went back to basics for a bit and stripped out all the excess. I believe I've got it working with a single writer and sets of consumer groups.

A consumer group is a set of consumers that can work independently of each other at the same time.

When there are multiple groups of consumers, each set gates on the previous set.

joliver commented 5 years ago

By completely removing the SharedWriter, I believe I have resolved all of the race conditions in the code. I'm stilling looking at ways to implement the SharedWriter in a way that's compatible with the Go memory model and without me having to write CPU-specific assembly code.

For now, I'm going to consider this issue complete. Again, I'm more than happy to have a discussion about how to implement various structures to optimize the implementation further and to work out any kinks.