open-simh / simh

The Open SIMH simulators package
https://opensimh.org/

ThreadSanitizer: data race #253

Closed bscottm closed 11 months ago

bscottm commented 1 year ago

Reported in the vax730 simulator, and occurs in others while running the thread sanitizer on Ubuntu:

==================
WARNING: ThreadSanitizer: data race (pid=16938)
  Write of size 4 at 0x5587111acd78 by thread T3 (mutexes: write M0):
    #0 sim_aio_activate /home/runner/work/open-simh/open-simh/scp.c:435 (vax730+0x10d1e0)
    #1 _sim_activate /home/runner/work/open-simh/open-simh/scp.c:12007 (vax730+0x13e3ba)
    #2 sim_activate /home/runner/work/open-simh/open-simh/scp.c:11999 (vax730+0x13e34b)
    #3 _disk_io /home/runner/work/open-simh/open-simh/sim_disk.c:266 (vax730+0x16c97e)
  Previous write of size 4 at 0x5587111acd78 by main thread:
    #0 sim_instr /home/runner/work/open-simh/open-simh/VAX/vax_cpu.c:658 (vax730+0x3818b)
    #1 run_cmd /home/runner/work/open-simh/open-simh/scp.c:9319 (vax730+0x1328ec)
    #2 vax730_boot /home/runner/work/open-simh/open-simh/VAX/vax730_sys.c:484 (vax730+0x89ef7)
    #3 do_cmd_label /home/runner/work/open-simh/open-simh/scp.c:4210 (vax730+0x11795b)
    #4 call_cmd /home/runner/work/open-simh/open-simh/scp.c:5630 (vax730+0x11e65f)
    #5 do_cmd_label /home/runner/work/open-simh/open-simh/scp.c:4210 (vax730+0x11795b)
    #6 do_cmd /home/runner/work/open-simh/open-simh/scp.c:4034 (vax730+0x115ede)
    #7 main /home/runner/work/open-simh/open-simh/scp.c:2938 (vax730+0x10e7ff)
  Location is global 'sim_asynch_check' of size 4 at 0x5587111acd78 (vax730+0x00000038dd78)
  Mutex M0 (0x7b4800000088) created at:
    #0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1220 (libtsan.so.0+0x4a616)
    #1 sim_disk_set_async /home/runner/work/open-simh/open-simh/sim_disk.c:594 (vax730+0x16e764)
    #2 _sim_disk_io_flush /home/runner/work/open-simh/open-simh/sim_disk.c:956 (vax730+0x171212)
    #3 sim_disk_reset /home/runner/work/open-simh/open-simh/sim_disk.c:3615 (vax730+0x17f015)
    #4 rq_reset /home/runner/work/open-simh/open-simh/PDP11/pdp11_rq.c:3170 (vax730+0xa3a4c)
    #5 rq_wr /home/runner/work/open-simh/open-simh/PDP11/pdp11_rq.c:1516 (vax730+0x9589d)
    #6 WriteUb /home/runner/work/open-simh/open-simh/VAX/vax730_uba.c:308 (vax730+0x8bed7)
    #7 WriteIO /home/runner/work/open-simh/open-simh/VAX/vax730_uba.c:359 (vax730+0x8c0d9)
    #8 WriteW /home/runner/work/open-simh/open-simh/VAX/vax_mmu.h:390 (vax730+0x36af5)
    #9 Write /home/runner/work/open-simh/open-simh/VAX/vax_mmu.h:211 (vax730+0x36098)
    #10 sim_instr /home/runner/work/open-simh/open-simh/VAX/vax_cpu.c:1612 (vax730+0x4046b)
    #11 run_cmd /home/runner/work/open-simh/open-simh/scp.c:9319 (vax730+0x1328ec)
    #12 vax730_boot /home/runner/work/open-simh/open-simh/VAX/vax730_sys.c:484 (vax730+0x89ef7)
    #13 do_cmd_label /home/runner/work/open-simh/open-simh/scp.c:4210 (vax730+0x11795b)
    #14 call_cmd /home/runner/work/open-simh/open-simh/scp.c:5630 (vax730+0x11e65f)
    #15 do_cmd_label /home/runner/work/open-simh/open-simh/scp.c:4210 (vax730+0x11795b)
    #16 do_cmd /home/runner/work/open-simh/open-simh/scp.c:4034 (vax730+0x115ede)
    #17 main /home/runner/work/open-simh/open-simh/scp.c:2938 (vax730+0x10e7ff)
  Thread T3 (tid=16957, running) created by main thread at:
    #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:962 (libtsan.so.0+0x5ea79)
    #1 sim_disk_set_async /home/runner/work/open-simh/open-simh/sim_disk.c:601 (vax730+0x16e7fd)
    #2 _sim_disk_io_flush /home/runner/work/open-simh/open-simh/sim_disk.c:956 (vax730+0x171212)
    #3 sim_disk_reset /home/runner/work/open-simh/open-simh/sim_disk.c:3615 (vax730+0x17f015)
    #4 rq_reset /home/runner/work/open-simh/open-simh/PDP11/pdp11_rq.c:3170 (vax730+0xa3a4c)
    #5 rq_wr /home/runner/work/open-simh/open-simh/PDP11/pdp11_rq.c:1516 (vax730+0x9589d)
    #6 WriteUb /home/runner/work/open-simh/open-simh/VAX/vax730_uba.c:308 (vax730+0x8bed7)
    #7 WriteIO /home/runner/work/open-simh/open-simh/VAX/vax730_uba.c:359 (vax730+0x8c0d9)
    #8 WriteW /home/runner/work/open-simh/open-simh/VAX/vax_mmu.h:390 (vax730+0x36af5)
    #9 Write /home/runner/work/open-simh/open-simh/VAX/vax_mmu.h:211 (vax730+0x36098)
    #10 sim_instr /home/runner/work/open-simh/open-simh/VAX/vax_cpu.c:1612 (vax730+0x4046b)
    #11 run_cmd /home/runner/work/open-simh/open-simh/scp.c:9319 (vax730+0x1328ec)
    #12 vax730_boot /home/runner/work/open-simh/open-simh/VAX/vax730_sys.c:484 (vax730+0x89ef7)
    #13 do_cmd_label /home/runner/work/open-simh/open-simh/scp.c:4210 (vax730+0x11795b)
    #14 call_cmd /home/runner/work/open-simh/open-simh/scp.c:5630 (vax730+0x11e65f)
    #15 do_cmd_label /home/runner/work/open-simh/open-simh/scp.c:4210 (vax730+0x11795b)
    #16 do_cmd /home/runner/work/open-simh/open-simh/scp.c:4034 (vax730+0x115ede)
    #17 main /home/runner/work/open-simh/open-simh/scp.c:2938 (vax730+0x10e7ff)
SUMMARY: ThreadSanitizer: data race /home/runner/work/open-simh/open-simh/scp.c:435 in sim_aio_activate
==================
markpizz commented 1 year ago

What does this report mean?

Have you looked deeply into, and do you completely understand, how asynch I/O works for this activity?

Otherwise, the "problem report" is a false positive.

bscottm commented 1 year ago

@markpizz: "Ready? Fire!! Aim!!" much?

This is the output from a thread sanitizer run (Debug build configuration, so you get nice tracebacks). The ThreadSanitizer description says it looks for "... two threads [that] access the same variable concurrently and at least one of the accesses is write," which is exactly what is happening in the traceback.

You can agree to disagree about whether the data race is legitimate; IMHO, it's better to have correct code that passes the sanitizers.

markpizz commented 1 year ago

You have nicely described how the thread sanitizer works.

You have not described how async I/O in simh works and the synchronization it actually performs for simulated I/O.

Someone completely familiar with both the thread sanitizer and with how simh asynch I/O works will conclude that you've got a false positive. To reach that conclusion, or to become fully convinced that the implementation is broken, you've got to propose a specific, detailed fix that addresses the needs of simh and also passes the thread sanitizer.

The simh asynch I/O has worked 100% reliably for almost 15 years. It has even provided reliable behavior in Sergey Oboguev's 32-CPU multiprocessor VAX simulator, released back in 2012, which had a concurrently running thread for each CPU and ran on a 32-CPU host system.

"Ready? Fire!! Aim!!" much? Determine you have a REAL PROBLEM BEFORE firing off problem reports.

bscottm commented 1 year ago

@markpizz: It's a legitimate race because sim_asynch_check is modified by both the main and the AIO threads, which is what the thread sanitizer detected. There is no memory fence around reads or writes of sim_asynch_check, which can cause a RAW hazard across cores. This is an artifact of speculative, out-of-order (OOO) pipelines running on multiple cores.

I have a presumptive fix that adds a mutex region around accesses to sim_asynch_check. I'm also testing a lock-free version that uses the __atomic* primitives on GCC/Clang and InterlockedDecrement and friends in the Windows API, falling back to the mutex version on other platforms.
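
For concreteness, the portable wrapper I have in mind is roughly shaped like this (a sketch only; the function name is illustrative, not the actual patch):

    /* Sketch of a portable atomic decrement; aio_atomic_dec_int32 is an
       illustrative name, not existing simh API. */
    #if defined(_WIN32)
    #include <windows.h>
    static int aio_atomic_dec_int32 (volatile int *p)
    {
        return (int) InterlockedDecrement ((volatile LONG *) p);
    }
    #elif defined(__GNUC__) || defined(__clang__)
    static int aio_atomic_dec_int32 (volatile int *p)
    {
        return __atomic_sub_fetch (p, 1, __ATOMIC_SEQ_CST);
    }
    #else                       /* fall back to a mutex for other compilers */
    #include <pthread.h>
    static pthread_mutex_t aio_check_lock = PTHREAD_MUTEX_INITIALIZER;
    static int aio_atomic_dec_int32 (volatile int *p)
    {
        int v;
        pthread_mutex_lock (&aio_check_lock);
        v = --(*p);
        pthread_mutex_unlock (&aio_check_lock);
        return v;
    }
    #endif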

bscottm commented 1 year ago

And in case you aren't familiar with Hennessy and Patterson, "RAW" is "read after write".

markpizz commented 1 year ago

You have indeed proven that there is a RAW, but you haven't looked deeply enough into what sim_asynch_check is for and why it exists.

Although there is technically a data race on this variable, the consequence of a potential collision on access to this variable is 0.

sim_asynch_check exists specifically to avoid acquiring and releasing a mutex (or performing a lock-free queue operation) on absolutely every instruction being executed, which would definitely impact instruction execution rates. It is decremented on each instruction, and when the primary execution thread (which is doing the decrementing) observes that it is <= 0, the mutex (or queue) operation is performed to determine whether any asynchronous I/O operations have completed. Asynchronous I/O completions are very rare events, so a ridiculous number (multiple millions) of instructions will execute between any asynchronous I/O completions. When one does complete, the completing thread sets sim_asynch_check to 0. That store might actually take effect, or, as you've identified, it might not; in the worst case the latency from the actual completion to the queue check might be as high as sim_asynch_inst_latency instructions. This will usually be 0 instructions, but might be some random value between 1 and sim_asynch_inst_latency. The worst case amounts to some relatively small number of microseconds. Since the asynchronous I/O operations that are supported are disk and tape operations, which are expected to complete in multiple milliseconds, the potential delay of a few microseconds will never be noticed!
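
In rough outline, the scheme just described looks like this (a simplified sketch, not the actual simh macros; check_asynch_completion_queue is a hypothetical stand-in for the real mutex/queue check):

    /* Simplified sketch of the countdown described above, not simh code. */
    static volatile int sim_asynch_check;           /* per-instruction countdown */
    static int sim_asynch_inst_latency = 100000;    /* instructions between checks */

    static void check_asynch_completion_queue (void)
    {
        /* mutex acquire/release or lock-free queue drain goes here */
    }

    /* Main (CPU) thread, once per simulated instruction: */
    static void per_instruction_check (void)
    {
        if (--sim_asynch_check <= 0) {
            check_asynch_completion_queue ();
            sim_asynch_check = sim_asynch_inst_latency;
        }
    }

    /* AIO thread, when an asynchronous I/O operation completes: */
    static void request_early_check (void)
    {
        sim_asynch_check = 0;   /* ask the CPU thread to check on its next instruction */
    }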

bscottm commented 1 year ago

There are two issues here: (1) sanitizer-detected issues, (2) correctness.

Sanitizer-detected issues: If open-simh decides that running the sanitizers is A Good Thing, then there can be no sanitizer-detected failures. Even if you consider this a "false positive", it's still a failure and will have to be corrected for the CI/CD pipeline to successfully complete its work.

Correctness: The main problem here is that there are both WAW and RAW hazards across the main and AIO threads. It is not entirely true that sim_asynch_check can be used without acquiring a mutex or performing an atomic operation, because both threads alter its value. In effect, sim_asynch_check acts as a semaphore across the two threads: the AIO thread, in sim_aio_activate, unlocks the idle mutex (which is sim_asynch_lock, but conceptually separate) and forces a queue run on the next AIO_CHECK_EVENT.

And this is the subtle bug that the thread sanitizer has identified: AIO_CHECK_EVENT shouldn't simply write sim_asynch_inst_latency back into sim_asynch_check, but should do a compare/exchange instead. If sim_aio_activate wants to force a queue run, then that queue run should happen on the next AIO_CHECK_EVENT, because that's the intended behavior to ensure that delay-sensitive events execute.

I've written a potential fix that uses Win32 Interlocked* and GCC/Clang atomics, and that defaults to a mutex implementation for older compilers. It passes the thread sanitizer, although its sequentially consistent memory semantics could probably be relaxed.
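
The shape of the compare/exchange idea is roughly the following (a sketch using the GCC/Clang builtins only; the function names are illustrative, and the actual fix also carries the Win32 and mutex variants):

    /* Sketch of the compare/exchange variant of the countdown reload;
       not the actual patch. */

    /* AIO thread: request an immediate queue run. */
    static void aio_force_check (volatile int *check)
    {
        __atomic_store_n (check, 0, __ATOMIC_RELEASE);
    }

    /* CPU thread, when the countdown expires: reload it only if no one
       zeroed it behind our back.  `observed` is the expired value the CPU
       thread just saw; if the AIO thread stored 0 in the meantime, the
       exchange fails, the 0 stays in place, and the forced queue run is
       not lost. */
    static int aio_reload_countdown (volatile int *check, int observed, int latency)
    {
        return __atomic_compare_exchange_n (check, &observed, latency,
                                            0 /* strong */,
                                            __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    }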

markpizz commented 1 year ago

Sanitizer-detected issues: If open-simh decides that running the sanitizers is A Good Thing, then there can be no sanitizer-detected failures. Even if you consider this a "false positive", it's still a failure and will have to be corrected for the CI/CD pipeline to successfully complete its work.

It is indeed a false positive from a functional point of view. It is AS DESIGNED.

Apart from that, simh from the get-go has always been very deliberately implemented so that simulator code runs in a single thread. This is so that the behavior is always deterministic, and so that functional correctness of the devices being simulated can be achieved without the simulator developer needing to manage the potentially complex timing and synchronization issues raised by separately running threads (or, worse, processes). When the asynchronous I/O functionality was added to simh, the precise deterministic behavior was preserved: when it is needed for the sake of precise development, asynch I/O can be disabled without any effect on the simulator code. The asynch I/O facilities are specifically provided so that devices which perform operations that complete in multiple (or tens of) milliseconds can be realistically simulated without impacting the relative wall-clock time of instruction execution. The original simulator code would squeeze complete I/O activities in between instructions. For many simulators this is completely fine, because a precise wall-clock-to-instruction-execution ratio doesn't actually matter. For other simulators, whose simulated OSes actually supported multiple processes (with some processes executing while others waited for I/O completion) and had wall clocks that meant something, having some random instructions take possibly milliseconds to execute rather than less than a microsecond was certainly not optimal.

The simh codebase has (at the SCP level) optional support for asynchronous I/O for 3 types of devices: DISK, TAPE, and LAN-based networking. Your testing with the thread sanitizer has demonstrated completely expected behavior and reports a false positive which is completely irrelevant to correct simulator behavior. This functionality has been in place and completely stable for going on some 15 years. Some simulators which use SDL for video support have other multi-threaded behaviors, explicitly provided by SCP or internally within SDL, which you may or may not have exercised in your thread sanitizer tests. These also have been stable for a long time and, I believe, properly synchronize their threading activities. Beyond these basics, if the thread testing can't tolerate the existing false positive, I see no value in built-in regular testing.

Correctness: The main problem here is that there are both WAW and RAW hazards across the main and AIO threads. It is not entirely true that sim_asynch_check can be used without acquiring a mutex or performing an atomic operation, because both threads alter its value. In effect, sim_asynch_check acts as a semaphore across the two threads: the AIO thread, in sim_aio_activate, unlocks the idle mutex (which is sim_asynch_lock, but conceptually separate) and forces a queue run on the next AIO_CHECK_EVENT.

The point of correctness (to precisely satisfy the thread sanitizer tests) is not only of no value, BUT will absolutely impact performance of the simulator and possibly other activities on the host system. The impact will come from the rate (while executing instructions) of performing memory operations which force the host system to synchronize cross-processor memory references to satisfy the correctness implementation. As I said earlier, the simulator will likely be executing millions of instructions between asynchronous I/O operations. If the thread completing an I/O happens to successfully set the sim_asynch_check value to 0, that's great. If it doesn't, the extra handful of microseconds that instruction execution will continue before the unmodified sim_asynch_check expires has no practical impact on simulator operation.

I've written a potential fix that uses Win32 Interlocked* and GCC/Clang atomics, and that defaults to a mutex implementation for older compilers. It passes the thread sanitizer, although its sequentially consistent memory semantics could probably be relaxed.

I don't understand what you're saying here. Specifically, the last sentence.

bscottm commented 1 year ago

Mark: Respectfully, you are assuming that there is a strong consistency guarantee across cores. That's actually reasonable for Intel x86 (but not Itanium) and IBM multicore processors. ARM doesn't make the same cross-core consistency guarantees, as you'll see if you follow the many discussion threads on C++ memory models and compiler support. The new(er) acquire, release, and sequential-consistency semantics are the result of minds more brilliant than yours and mine working hard on thread safety.

It's possible to (carefully) relax an implementation that uses sequential consistency into paired acquire (read barrier, read consistency) and release (write barrier, write consistency) operations.
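
For illustration, in C11-atomics terms that relaxation looks roughly like this (a generic sketch, not simh code):

    /* Generic acquire/release pairing sketch, not simh code. */
    #include <stdatomic.h>

    static atomic_int check_flag;

    /* Writer (e.g. the AIO thread): everything written before this store
       becomes visible to a reader that observes the 0 via an acquire load. */
    void force_check (void)
    {
        atomic_store_explicit (&check_flag, 0, memory_order_release);
    }

    /* Reader (e.g. the CPU thread): pairs with the release store above. */
    int poll_check (void)
    {
        return atomic_load_explicit (&check_flag, memory_order_acquire);
    }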

While this code "works as designed", it potentially causes an issue on ARM-based platforms. Just because problems haven't arisen on ARM doesn't mean that they won't. That's the point of the thread sanitizer: to head these kinds of issues off before someone spends countless hours trying to reproduce and diagnose them.

Sure, it's a false positive on x86 and x86_64. It's still a failing sanitizer test, no matter how hard you pound the table. The question is how to fix it so that the test passes with the minimal overhead penalty.

markpizz commented 1 year ago

There are thousands of use cases running on ARM every day, and no one has encountered a problem. Someone recently said that Oscar sold some 4000 PiDP11's...

As I said, the fundamental point of the simh code is NOT to depend on primary code paths executing multi-threaded, and the very small number of threading activities are known to be safe and haven't been elements of change for many years. The value of frequently running a thread sanitizer is very close to 0.

If you want to change the code in sim_aio_check_event() and sim_aio_activate() such that, with some macro/#ifdef magic, you have a SANITIZER_TEST mode that only takes effect when building to run the sanitizer tests, and otherwise behaves exactly as it does now, then I'd be glad to look at that. This would allow the sanitizer to avoid failing on the known false positives and potentially detect other cases of synchronization issues. Whatever testing you then run will need to exercise the other synchronization paths that leverage other threads. These would certainly include the various video interactions, which I'm not sure background sanitizer testing is good at.
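
One possible shape for that (a sketch only; SANITIZER_TEST is the proposed build flag above, and AIO_SET_CHECK is a hypothetical macro name, not existing simh code):

    /* Hypothetical sketch: plain store in normal builds, atomic store only
       when building for the sanitizer run. */
    extern int sim_asynch_check;

    #if defined(SANITIZER_TEST) && (defined(__GNUC__) || defined(__clang__))
    #define AIO_SET_CHECK(val)  __atomic_store_n (&sim_asynch_check, (val), __ATOMIC_SEQ_CST)
    #else
    #define AIO_SET_CHECK(val)  (sim_asynch_check = (val))
    #endif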

bscottm commented 1 year ago

The other possibility is that forcing the SIMH event queue from the AIO thread happens so infrequently that it's potentially not needed (i.e., it doesn't really matter whether the event queue runs "now!!" or after the next instruction batch). That would explain why the race is so infrequently observed in practice and only in the sanitizer -- setting sim_asynch_check to 0 doesn't matter. And if setting sim_asynch_check to 0 doesn't matter, then maybe it's safe to remove that code from sim_aio_activate.

FWIW, the sanitizers are not background tests. They are quite active and intercept many library calls. Quite clever how Google did them, actually.

A good paper on POWER and ARM weak memory consistency: "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models". A decent overview of consistency across various processors: "Hardware Memory Models". Both bear on why the sim_aio_activate code needs to make a decision on whether setting sim_asynch_check to 0 is really needed.

I frequently go through these same kind of discussions with space vehicle flight code contractors and almost invariably there's a root cause analysis, usually around 6-9 months after launch, that comes back to "yeah, you said it wasn't going to be a problem" and it became one. Sometimes it pays to code correctly.

markpizz commented 1 year ago

That would explain why the race is so infrequently observed in practice.

The race doesn't cause any problem in practice regardless of its frequency. It causes a problem for the thread sanitizer. No one runs a simulator under the sanitizer.

FWIW, the sanitizers are not background tests. They are quite active and intercept many library calls. Quite clever how Google did them, actually.

Who interacts with them in the CI context?

I frequently go through these same kind of discussions with space vehicle flight code contractors and almost invariably there's a root cause analysis, usually around 6-9 months after launch, that comes back to "yeah, you said it wasn't going to be a problem" and it became one. Sometimes it pays to code correctly.

You've done a similar job here and found a false positive which was deliberately implemented. Like I said, if you want to create a mechanism to avoid the false positive when the sanitizer tests are run, let's see it.

bscottm commented 1 year ago

That would explain why the race is so infrequently observed in practice.

The race doesn't cause any problem in practice regardless of its frequency. It causes a problem for the thread sanitizer. No one runs a simulator under the sanitizer.

You're making the argument that setting sim_asynch_check to 0 doesn't matter.

FWIW, the sanitizers are not background tests. They are quite active and intercept many library calls. Quite clever how Google did them, actually.

Who interacts with them in the CI context?

It's a combination of clever header file call interceptions and compiler support. Most of the work was initially done in Clang. No one interacts with ASan, MSan or TSan "directly."

I frequently go through these same kind of discussions with space vehicle flight code contractors and almost invariably there's a root cause analysis, usually around 6-9 months after launch, that comes back to "yeah, you said it wasn't going to be a problem" and it became one. Sometimes it pays to code correctly.

You've done a similar job here and found a false positive which was deliberately implemented. Like I said, if you want to create a mechanism to avoid the false positive when the sanitizer tests are run, let's see it.

I don't have the time today to go through the possible interleavings under various hardware memory models, but clearly it doesn't really matter whether the event queue fires "now!!" from sim_aio_activate, since both threads can use their local value and nothing untoward happens in SIMH. That makes a decent argument that sim_aio_activate doesn't ever need to touch sim_asynch_check.

bscottm commented 11 months ago

Closing with intention to reopen with a more detailed analysis.