sandialabs / qthreads

Lightweight locality-aware user-level threading runtime.
https://www.sandia.gov/qthreads/
Other
173 stars 35 forks source link

Race Conditions in `qutil` #238

Open insertinterestingnamehere opened 8 months ago

insertinterestingnamehere commented 8 months ago

I'm seeing a handful of thread sanitizer errors about race conditions in the qutil code.

One example: apparently the macro invoked at https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L152, https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L156, and https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L160 includes both a read and a corresponding atomic write from another thread.

Another example: non-atomic read: https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L897 atomic write to the same variable: https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L903

Similar: https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L909 https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L915

Another more unusual one: The macro invoked at https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/qutil.c#L170 has a non-atomic write as well as an atomic write to the same variable from a different thread via qthread_syncvar_fill at https://github.com/sandialabs/qthreads/blob/a2d1dbadb80a161a4a287c1056348808b022ba04/src/syncvar.c#L609.

insertinterestingnamehere commented 7 months ago

Poking at some of the memory fences for syncvars seems like a partial fix. Some additional atomics in the nemesis scheduler may be necessary to get the sequencing right, but at this point I'm still not sure. The weird thing though is that some of the thread sanitizer tracebacks here seem to be getting garbled and it's not at all clear why/how that's happening. It may be that a better-debugged https://github.com/sandialabs/qthreads/pull/249 is needed to fix these warnings.

insertinterestingnamehere commented 2 weeks ago

Update on this one, I'm fairly sure a lot (all?) of these errors are caused by thread sanitizer not being able to properly follow dependencies through the atomic logic in our various threadqueues. It's fixable, but it'll require an extension beyond the interop in #249.