Closed timwoj closed 2 years ago
I like the comment above the line reported in the first one:
/* Hopefully we aren't racing with another thread, but you never know.. */
Hmm fun. OK, could you send a PR to run with TSAN under CI seeing as you appear to have a working invocation already :)
Sure. Pretty much all I did was set CFLAGS=-fsanitize=thread
prior to calling CMake.
Opened a PR, but I'm not sure what magic I need to get Travis to kick off a build from it. I thought it was originally that I didn't open it from a master branch, but that wasn't it.
Opened a PR, but I'm not sure what magic I need to get Travis to kick off a build from it. I thought it was originally that I didn't open it from a master branch, but that wasn't it.
Oh, it appears you guys are still on travis-ci.org, which is in read-only mode and won't let any new builds start. It needs to be migrated over to travis-ci.com (https://docs.travis-ci.com/user/migrate/open-source-repository-migration/#frequently-asked-questions). It looks like the last successful CI build was 10 months ago: https://travis-ci.org/github/mheily/libkqueue/requests.
sigh that's not something I can fix unfortunately as I don't have admin rights for the repo. @mheily could you take a look at redoing the travis integration?
@arr2036 I have migrated the Travis CI repositories to travis-ci.com, and updated the build status image in README.md. It currently shows several failing jobs where TSAN is enabled; example: https://travis-ci.com/github/mheily/libkqueue/builds/235346115
Those look like they match the errors I was seeing as well. Thanks @mheily !
I posted a PR in #107 that seem to fix the races in map.c, but am still seeing other data races in linux_kqueue_cleanup() and kqueue_free()
f3eab262fcf961cfd465c0f52fa6896a4782aca2 fixes the map issues. The others do appear to be genuine races. We need to hold the global lock before attempting to resolve the fd to a struct kqueue *
, and continue to hold it until we lock the struct kqueue
specific mutex. We also need to hold both the global mutex and the struct kqueue
specific mutex in kqueue_free()
.
I've started working on fixes. I'm planning to add the following extensions to control whether the mutexes are used (unless you have an objection @mheily?), as these extra mutexes are not required in applications which do not share kqueues between threads.
The reason why I wanted to avoid rwlocks in map.c
is because in our application there's actually a fair amount of churn in that table. All threads in our application create and destroy temporary event loops, and so even with an r/w lock, any time the table needed to be updated (during the creation of a temporary event loop), we'd block other threads trying to resolve file descriptors to kqueues.
#define EVFILT_LIBKQUEUE (-12) //!< libkqueue specific extensions.
/** @name libkqueue extension flags
*
* Unless otherwise specified EV_ADD/EV_ENABLE toggle the configuration option to true.
* Unless otherwise specified EV_DELETE/EV_DISABLE toggle the configuration option to false.
*
* @{
*/
#define NOTE_LIBKQUEUE_THREADSAFE_MAP 0x0001
//!< Synchronise concurrent accesses to the
///< fd <-> kqueue mapping table.
///< Only needed if kqueues are created by
///< one thread and closed in another.
///< default - enabled.
#define NOTE_LIBKQUEUE_THREADSAFE 0x0001
///< Synchronise concurrent operations on a
///< given kqueue.
///< May be disabled if a given kqueue is only
///< used/closed by the thread that created it.
///< default - enabled.
#define NOTE_LIBKQUEUE_LOG_FUNC 0x0002 //!< For debug builds, call the logging function
///< provided in data.
/** @} */
Any luck with the rest of these? I'm looking at upgrading our local libkqueue submodule pointer, but I'd love to have this stuff in a release before I update it.
I can confirm I've noted these same TSAN issues in opencflite and opencfnetwork:
make[2]: Entering directory '${HOME}/src/git/github.com/gerickson/opencflite/tmp-build-fsanitize-thread/examples/CFRunLoopTimerExample'
make CFRunLoopTimerExample
make[3]: Entering directory '${HOME}/src/git/github.com/gerickson/opencflite/tmp-build-fsanitize-thread/examples/CFRunLoopTimerExample'
make[3]: 'CFRunLoopTimerExample' is up to date.
make[3]: Leaving directory '${HOME}/src/git/github.com/gerickson/opencflite/tmp-build-fsanitize-thread/examples/CFRunLoopTimerExample'
/bin/bash ../../libtool --mode execute ./CFRunLoopTimerExample 0.25 5.00
Will fire a total of 1 timer for 5 seconds.
Will fire timer 0 every 0.25 seconds for 5 seconds, up to 20 times (with a tolerance of 1 time).
2021-12-01 07:13:04.175, timer: 0, delta: 0.250, variance: +0.000, iteration: 0
2021-12-01 07:13:04.425, timer: 0, delta: 0.250, variance: +0.000, iteration: 1
2021-12-01 07:13:04.675, timer: 0, delta: 0.250, variance: -0.000, iteration: 2
2021-12-01 07:13:04.925, timer: 0, delta: 0.250, variance: +0.000, iteration: 3
2021-12-01 07:13:05.175, timer: 0, delta: 0.250, variance: -0.000, iteration: 4
2021-12-01 07:13:05.426, timer: 0, delta: 0.251, variance: +0.001, iteration: 5
2021-12-01 07:13:05.675, timer: 0, delta: 0.249, variance: -0.001, iteration: 6
2021-12-01 07:13:05.925, timer: 0, delta: 0.250, variance: -0.000, iteration: 7
2021-12-01 07:13:06.176, timer: 0, delta: 0.251, variance: +0.001, iteration: 8
2021-12-01 07:13:06.425, timer: 0, delta: 0.250, variance: -0.000, iteration: 9
2021-12-01 07:13:06.676, timer: 0, delta: 0.250, variance: +0.000, iteration: 10
2021-12-01 07:13:06.926, timer: 0, delta: 0.250, variance: +0.000, iteration: 11
2021-12-01 07:13:07.175, timer: 0, delta: 0.249, variance: -0.001, iteration: 12
2021-12-01 07:13:07.425, timer: 0, delta: 0.250, variance: +0.000, iteration: 13
2021-12-01 07:13:07.675, timer: 0, delta: 0.250, variance: -0.000, iteration: 14
2021-12-01 07:13:07.925, timer: 0, delta: 0.250, variance: +0.000, iteration: 15
2021-12-01 07:13:08.175, timer: 0, delta: 0.250, variance: -0.000, iteration: 16
2021-12-01 07:13:08.426, timer: 0, delta: 0.251, variance: +0.001, iteration: 17
2021-12-01 07:13:08.675, timer: 0, delta: 0.249, variance: -0.001, iteration: 18
2021-12-01 07:13:08.926, timer: 0, delta: 0.250, variance: +0.000, iteration: 19
==================
WARNING: ThreadSanitizer: data race (pid=193538)
Write of size 8 at 0x7ba000000030 by thread T1 (mutexes: write M14):
#0 close ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1675 (libtsan.so.0+0x3458a)
#1 linux_kqueue_cleanup ${HOME}/src/git/github.com/gerickson/libkqueue/src/linux/platform.c:396 (libkqueue.so.0+0x88cd)
Previous read of size 8 at 0x7ba000000030 by main thread (mutexes: write M12, write M3794586426146832, write M16):
#0 epoll_ctl ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1809 (libtsan.so.0+0x544bc)
#1 linux_evfilt_user_knote_delete ${HOME}/src/git/github.com/gerickson/libkqueue/src/linux/user.c:183 (libkqueue.so.0+0x100a9)
#2 __CFRunLoopDeallocate ../CFRunLoop.c:2028 (libCoreFoundation.so.635+0xe4b99)
#3 __CFRunLoopDestroyRunLoop ../CFRunLoop.c:2161 (libCoreFoundation.so.635+0xe51c3)
#4 CFDictionaryApplyFunction ../CFDictionary.c:736 (libCoreFoundation.so.635+0x9f863)
#5 __CFRunLoopDestroyRunLoops ../CFRunLoop.c:2178 (libCoreFoundation.so.635+0xe525b)
#6 __CFRunLoopDestroy ../CFRunLoop.c:2199 (libCoreFoundation.so.635+0xe52e7)
#7 <null> <null> (ld-linux-x86-64.so.2+0x11f5a)
Location is file descriptor 3 created by main thread at:
#0 epoll_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1793 (libtsan.so.0+0x32d7a)
#1 linux_kqueue_init ${HOME}/src/git/github.com/gerickson/libkqueue/src/linux/platform.c:271 (libkqueue.so.0+0x813a)
#2 _CFRunLoopGet0 ../CFRunLoop.c:2213 (libCoreFoundation.so.635+0xe53fc)
#3 CFRunLoopGetCurrent ../CFRunLoop.c:2341 (libCoreFoundation.so.635+0xe5b1f)
#4 main ../../../examples/CFRunLoopTimerExample/CFRunLoopTimerExample.c:414 (CFRunLoopTimerExample+0x1eeb)
Mutex M14 (0x7fc9776cd2e0) created at:
#0 pthread_mutex_lock ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:4165 (libtsan.so.0+0x526fc)
#1 kqueue ${HOME}/src/git/github.com/gerickson/libkqueue/src/common/kqueue.c:173 (libkqueue.so.0+0x6e1e)
#2 _CFRunLoopGet0 ../CFRunLoop.c:2213 (libCoreFoundation.so.635+0xe53fc)
#3 CFRunLoopGetCurrent ../CFRunLoop.c:2341 (libCoreFoundation.so.635+0xe5b1f)
#4 main ../../../examples/CFRunLoopTimerExample/CFRunLoopTimerExample.c:414 (CFRunLoopTimerExample+0x1eeb)
Mutex M12 (0x7fc977c1408c) created at:
#0 pthread_spin_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1268 (libtsan.so.0+0x36aef)
#1 __CFSpinLock ../CFInternal.h:480 (libCoreFoundation.so.635+0xe0736)
#2 _CFRunLoopGet0 ../CFRunLoop.c:2209 (libCoreFoundation.so.635+0xe5389)
#3 CFRunLoopGetCurrent ../CFRunLoop.c:2341 (libCoreFoundation.so.635+0xe5b1f)
#4 main ../../../examples/CFRunLoopTimerExample/CFRunLoopTimerExample.c:414 (CFRunLoopTimerExample+0x1eeb)
Mutex M3794586426146832 is already destroyed.
Mutex M16 (0x7b94000007f0) created at:
#0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1220 (libtsan.so.0+0x4a616)
#1 kqueue ${HOME}/src/git/github.com/gerickson/libkqueue/src/common/kqueue.c:182 (libkqueue.so.0+0x6e78)
#2 _CFRunLoopGet0 ../CFRunLoop.c:2213 (libCoreFoundation.so.635+0xe53fc)
#3 CFRunLoopGetCurrent ../CFRunLoop.c:2341 (libCoreFoundation.so.635+0xe5b1f)
#4 main ../../../examples/CFRunLoopTimerExample/CFRunLoopTimerExample.c:414 (CFRunLoopTimerExample+0x1eeb)
Thread T1 'libkqueue_mon' (tid=193565, running) created by main thread at:
#0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:962 (libtsan.so.0+0x5ea79)
#1 linux_kqueue_start_thread ${HOME}/src/git/github.com/gerickson/libkqueue/src/linux/platform.c:257 (libkqueue.so.0+0x806e)
#2 linux_kqueue_init ${HOME}/src/git/github.com/gerickson/libkqueue/src/linux/platform.c:350 (libkqueue.so.0+0x8619)
#3 _CFRunLoopGet0 ../CFRunLoop.c:2213 (libCoreFoundation.so.635+0xe53fc)
#4 CFRunLoopGetCurrent ../CFRunLoop.c:2341 (libCoreFoundation.so.635+0xe5b1f)
#5 main ../../../examples/CFRunLoopTimerExample/CFRunLoopTimerExample.c:414 (CFRunLoopTimerExample+0x1eeb)
SUMMARY: ThreadSanitizer: data race ${HOME}/src/git/github.com/gerickson/libkqueue/src/linux/platform.c:396 in linux_kqueue_cleanup
==================
==================
WARNING: ThreadSanitizer: data race (pid=193538)
Write of size 1 at 0x7b94000007f0 by thread T1 (mutexes: write M14):
#0 pthread_mutex_destroy ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1237 (libtsan.so.0+0x3725c)
#1 kqueue_free ${HOME}/src/git/github.com/gerickson/libkqueue/src/common/kqueue.c:132 (libkqueue.so.0+0x6d7c)
Previous atomic read of size 1 at 0x7b94000007f0 by main thread (mutexes: write M12, write M3794586426146832, write M4639475212748784):
#0 pthread_mutex_unlock ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:4179 (libtsan.so.0+0x3aadc)
#1 kevent ${HOME}/src/git/github.com/gerickson/libkqueue/src/common/kevent.c:321 (libkqueue.so.0+0x49f5)
#2 __CFRunLoopDeallocate ../CFRunLoop.c:2028 (libCoreFoundation.so.635+0xe4b99)
#3 __CFRunLoopDestroyRunLoop ../CFRunLoop.c:2161 (libCoreFoundation.so.635+0xe51c3)
#4 CFDictionaryApplyFunction ../CFDictionary.c:736 (libCoreFoundation.so.635+0x9f863)
#5 __CFRunLoopDestroyRunLoops ../CFRunLoop.c:2178 (libCoreFoundation.so.635+0xe525b)
#6 __CFRunLoopDestroy ../CFRunLoop.c:2199 (libCoreFoundation.so.635+0xe52e7)
#7 <null> <null> (ld-linux-x86-64.so.2+0x11f5a)
Location is heap block of size 8256 at 0x7b9400000000 allocated by main thread:
#0 calloc ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:668 (libtsan.so.0+0x305aa)
#1 kqueue ${HOME}/src/git/github.com/gerickson/libkqueue/src/common/kqueue.c:178 (libkqueue.so.0+0x6e4c)
#2 _CFRunLoopGet0 ../CFRunLoop.c:2213 (libCoreFoundation.so.635+0xe53fc)
#3 CFRunLoopGetCurrent ../CFRunLoop.c:2341 (libCoreFoundation.so.635+0xe5b1f)
#4 main ../../../examples/CFRunLoopTimerExample/CFRunLoopTimerExample.c:414 (CFRunLoopTimerExample+0x1eeb)
Mutex M14 (0x7fc9776cd2e0) created at:
#0 pthread_mutex_lock ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:4165 (libtsan.so.0+0x526fc)
#1 kqueue ${HOME}/src/git/github.com/gerickson/libkqueue/src/common/kqueue.c:173 (libkqueue.so.0+0x6e1e)
#2 _CFRunLoopGet0 ../CFRunLoop.c:2213 (libCoreFoundation.so.635+0xe53fc)
#3 CFRunLoopGetCurrent ../CFRunLoop.c:2341 (libCoreFoundation.so.635+0xe5b1f)
#4 main ../../../examples/CFRunLoopTimerExample/CFRunLoopTimerExample.c:414 (CFRunLoopTimerExample+0x1eeb)
Mutex M12 (0x7fc977c1408c) created at:
#0 pthread_spin_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1268 (libtsan.so.0+0x36aef)
#1 __CFSpinLock ../CFInternal.h:480 (libCoreFoundation.so.635+0xe0736)
#2 _CFRunLoopGet0 ../CFRunLoop.c:2209 (libCoreFoundation.so.635+0xe5389)
#3 CFRunLoopGetCurrent ../CFRunLoop.c:2341 (libCoreFoundation.so.635+0xe5b1f)
#4 main ../../../examples/CFRunLoopTimerExample/CFRunLoopTimerExample.c:414 (CFRunLoopTimerExample+0x1eeb)
Mutex M3794586426146832 is already destroyed.
Mutex M4639475212748784 is already destroyed.
Thread T1 'libkqueue_mon' (tid=193565, running) created by main thread at:
#0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:962 (libtsan.so.0+0x5ea79)
#1 linux_kqueue_start_thread ${HOME}/src/git/github.com/gerickson/libkqueue/src/linux/platform.c:257 (libkqueue.so.0+0x806e)
#2 linux_kqueue_init ${HOME}/src/git/github.com/gerickson/libkqueue/src/linux/platform.c:350 (libkqueue.so.0+0x8619)
#3 _CFRunLoopGet0 ../CFRunLoop.c:2213 (libCoreFoundation.so.635+0xe53fc)
#4 CFRunLoopGetCurrent ../CFRunLoop.c:2341 (libCoreFoundation.so.635+0xe5b1f)
#5 main ../../../examples/CFRunLoopTimerExample/CFRunLoopTimerExample.c:414 (CFRunLoopTimerExample+0x1eeb)
SUMMARY: ThreadSanitizer: data race ${HOME}/src/git/github.com/gerickson/libkqueue/src/common/kqueue.c:132 in kqueue_free
==================
ThreadSanitizer: reported 2 warnings
make[2]: *** [Makefile:634: check] Error 66
make[2]: Leaving directory '${HOME}/src/git/github.com/gerickson/opencflite/tmp-build-fsanitize-thread/examples/CFRunLoopTimerExample'
The dependent project was configured with:
% ./configure -C --with-icu=/usr --with-tz-includes=${HOME}/tmp/tzcode2021a CFLAGS="-O0 -g -fsanitize=thread" LDFLAGS="-O0 -g -fsanitize=thread" CPPFLAGS="-O0 -g -fsanitize=thread"
and built and tested with:
% make -j3 && make check
All errors reported by TSAN have now been fixed and locking has been simplified substantially.
For common code we now have two types of mutex, the global mutex which is used to synchronise the fd map, the linked list of active kqueues, and the global kqueue counter, and kq specific mutexes which are used to prevent conflicting operations on the same kq.
I've removed the filter specific mutexes, there was never much of a reason to use them. This means only a single thread can operate on a kq at a given time.
If you're still experiencing issues with current HEAD, send over a PR with a test case which reproduces the issue and I'll take a look.
Great, thanks for getting this sorted out! I'll pull the new version over into our builds this week and let you know if we run into anything else.
Awesome, thank you! I'll pull from top of tree and get this verified in my dependent projects.
Looks solid from my end. Huge thanks again from the Zeek team for fixing all of these.
@timwoj great, thanks for confirming. Just checking a hang-on-exit issue is fixed in another project and I'll cut a new release.
OK, it's fixed. No more hanging CI jobs. v2.6.0 has been tagged.
Confirmed with opencflite that all thread sanitizer issues have been resolved. Thanks!
@gerickson awesome, thanks for confirming!
I've been adding support for ThreadSanitizer to the zeek project (which uses libkqueue) and it was reporting issues out of libkqueue. It appears the same issues happen with the built-in tests. ctest+tsan output is below. I haven't looked at whether any of these are easy fixes.