Some helgrind tests fail when swarm is compiled for coverage testing

torognes commented 4 years ago

Some of the swarm tests that use helgrind, a part of valgrind, fail when swarm is compiled for coverage testing (more specificly with the -fprofile-arcs option).

The tests that run swarm with nonzero d (e.g. -d 1 or -d 2) and more than one thread (e.g. -t 2) fail due to reports of possible data race conditions, like below (with -d 1 -t 2):

==5473== Lock at 0x52F27A0 was first observed
==5473==    at 0x4A0BDA3: pthread_mutex_init (hg_intercepts.c:429)
==5473==    by 0x41D063: ThreadRunner::ThreadRunner(int, void (*)(long)) (threads.h:87)
==5473==    by 0x41A61A: algo_d1_run() (algod1.cc:757)
==5473==    by 0x4049F5: main (swarm.cc:709)
==5473== 
==5473== Lock at 0x52F2728 was first observed
==5473==    at 0x4A0BDA3: pthread_mutex_init (hg_intercepts.c:429)
==5473==    by 0x41D063: ThreadRunner::ThreadRunner(int, void (*)(long)) (threads.h:87)
==5473==    by 0x41A61A: algo_d1_run() (algod1.cc:757)
==5473==    by 0x4049F5: main (swarm.cc:709)
==5473== 
==5473== Possible data race during write of size 8 at 0x639518 by thread #3
==5473== Locks held: 1, at address 0x52F27A0
==5473==    at 0x41CEEF: ThreadRunner::worker(void*) (threads.h:59)
==5473==    by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==5473==    by 0x3D2BC07AA0: start_thread (in /lib64/libpthread-2.12.so)
==5473==    by 0x3D2B0E8C4C: clone (in /lib64/libc-2.12.so)
==5473== 
==5473== This conflicts with a previous write of size 8 by thread #2
==5473== Locks held: 1, at address 0x52F2728
==5473==    at 0x41CEEF: ThreadRunner::worker(void*) (threads.h:59)
==5473==    by 0x4A0C0D4: mythread_wrapper (hg_intercepts.c:219)
==5473==    by 0x3D2BC07AA0: start_thread (in /lib64/libpthread-2.12.so)
==5473==    by 0x3D2B0E8C4C: clone (in /lib64/libc-2.12.so)

torognes commented 4 years ago

It is possible that this is a false positive reported by helgrind due to the use of POSIX condition variables. Paragraph 3 under section 7.5 in the helgrind documentation encourages the use of POSIX semaphores instead of condition variables because helgrind is unable to fully control the latter and may report false positives.

The problem may only appear when compiling for coverage testing because the program runs much slower than usual and perhaps not in parallel.

torognes commented 4 years ago

I'll try to rewrite the code to use semaphores instead of condition variables to see if that helps. Other have reported better performance with semaphores as well.

torognes commented 4 years ago

It seems like the problems persist even if not using condition variables. They may be caused directly by the modifications made by the compiler when preparing for coverage testing.

torognes commented 4 years ago

We could compile swarm twice, once normally for ordinary testing and once for coverage testing. The first time we fail if any of the tests fail. The second time we just ignore failed tests.

Alternatively, we could just disable the helgrind tests.

frederic-mahe commented 4 years ago

Alternatively, we could just disable the helgrind tests.

I vote for that option. I think it is safe to assume that the helgrind warnings we get are a side effect of the code coverage instrumentation, as they do not occur when testing our release binary. We can always enable helgrind tests again when we want to validate modifications of the threading implementation.

torognes commented 4 years ago

Ok, let's keep the helgrind tests disabled for now.

torognes / swarm

Some helgrind tests fail when swarm is compiled for coverage testing #151