Closed soumagne closed 6 years ago
@soumagne I'll take this. Thanks for reporting!
probably too late for 1.5.1
hmm... I seem to have problems getting configure to work with -fsanitize=thread
. Which modules are you using on cori?
No problem wrt 1.5.1. Sorry I was using theta but I just tried on cori and it seems fine too. Here are the modules that I have loaded.
Currently Loaded Modulefiles:
1) modules/3.2.10.6
2) nsg/1.2.0
3) gcc/6.3.0
4) craype-haswell
5) craype-network-aries
6) craype/2.5.12
7) cray-mpich/7.6.0
8) cray-shmem/7.6.0
9) udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari
10) ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari
11) pmi/5.0.12
12) dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari
13) gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari
14) xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari
15) job/2.2.2-6.0.4.0_8.2__g3c644b5.ari
16) dvs/2.7_2.2.31-6.0.4.1_6.1__gb3b87e6
17) alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari
18) rca/2.2.11-6.0.4.0_13.2__g84de67a.ari
19) PrgEnv-gnu/6.0.4
20) git/2.9.1
Also here is what I pass to configure:
./configure --prefix=$HOME/apps/libfabric/1.6.0a1_${hash} --enable-debug \
--disable-silent-rules --disable-verbs --disable-udp --disable-rxm --disable-rxd \
--disable-usnic CFLAGS="-fsanitize=thread -fno-omit-frame-pointer"
I assume the test you will run is within the gni provider package itself but if your test is external you only need to pass -fsanitize=thread
when you build the actual test/executable (if let's say you were to use the fabtests).
if cori is still available, could you try testing against the upstream 1.5.1 release? There's more recent stuff in master that may be causing the problem.
Hi Howard, I just tried on cori and it seems to be the same. The good news though is that this time it gives the exact location, looks like it is in line 207 of gnix_mr.c
. See below.
jsoumagn@nid00445:~/workspace/mercury/build> srun -N 1 ./bin/hg_test_server -c ofi -p gni -H ipogif0
==================
WARNING: ThreadSanitizer: unlock of an unlocked mutex (or by a wrong thread) (pid=39875)
#0 pthread_spin_unlock ../../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libsanitizer/tsan/tsan_interceptors.cc:1174 (libtsan.so.0+0x00000002bf04)
#1 __mr_reg prov/gni/src/gnix_mr.c:207 (libfabric.so.1+0x00000007597f)
#2 gnix_mr_regattr prov/gni/src/gnix_mr.c:321 (libfabric.so.1+0x00000007597f)
#3 na_ofi_endpoint_open /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1416 (libna_debug.so.0.9.1+0x00000000da78)
#4 na_ofi_initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1728 (libna_debug.so.0.9.1+0x00000000eb32)
#5 NA_Initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na.c:363 (libna_debug.so.0.9.1+0x0000000054af)
#6 NA_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/na/na_test.c:463 (hg_test_server+0x000000408239)
#7 HG_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/mercury_test.c:386 (hg_test_server+0x000000404d57)
#8 main /global/homes/j/jsoumagn/workspace/mercury/source/Testing/test_server.c:65 (hg_test_server+0x0000004026f2)
Location is heap block of size 344 at 0x7d480000bb80 allocated by main thread:
#0 calloc ../../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libsanitizer/tsan/tsan_interceptors.cc:555 (libtsan.so.0+0x000000029db1)
#1 gnix_domain_open prov/gni/src/gnix_dom.c:594 (libfabric.so.1+0x00000005b2d4)
#2 na_ofi_domain_open /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1131 (libna_debug.so.0.9.1+0x00000000c91f)
#3 na_ofi_initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1719 (libna_debug.so.0.9.1+0x00000000ea6f)
#4 NA_Initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na.c:363 (libna_debug.so.0.9.1+0x0000000054af)
#5 NA_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/na/na_test.c:463 (hg_test_server+0x000000408239)
#6 HG_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/mercury_test.c:386 (hg_test_server+0x000000404d57)
#7 main /global/homes/j/jsoumagn/workspace/mercury/source/Testing/test_server.c:65 (hg_test_server+0x0000004026f2)
Mutex M27 (0x7d480000bca8) created at:
#0 pthread_spin_init ../../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libsanitizer/tsan/tsan_interceptors.cc:1138 (libtsan.so.0+0x00000002bc0b)
#1 gnix_domain_open prov/gni/src/gnix_dom.c:618 (libfabric.so.1+0x00000005b387)
#2 na_ofi_domain_open /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1131 (libna_debug.so.0.9.1+0x00000000c91f)
#3 na_ofi_initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1719 (libna_debug.so.0.9.1+0x00000000ea6f)
#4 NA_Initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na.c:363 (libna_debug.so.0.9.1+0x0000000054af)
#5 NA_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/na/na_test.c:463 (hg_test_server+0x000000408239)
#6 HG_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/mercury_test.c:386 (hg_test_server+0x000000404d57)
#7 main /global/homes/j/jsoumagn/workspace/mercury/source/Testing/test_server.c:65 (hg_test_server+0x0000004026f2)
SUMMARY: ThreadSanitizer: unlock of an unlocked mutex (or by a wrong thread) prov/gni/src/gnix_mr.c:207 in __mr_reg
==================
# Initializing NA with ofi+gni://ipogif0:22222
# Waiting for client...
If you look there, there is indeed a
/* call cache register op to retrieve the right entry */
fastlock_release(&domain->mr_cache_lock);
fastlock_acquire(&info->mr_cache_lock);
On master, it is line 285 of gnix_mr.c
, line 207 is for tag v1.5.1
excellent sleuthing! thanks.
I'm getting the warning below when running the thread sanitizer with the gni provider. It appears that the fastlock created in gnix_domain_open is unlocked twice when calling
fi_enable()
on an endpoint. I believe you can reproduce it if you compile with-fanitize=thread
and run any example that callsfi_enable()
. Sorry my trace could not get all of the libfabric stack because of all the inline/define calls.