ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/
Other
16 stars 9 forks source link

spinlock unlocked twice in fi_enable gni provider code #1413

Closed soumagne closed 6 years ago

soumagne commented 6 years ago

I'm getting the warning below when running the thread sanitizer with the gni provider. It appears that the fastlock created in gnix_domain_open is unlocked twice when calling fi_enable() on an endpoint. I believe you can reproduce it if you compile with -fanitize=thread and run any example that calls fi_enable(). Sorry my trace could not get all of the libfabric stack because of all the inline/define calls.

WARNING: ThreadSanitizer: unlock of an unlocked mutex (or by a wrong thread) (pid=16531)
    #0 pthread_spin_unlock ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libsanitizer/tsan/tsan_interceptors.cc:1200 (libtsan.so.0+0x00000002cd99)
    #1 fastlock_release include/fi_lock.h:120 (libfabric.so.1+0x0000000bb701)
    #2 na_ofi_endpoint_open /home/soumagne/workspace/mercury/source/src/na/na_ofi.c:1416 (libna_debug.so.0.9.1+0x00000040de35)
    #3 na_ofi_initialize /home/soumagne/workspace/mercury/source/src/na/na_ofi.c:1728 (libna_debug.so.0.9.1+0x00000040ef27)
    #4 NA_Initialize /home/soumagne/workspace/mercury/source/src/na/na.c:363 (libna_debug.so.0.9.1+0x000000405584)
    #5 NA_Test_server_init /home/soumagne/workspace/mercury/source/Testing/na/na_test.c:463 (hg_test_server+0x000000409442)
    #6 HG_Test_server_init /home/soumagne/workspace/mercury/source/Testing/mercury_test.c:382 (hg_test_server+0x000000405188)
    #7 main /home/soumagne/workspace/mercury/source/Testing/test_server.c:65 (hg_test_server+0x000000402b7e)

  Location is heap block of size 360 at 0x7b4800000180 allocated by main thread:
    #0 calloc ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libsanitizer/tsan/tsan_interceptors.cc:606 (libtsan.so.0+0x00000002b196)
    #1 gnix_domain_open prov/gni/src/gnix_dom.c:607 (libfabric.so.1+0x0000000928f7)
    #2 na_ofi_domain_open /home/soumagne/workspace/mercury/source/src/na/na_ofi.c:1131 (libna_debug.so.0.9.1+0x00000040cc8e)
    #3 na_ofi_initialize /home/soumagne/workspace/mercury/source/src/na/na_ofi.c:1719 (libna_debug.so.0.9.1+0x00000040ee5a)
    #4 NA_Initialize /home/soumagne/workspace/mercury/source/src/na/na.c:363 (libna_debug.so.0.9.1+0x000000405584)
    #5 NA_Test_server_init /home/soumagne/workspace/mercury/source/Testing/na/na_test.c:463 (hg_test_server+0x000000409442)
    #6 HG_Test_server_init /home/soumagne/workspace/mercury/source/Testing/mercury_test.c:382 (hg_test_server+0x000000405188)
    #7 main /home/soumagne/workspace/mercury/source/Testing/test_server.c:65 (hg_test_server+0x000000402b7e)

  Mutex M27 (0x7b48000002b0) created at:
    #0 pthread_spin_init ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libsanitizer/tsan/tsan_interceptors.cc:1164 (libtsan.so.0+0x00000002c9f0)
    #1 fastlock_init include/fi_lock.h:85 (libfabric.so.1+0x00000008dcd3)
    #2 na_ofi_domain_open /home/soumagne/workspace/mercury/source/src/na/na_ofi.c:1131 (libna_debug.so.0.9.1+0x00000040cc8e)
    #3 na_ofi_initialize /home/soumagne/workspace/mercury/source/src/na/na_ofi.c:1719 (libna_debug.so.0.9.1+0x00000040ee5a)
    #4 NA_Initialize /home/soumagne/workspace/mercury/source/src/na/na.c:363 (libna_debug.so.0.9.1+0x000000405584)
    #5 NA_Test_server_init /home/soumagne/workspace/mercury/source/Testing/na/na_test.c:463 (hg_test_server+0x000000409442)
    #6 HG_Test_server_init /home/soumagne/workspace/mercury/source/Testing/mercury_test.c:382 (hg_test_server+0x000000405188)
    #7 main /home/soumagne/workspace/mercury/source/Testing/test_server.c:65 (hg_test_server+0x000000402b7e)

SUMMARY: ThreadSanitizer: unlock of an unlocked mutex (or by a wrong thread) include/fi_lock.h:120 in fastlock_release
hppritcha commented 6 years ago

@soumagne I'll take this. Thanks for reporting!

hppritcha commented 6 years ago

probably too late for 1.5.1

hppritcha commented 6 years ago

hmm... I seem to have problems getting configure to work with -fsanitize=thread. Which modules are you using on cori?

soumagne commented 6 years ago

No problem wrt 1.5.1. Sorry I was using theta but I just tried on cori and it seems fine too. Here are the modules that I have loaded.

Currently Loaded Modulefiles:
  1) modules/3.2.10.6
  2) nsg/1.2.0
  3) gcc/6.3.0
  4) craype-haswell
  5) craype-network-aries
  6) craype/2.5.12
  7) cray-mpich/7.6.0
  8) cray-shmem/7.6.0
  9) udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari
 10) ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari
 11) pmi/5.0.12
 12) dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari
 13) gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari
 14) xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari
 15) job/2.2.2-6.0.4.0_8.2__g3c644b5.ari
 16) dvs/2.7_2.2.31-6.0.4.1_6.1__gb3b87e6
 17) alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari
 18) rca/2.2.11-6.0.4.0_13.2__g84de67a.ari
 19) PrgEnv-gnu/6.0.4
 20) git/2.9.1

Also here is what I pass to configure:

./configure --prefix=$HOME/apps/libfabric/1.6.0a1_${hash} --enable-debug \
  --disable-silent-rules --disable-verbs --disable-udp --disable-rxm --disable-rxd \
  --disable-usnic CFLAGS="-fsanitize=thread -fno-omit-frame-pointer"

I assume the test you will run is within the gni provider package itself but if your test is external you only need to pass -fsanitize=thread when you build the actual test/executable (if let's say you were to use the fabtests).

hppritcha commented 6 years ago

if cori is still available, could you try testing against the upstream 1.5.1 release? There's more recent stuff in master that may be causing the problem.

soumagne commented 6 years ago

Hi Howard, I just tried on cori and it seems to be the same. The good news though is that this time it gives the exact location, looks like it is in line 207 of gnix_mr.c. See below.

jsoumagn@nid00445:~/workspace/mercury/build> srun -N 1 ./bin/hg_test_server -c ofi -p gni -H ipogif0
==================
WARNING: ThreadSanitizer: unlock of an unlocked mutex (or by a wrong thread) (pid=39875)
    #0 pthread_spin_unlock ../../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libsanitizer/tsan/tsan_interceptors.cc:1174 (libtsan.so.0+0x00000002bf04)
    #1 __mr_reg prov/gni/src/gnix_mr.c:207 (libfabric.so.1+0x00000007597f)
    #2 gnix_mr_regattr prov/gni/src/gnix_mr.c:321 (libfabric.so.1+0x00000007597f)
    #3 na_ofi_endpoint_open /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1416 (libna_debug.so.0.9.1+0x00000000da78)
    #4 na_ofi_initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1728 (libna_debug.so.0.9.1+0x00000000eb32)
    #5 NA_Initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na.c:363 (libna_debug.so.0.9.1+0x0000000054af)
    #6 NA_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/na/na_test.c:463 (hg_test_server+0x000000408239)
    #7 HG_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/mercury_test.c:386 (hg_test_server+0x000000404d57)
    #8 main /global/homes/j/jsoumagn/workspace/mercury/source/Testing/test_server.c:65 (hg_test_server+0x0000004026f2)

  Location is heap block of size 344 at 0x7d480000bb80 allocated by main thread:
    #0 calloc ../../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libsanitizer/tsan/tsan_interceptors.cc:555 (libtsan.so.0+0x000000029db1)
    #1 gnix_domain_open prov/gni/src/gnix_dom.c:594 (libfabric.so.1+0x00000005b2d4)
    #2 na_ofi_domain_open /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1131 (libna_debug.so.0.9.1+0x00000000c91f)
    #3 na_ofi_initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1719 (libna_debug.so.0.9.1+0x00000000ea6f)
    #4 NA_Initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na.c:363 (libna_debug.so.0.9.1+0x0000000054af)
    #5 NA_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/na/na_test.c:463 (hg_test_server+0x000000408239)
    #6 HG_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/mercury_test.c:386 (hg_test_server+0x000000404d57)
    #7 main /global/homes/j/jsoumagn/workspace/mercury/source/Testing/test_server.c:65 (hg_test_server+0x0000004026f2)

  Mutex M27 (0x7d480000bca8) created at:
    #0 pthread_spin_init ../../../../cray-gcc-6.3.0-201701050407.93fe37becc347/libsanitizer/tsan/tsan_interceptors.cc:1138 (libtsan.so.0+0x00000002bc0b)
    #1 gnix_domain_open prov/gni/src/gnix_dom.c:618 (libfabric.so.1+0x00000005b387)
    #2 na_ofi_domain_open /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1131 (libna_debug.so.0.9.1+0x00000000c91f)
    #3 na_ofi_initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na_ofi.c:1719 (libna_debug.so.0.9.1+0x00000000ea6f)
    #4 NA_Initialize /global/homes/j/jsoumagn/workspace/mercury/source/src/na/na.c:363 (libna_debug.so.0.9.1+0x0000000054af)
    #5 NA_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/na/na_test.c:463 (hg_test_server+0x000000408239)
    #6 HG_Test_server_init /global/homes/j/jsoumagn/workspace/mercury/source/Testing/mercury_test.c:386 (hg_test_server+0x000000404d57)
    #7 main /global/homes/j/jsoumagn/workspace/mercury/source/Testing/test_server.c:65 (hg_test_server+0x0000004026f2)

SUMMARY: ThreadSanitizer: unlock of an unlocked mutex (or by a wrong thread) prov/gni/src/gnix_mr.c:207 in __mr_reg
==================
# Initializing NA with ofi+gni://ipogif0:22222
# Waiting for client...
soumagne commented 6 years ago

If you look there, there is indeed a

 /* call cache register op to retrieve the right entry */
        fastlock_release(&domain->mr_cache_lock);
        fastlock_acquire(&info->mr_cache_lock);
soumagne commented 6 years ago

On master, it is line 285 of gnix_mr.c, line 207 is for tag v1.5.1

hppritcha commented 6 years ago

excellent sleuthing! thanks.