Thanks for the report - we'll look into the locking.
The ofi_rbxxx() code assumes that locking is provided by the caller. I'm having trouble converting the stack traces into source code. For example, what is na_ofi.c and mercury_core.c? Are these part of the thread sanitizer or a third party (not upstream) provider?
Okay, I think I found those: https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c. So, we don't have trace points for the libfabric code. Anyway, the 2 calls that you mentioned should be enough for me to figure out if there's a missing lock.
I spent quite a bit of time analyzing the socket code. The ring buffer that's used with the CQ appears to have the proper locking around it. No obvious problems hit me based on code inspection. (I'm looking at the tip of master. If this is from a specific version, please let me know which version. There may have been a locking issue in an earlier release that has since been fixed.)
For the fi_readv() call, the ring buffer that is accessed is separate from the CQ ring buffer. The transmit context ring buffer is protected by 2 locks: one serializes reading the ring buffer, while the other serializes writes. A reader and writer may access the ring buffer simultaneously; they each update separate variables. Maybe this is what the thread analyzer is warning about?
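For illustration, the pattern described above looks roughly like the sketch below. This is a simplified model, not the actual fi_rbuf.h or socket provider code; the names and fields are approximations, and wrap-around handling is omitted.

#include <pthread.h>
#include <string.h>

/* Simplified model of the dual-lock ring buffer (hypothetical names).
 * The spinlocks must be initialized with pthread_spin_init() before use. */
struct rb_model {
    char   *buf;
    size_t  size;        /* power of two */
    size_t  size_mask;   /* size - 1 */
    size_t  wpos, wcnt;  /* writer-side position and commit count, under wlock */
    size_t  rcnt;        /* reader-side count, under rlock */
    pthread_spinlock_t wlock;  /* serializes writers against other writers */
    pthread_spinlock_t rlock;  /* serializes readers against other readers */
};

/* Writer path: only wlock is held; wpos and wcnt are updated. */
static void rb_model_write(struct rb_model *rb, const void *data, size_t len)
{
    pthread_spin_lock(&rb->wlock);
    memcpy(rb->buf + (rb->wpos & rb->size_mask), data, len); /* wrap omitted */
    rb->wpos += len;
    rb->wcnt = rb->wpos;       /* "commit": publish the new data */
    pthread_spin_unlock(&rb->wlock);
}

/* Reader path: only rlock is held; rcnt is updated.  Note that the
 * emptiness check reads wcnt without holding wlock, which is the kind of
 * access the thread analyzer reports. */
static int rb_model_read(struct rb_model *rb, void *data, size_t len)
{
    int ret = -1;
    pthread_spin_lock(&rb->rlock);
    if (rb->rcnt != rb->wcnt) {          /* "is the buffer empty?" */
        memcpy(data, rb->buf + (rb->rcnt & rb->size_mask), len);
        rb->rcnt += len;
        ret = 0;
    }
    pthread_spin_unlock(&rb->rlock);
    return ret;
}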
If you can provide stack traces that go further into the code, and a link to the mercury code shown in the stack trace (so that the line numbers align), it may help isolate the problem or confirm if the warning can be ignored. Thanks!
Sorry, I had somehow missed your reply. There is more stack trace than necessary and it shows calls in our layer, but na_ofi.c is our transport plugin that uses libfabric; I just kept the whole stack trace to make it easier to understand (you got the correct source file, by the way). Sorry, I had also left 2 extra lines of debug code in, which is why the line numbers were not exactly matching. ofi_rbpeek is called by fi_cq_read at line 3292 of na_ofi.c, and ofi_rbwrite is called by fi_readv at line 3007 of na_ofi.c.
I've removed my debug code and ran the trace again so that the line numbers exactly match, and I've updated the original trace. I can't seem to get more detail out of the OFI stack though, even though I'm passing -g -O0... I'm trying to see if I can get more information.
Well, for some reason I was not able to get more context using the GCC thread sanitizer, so I had to switch to Intel Inspector, which was able to give me more context; hopefully that helps. In this case the problem appears between an fi_tsend and an fi_cq_read, but I assume it is basically the same issue as the one reported for the fi_readv. Below is the full trace.
/home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h(165): Error X59: Write: Function ofi_rbwrite: Module /opt/ofi/git/lib/libfabric.so.1
Code snippet:
163 endlen = rb->size - (rb->wpos & rb->size_mask);
164 if (len <= endlen) {
>165 memcpy((char*)rb->buf + (rb->wpos & rb->size_mask), buf, len);
166 } else {
167 memcpy((char*)rb->buf + (rb->wpos & rb->size_mask), buf, endlen);
Stack (1 of 1 instance(s))
>libfabric.so.1!ofi_rbwrite - /home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h:165
libfabric.so.1!sock_tx_ctx_write - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_ctx.c:167
libfabric.so.1!sock_ep_tsendmsg - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_msg.c:614
libfabric.so.1!sock_ep_tsend - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_msg.c:643
libna_debug.so.0.9.1!fi_tsend - /opt/ofi/git/include/rdma/fi_tagged.h:116
libna_debug.so.0.9.1!na_ofi_msg_send_expected - /home/jsoumagne/workspace/mercury/source/src/na/na_ofi.c:2568
libna_debug.so.0.9.1!NA_Msg_send_expected - /home/jsoumagne/workspace/mercury/source/src/na/na.c:1223
libmercury_debug.so.0.9.1!hg_core_respond_na - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:1784
libmercury_debug.so.0.9.1!HG_Core_respond - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:3982
libmercury_debug.so.0.9.1!HG_Respond - /home/jsoumagne/workspace/mercury/source/src/mercury.c:1424
hg_test_server!hg_test_bulk_transfer_cb - /home/jsoumagne/workspace/mercury/source/Testing/mercury_rpc_cb.c:310
libmercury_debug.so.0.9.1!hg_bulk_trigger_entry - /home/jsoumagne/workspace/mercury/source/src/mercury_bulk.c:899
libmercury_debug.so.0.9.1!hg_core_trigger - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:2639
libmercury_debug.so.0.9.1!HG_Core_trigger - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:4031
libmercury_debug.so.0.9.1!HG_Trigger - /home/jsoumagne/workspace/mercury/source/src/mercury.c:1447
hg_test_server!main - /home/jsoumagne/workspace/mercury/source/Testing/test_server.c:74
hg_test_server!_start - /home/jsoumagne/workspace/mercury/build/bin/hg_test_server:0x31c4
/home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h(189): Error X60: Read: Function ofi_rbpeek: Module /opt/ofi/git/lib/libfabric.so.1
Code snippet:
187 endlen = rb->size - (rb->rcnt & rb->size_mask);
188 if (len <= endlen) {
>189 memcpy(buf, (char*)rb->buf + (rb->rcnt & rb->size_mask), len);
190 } else {
191 memcpy(buf, (char*)rb->buf + (rb->rcnt & rb->size_mask), endlen);
Stack (1 of 1 instance(s))
>libfabric.so.1!ofi_rbpeek - /home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h:189
libfabric.so.1!ofi_rbread - /home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h:198
libfabric.so.1!sock_pe_new_tx_entry - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_progress.c:2145
libfabric.so.1!sock_pe_progress_tx_ctx - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_progress.c:2481
libfabric.so.1!sock_cq_progress - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_cq.c:123
libfabric.so.1!sock_cq_sreadfrom - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_cq.c:358
libfabric.so.1!sock_cq_readfrom - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_cq.c:407
libfabric.so.1!sock_cq_read - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_cq.c:412
libna_debug.so.0.9.1!fi_cq_read - /opt/ofi/git/include/rdma/fi_eq.h:385
libna_debug.so.0.9.1!na_ofi_progress - /home/jsoumagne/workspace/mercury/source/src/na/na_ofi.c:3292
libna_debug.so.0.9.1!NA_Progress - /home/jsoumagne/workspace/mercury/source/src/na/na.c:1829
libmercury_debug.so.0.9.1!hg_core_progress_na - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:2474
libmercury_debug.so.0.9.1!HG_Core_progress - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:4008
libmercury_debug.so.0.9.1!HG_Progress - /home/jsoumagne/workspace/mercury/source/src/mercury.c:1439
hg_test_server!hg_progress_thread - /home/jsoumagne/workspace/mercury/source/Testing/test_server.c:42
Thanks - this is helpful for identifying where it thinks there's a problem.
The ring buffer uses 2 variables to allow parallel reading and writing of the buffer. In the code snippets, these show up as wcnt and rcnt. Each cnt is protected by a separate lock. The location in the ring buffer being accessed should be different. Unless there's something off with the logic, I think the code is okay. Are these just warnings coming from the tools, or are you seeing an actual problem with the app?
No, I haven't noticed any app issue so far, just warnings from the thread sanitizer that I see in our nightly tests. I was mainly wondering whether this could be a potential problem (it would probably be relatively hard to detect if it ever caused an issue) and trying to understand why it throws these particular warnings. Usually GCC's thread sanitizer recognizes stdatomic, etc.
I can completely see how an analyzer could flag this as a potential issue. Access to the ring buffer 'buf' is possible from two threads; however, they should access different portions of the buffer based on the wcnt and rcnt values. I'm not sure an analyzer can detect that those differ, and at least from the reports, that's my guess as to what's being flagged.
If the analyzer is indicating that the same address is being read simultaneously while it's being written, that would be an issue and would show a flaw in how wcnt and rcnt are being updated. But reading the initial report, it looks like the complaint is that it detected a read after a write to the same address, with disjoint mutexes. The use of separate mutexes is intentional.
Alright, I think I found what the problem is. I could not figure it out by simply looking at the code, so I created a reproducer for it and copied/pasted parts of the code. It looks like the problem comes from a missing fastlock_acquire around the check for ofi_rbempty in sock_pe_progress_tx_ctx. There should be a fastlock_acquire(&tx_ctx->wlock); there as well, because that check reads the value of wcnt, which is modified in ofi_rbcommit; ofi_rbcommit is called in sock_tx_ctx_commit, where there is only a fastlock_acquire of tx_ctx->wlock from the sock_ep_tsendmsg call.
Anyway, attached is the reproducer, ofi_rbuf_test.txt (I had to make it a .txt file to be able to attach it to GitHub, sorry). Compile it with gcc -O0 -g -fsanitize=thread and run it to see the exact same type of errors from the sanitizer; then uncomment lines 136 and 138, recompile, and the sanitizer warnings go away. So unless I missed something, I believe the same fix should be applied in the sock_pe_progress_tx_ctx function.
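For readers who cannot open the attachment, a hypothetical reduction along the same lines is sketched below. This is not the actual ofi_rbuf_test.txt; the commented-out lines here merely stand in for the kind of change described above. With them commented out, ThreadSanitizer reports a race on the commit counter; with them uncommented, the report goes away.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical reduction: a writer commits a counter under wlock while a
 * reader polls it under rlock only.  Build with
 *   gcc -O0 -g -fsanitize=thread race.c -lpthread */
static size_t wcnt, rcnt;
static pthread_spinlock_t wlock, rlock;

static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_spin_lock(&wlock);
        wcnt++;                          /* models ofi_rbcommit */
        pthread_spin_unlock(&wlock);
    }
    return NULL;
}

static void *reader(void *arg)
{
    size_t seen = 0;
    (void)arg;
    while (seen < 100000) {
        pthread_spin_lock(&rlock);
        /* pthread_spin_lock(&wlock); */
        if (rcnt != wcnt) {              /* models the ofi_rbempty check */
            rcnt++;
            seen++;
        }
        /* pthread_spin_unlock(&wlock); */
        pthread_spin_unlock(&rlock);
    }
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_spin_init(&wlock, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_init(&rlock, PTHREAD_PROCESS_PRIVATE);
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    printf("done: rcnt=%zu wcnt=%zu\n", rcnt, wcnt);
    return 0;
}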
The wlock provides write protection for wcnt; likewise, rlock provides write protection for rcnt. The assumption is that it's okay to read wcnt or rcnt without holding either lock: the surrounding code must be correct whether it sees the old value or the new value. That is, any race in reading rcnt or wcnt exists regardless of whether locks are taken with respect to a writer; the reader could just as easily have checked the value prior to it being updated as after.
The rbempty() check reads wcnt to see if a command has been written into the ring buffer. The reader will see either the old or the new wcnt value: in the first case, the reader simply continues; in the latter case, the new data is processed by the reader.
Maybe the assumption that a reader can access the value of wcnt or rcnt without serializing with a writer is wrong. But the use still looks okay to me. In any case, I'm posting this thread to the mailing list in order to get input from other developers on whether the code is safe as written or the thread sanitizer is identifying a real problem. Thanks for identifying the cause of the report!
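For context, the rbempty() check under discussion boils down to a comparison of the two counters. The snippet below is a paraphrase, not the exact fi_rbuf.h code; the load of wcnt here is the access that happens without wlock in the progress path.

#include <stddef.h>

/* Minimal paraphrase of the counters involved (hypothetical type). */
struct rb_counters {
    size_t rcnt;   /* advanced by the reader, under rlock */
    size_t wcnt;   /* advanced by the writer's commit, under wlock */
};

/* Empty when the reader has consumed everything that has been committed.
 * Called with only rlock held, so the read of wcnt is not serialized
 * against the writer's update. */
static inline int rb_is_empty(const struct rb_counters *rb)
{
    return rb->rcnt == rb->wcnt;
}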
Yes, I agree with that, and it is probably okay as far as I can tell, although to be honest the cleanest way of doing it would be to use atomics and get rid of the locks :) The annoying thing is that this triggers dozens of sanitizer warnings in my code and I had no idea where they were coming from; see https://cdash.hdfgroup.org/viewDynamicAnalysisFile.php?id=4860 for example.
Anyway, for reference, I just tried the fix on the socket code itself this time, and all the sanitizer warnings I was seeing go away after it, so at least that means there does not seem to be any other place in the code, for now, where this issue shows up.
--- a/prov/sockets/src/sock_progress.c
+++ b/prov/sockets/src/sock_progress.c
@@ -2460,6 +2460,7 @@ int sock_pe_progress_tx_ctx(struct sock_pe *pe, struct sock_tx_ctx *tx_ctx)
int ret = 0;
struct dlist_entry *entry;
struct sock_pe_entry *pe_entry;
+ int rb_empty;
fastlock_acquire(&pe->lock);
@@ -2477,7 +2478,10 @@ int sock_pe_progress_tx_ctx(struct sock_pe *pe, struct sock_tx_ctx *tx_ctx)
}
fastlock_acquire(&tx_ctx->rlock);
- if (!ofi_rbempty(&tx_ctx->rb) && !dlist_empty(&pe->free_list)) {
+ fastlock_acquire(&tx_ctx->wlock);
+ rb_empty = ofi_rbempty(&tx_ctx->rb);
+ fastlock_release(&tx_ctx->wlock);
+ if (!rb_empty && !dlist_empty(&pe->free_list)) {
ret = sock_pe_new_tx_entry(pe, tx_ctx);
}
fastlock_release(&tx_ctx->rlock);
Thanks!
Response from the mailing list:
For reference, the ring buffer is written to allow simultaneous reading and writing. Serialization of readers and writers is left up to the caller.
Reading from memory that is being written by another CPU, without locking, must go through stdatomic.
The problem with the code as written is this:
pthread_spin_lock(&my_ringbuf_ptr->wlock);
ofi_rbwrite(&my_ringbuf_ptr->ringbuf, &data,
sizeof(int));
ofi_rbcommit(&my_ringbuf_ptr->ringbuf);
pthread_spin_unlock(&my_ringbuf_ptr->wlock);
To work properly, the store to rb->wcnt must be observed after the writes in rbwrite.
The spin lock does not guarantee this; it only guarantees that another thread that also grabs the lock will see the above happen atomically.
In other words, on the read side we have:
// pthread_spin_lock(&my_ringbuf_ptr->wlock);
empty = ofi_rbempty(&my_ringbuf_ptr->ringbuf);
// pthread_spin_unlock(&my_ringbuf_ptr->wlock);
And, as written, it is allowed for this CPU to observe rb->wcnt being updated before ofi_rbwrite has happened, and this would be a malfunction.
You must either wrap all reads and writes in a lock to gain serialization, or you must use stdatomic for the store and load sides of the cnt values with an appropriate strength ordering mode.
I'm not sure what your 'fastlock' is these days, but it should also be expressed in terms of stdatomic; see how I reworked the similar code in rdma-core.
Jason
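For illustration, here is a minimal sketch of the stdatomic approach Jason describes, assuming C11 atomics; the names are hypothetical and this is not the fix that was ultimately applied to the provider.

#include <stdatomic.h>
#include <stddef.h>

/* The commit counter is made atomic so the reader can observe it without
 * a lock; release/acquire ordering guarantees that the payload written
 * before the commit is visible once the new count is seen. */
struct rb_atomic {
    _Atomic size_t wcnt;   /* committed write count */
    size_t         rcnt;   /* reader-private, still protected by rlock */
};

/* Writer side, after copying the payload into the buffer: */
static void rb_commit(struct rb_atomic *rb, size_t new_wcnt)
{
    /* release: the payload stores above cannot be reordered past this store */
    atomic_store_explicit(&rb->wcnt, new_wcnt, memory_order_release);
}

/* Reader side, replacing the unlocked emptiness check: */
static int rb_empty(struct rb_atomic *rb)
{
    /* acquire: if we see the new count, we also see the payload it covers */
    return rb->rcnt == atomic_load_explicit(&rb->wcnt, memory_order_acquire);
}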
I agree with Jason's analysis, so I will look at fixing this. The most obvious, straightforward fix would be to just use a single lock in the socket provider.
Thanks for the follow-up! In any case, I think that if you could get rid of the locking in the progress path, it would also just be better performance-wise.
Asking @vkrishna to look at this. He will be taking over maintaining the socket provider over time.
When using the thread sanitizer with the socket provider, we're getting several warnings coming from the libfabric library. Looking at the code, it seems that ofi_rbpeek/ofi_rbwrite is not protected? Note that in this particular example, we have one thread doing progress (i.e. calling fi_cq_read()) and the other thread calling fi_readv(). There are some more warnings, but I'll go one at a time for now. Also, the thread safety requested in our code is FI_THREAD_UNSPEC, which it looks like defaults to FI_THREAD_SAFE anyway. (For reference: the NA OFI code from the stack trace is located here.)