Thanks for the report - we'll look into the locking.
The ofi_rbxxx() code assumes that locking is provided by the caller. I'm having trouble converting the stack traces into source code. For example, what is na_ofi.c and mercury_core.c? Are these part of the thread sanitizer or a third party (not upstream) provider?
Okay, I think I found those: https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c. So, we don't have trace points for the libfabric code. Anyway, the 2 calls that you mentioned should be enough for me to figure out if there's a missing lock.
I spent quite a bit of time analyzing the socket code. The ring buffer that's used with the CQ appears to have the proper locking around it. No obvious problems hit me based on code inspection. (I'm looking at the tip of master. If this is from a specific version, please let me know which version. There may have been a locking issue in an earlier release that has since been fixed.)
For the fi_readv() call, the ring buffer that is accessed is separate from the CQ ring buffer. The transmit context ring buffer is protected by 2 locks: one serializes reading the ring buffer, while the other serializes writes. A reader and writer may access the ring buffer simultaneously; they each update separate variables. Maybe this is what the thread analyzer is warning about?
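For illustration, the pattern described above looks roughly like the sketch below. This is a simplified model, not the actual fi_rbuf.h or socket provider code; the names and fields are approximations, and wrap-around handling is omitted.

#include <pthread.h>
#include <string.h>

/* Simplified model of the dual-lock ring buffer (hypothetical names).
 * The spinlocks must be initialized with pthread_spin_init() before use. */
struct rb_model {
    char   *buf;
    size_t  size;        /* power of two */
    size_t  size_mask;   /* size - 1 */
    size_t  wpos, wcnt;  /* writer-side position and commit count, under wlock */
    size_t  rcnt;        /* reader-side count, under rlock */
    pthread_spinlock_t wlock;  /* serializes writers against other writers */
    pthread_spinlock_t rlock;  /* serializes readers against other readers */
};

/* Writer path: only wlock is held; wpos and wcnt are updated. */
static void rb_model_write(struct rb_model *rb, const void *data, size_t len)
{
    pthread_spin_lock(&rb->wlock);
    memcpy(rb->buf + (rb->wpos & rb->size_mask), data, len); /* wrap omitted */
    rb->wpos += len;
    rb->wcnt = rb->wpos;       /* "commit": publish the new data */
    pthread_spin_unlock(&rb->wlock);
}

/* Reader path: only rlock is held; rcnt is updated.  Note that the
 * emptiness check reads wcnt without holding wlock, which is the kind of
 * access the thread analyzer reports. */
static int rb_model_read(struct rb_model *rb, void *data, size_t len)
{
    int ret = -1;
    pthread_spin_lock(&rb->rlock);
    if (rb->rcnt != rb->wcnt) {          /* "is the buffer empty?" */
        memcpy(data, rb->buf + (rb->rcnt & rb->size_mask), len);
        rb->rcnt += len;
        ret = 0;
    }
    pthread_spin_unlock(&rb->rlock);
    return ret;
}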
If you can provide stack traces that go further into the code, and a link to the mercury code shown in the stack trace (so that the line numbers align), it may help isolate the problem or confirm if the warning can be ignored. Thanks!
Sorry, I had somehow missed your reply. There is more stack trace than necessary and it shows calls in our layer, but na_ofi.c is our transport plugin that uses libfabric; I just kept the whole stack trace to make it easier to understand (you got the correct source file, by the way). Sorry, I had also left 2 extra lines of debug code in, which is why the line numbers were not exactly matching. ofi_rbpeek is called by fi_cq_read at line 3292 of na_ofi.c, and ofi_rbwrite is called by fi_readv at line 3007 of na_ofi.c.
I've removed my debug code and ran the trace again so that the line numbers exactly match, and I've updated the original trace. I can't seem to get more detail out of the OFI stack though, even though I'm passing -g -O0... I'm trying to see if I can get more information.
Well, for some reason I was not able to get more context using the GCC thread sanitizer, so I had to switch to Intel Inspector, which was able to give me more context; hopefully that helps. In this case the problem appears between an fi_tsend and an fi_cq_read, but I assume it is basically the same issue as the one reported for the fi_readv. Below is the full trace.
/home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h(165): Error X59: Write: Function ofi_rbwrite: Module /opt/ofi/git/lib/libfabric.so.1
Code snippet:
163 endlen = rb->size - (rb->wpos & rb->size_mask);
164 if (len <= endlen) {
>165 memcpy((char*)rb->buf + (rb->wpos & rb->size_mask), buf, len);
166 } else {
167 memcpy((char*)rb->buf + (rb->wpos & rb->size_mask), buf, endlen);
Stack (1 of 1 instance(s))
>libfabric.so.1!ofi_rbwrite - /home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h:165
libfabric.so.1!sock_tx_ctx_write - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_ctx.c:167
libfabric.so.1!sock_ep_tsendmsg - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_msg.c:614
libfabric.so.1!sock_ep_tsend - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_msg.c:643
libna_debug.so.0.9.1!fi_tsend - /opt/ofi/git/include/rdma/fi_tagged.h:116
libna_debug.so.0.9.1!na_ofi_msg_send_expected - /home/jsoumagne/workspace/mercury/source/src/na/na_ofi.c:2568
libna_debug.so.0.9.1!NA_Msg_send_expected - /home/jsoumagne/workspace/mercury/source/src/na/na.c:1223
libmercury_debug.so.0.9.1!hg_core_respond_na - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:1784
libmercury_debug.so.0.9.1!HG_Core_respond - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:3982
libmercury_debug.so.0.9.1!HG_Respond - /home/jsoumagne/workspace/mercury/source/src/mercury.c:1424
hg_test_server!hg_test_bulk_transfer_cb - /home/jsoumagne/workspace/mercury/source/Testing/mercury_rpc_cb.c:310
libmercury_debug.so.0.9.1!hg_bulk_trigger_entry - /home/jsoumagne/workspace/mercury/source/src/mercury_bulk.c:899
libmercury_debug.so.0.9.1!hg_core_trigger - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:2639
libmercury_debug.so.0.9.1!HG_Core_trigger - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:4031
libmercury_debug.so.0.9.1!HG_Trigger - /home/jsoumagne/workspace/mercury/source/src/mercury.c:1447
hg_test_server!main - /home/jsoumagne/workspace/mercury/source/Testing/test_server.c:74
hg_test_server!_start - /home/jsoumagne/workspace/mercury/build/bin/hg_test_server:0x31c4
/home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h(189): Error X60: Read: Function ofi_rbpeek: Module /opt/ofi/git/lib/libfabric.so.1
Code snippet:
187 endlen = rb->size - (rb->rcnt & rb->size_mask);
188 if (len <= endlen) {
>189 memcpy(buf, (char*)rb->buf + (rb->rcnt & rb->size_mask), len);
190 } else {
191 memcpy(buf, (char*)rb->buf + (rb->rcnt & rb->size_mask), endlen);
Stack (1 of 1 instance(s))
>libfabric.so.1!ofi_rbpeek - /home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h:189
libfabric.so.1!ofi_rbread - /home/jsoumagne/workspace/libfabric/./include/fi_rbuf.h:198
libfabric.so.1!sock_pe_new_tx_entry - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_progress.c:2145
libfabric.so.1!sock_pe_progress_tx_ctx - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_progress.c:2481
libfabric.so.1!sock_cq_progress - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_cq.c:123
libfabric.so.1!sock_cq_sreadfrom - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_cq.c:358
libfabric.so.1!sock_cq_readfrom - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_cq.c:407
libfabric.so.1!sock_cq_read - /home/jsoumagne/workspace/libfabric/prov/sockets/src/sock_cq.c:412
libna_debug.so.0.9.1!fi_cq_read - /opt/ofi/git/include/rdma/fi_eq.h:385
libna_debug.so.0.9.1!na_ofi_progress - /home/jsoumagne/workspace/mercury/source/src/na/na_ofi.c:3292
libna_debug.so.0.9.1!NA_Progress - /home/jsoumagne/workspace/mercury/source/src/na/na.c:1829
libmercury_debug.so.0.9.1!hg_core_progress_na - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:2474
libmercury_debug.so.0.9.1!HG_Core_progress - /home/jsoumagne/workspace/mercury/source/src/mercury_core.c:4008
libmercury_debug.so.0.9.1!HG_Progress - /home/jsoumagne/workspace/mercury/source/src/mercury.c:1439
hg_test_server!hg_progress_thread - /home/jsoumagne/workspace/mercury/source/Testing/test_server.c:42
Thanks - this is helpful for identifying where it thinks there's a problem.
The ring buffer uses 2 variables to allow parallel reading and writing of the buffer. In the code snippets, these show up as wcnt and rcnt. Each cnt is protected by a separate lock. The location in the ring buffer being accessed should be different. Unless there's something off with the logic, I think the code is okay. Are these just warnings coming from the tools, or are you seeing an actual problem with the app?
No, I haven't noticed any app issue so far, just warnings from the thread sanitizer that I see in our nightly tests. I was mainly wondering whether this could be a potential problem (it would probably be relatively hard to detect if it ever caused an issue) and trying to understand why it throws these particular warnings. Usually GCC's thread sanitizer recognizes stdatomic, etc.
I can completely see how an analyzer could flag this as a potential issue. Access to the ring buffer 'buf' is possible from two threads; however, they should access different portions of the buffer based on the wcnt and rcnt values. I'm not sure an analyzer can detect that those differ, and at least from the reports, that's my guess as to what's being flagged.
If the analyzer is indicating that the same address is being read simultaneously while it's being written, that would be an issue and would show a flaw in how wcnt and rcnt are being updated. But reading the initial report, it looks like the complaint is that it detected a read after a write to the same address, with disjoint mutexes. The use of separate mutexes is intentional.
Alright, I think I found what the problem is. I could not figure it out by simply looking at the code, so I created a reproducer for it and copied/pasted parts of the code. It looks like the problem comes from a missing fastlock_acquire around the check for ofi_rbempty in sock_pe_progress_tx_ctx. There should be a fastlock_acquire(&tx_ctx->wlock); there as well, because that check reads the value of wcnt, which is modified in ofi_rbcommit; ofi_rbcommit is called in sock_tx_ctx_commit, where there is only a fastlock_acquire of tx_ctx->wlock from the sock_ep_tsendmsg call.
Anyway, attached is the reproducer, ofi_rbuf_test.txt (I had to make it a .txt file to be able to attach it to GitHub, sorry). Compile it with gcc -O0 -g -fsanitize=thread and run it to see the exact same type of errors from the sanitizer; then uncomment lines 136 and 138, recompile, and the sanitizer warnings go away. So unless I missed something, I believe the same fix should be applied in the sock_pe_progress_tx_ctx function.
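For readers who cannot open the attachment, a hypothetical reduction along the same lines is sketched below. This is not the actual ofi_rbuf_test.txt; the commented-out lines here merely stand in for the kind of change described above. With them commented out, ThreadSanitizer reports a race on the commit counter; with them uncommented, the report goes away.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical reduction: a writer commits a counter under wlock while a
 * reader polls it under rlock only.  Build with
 *   gcc -O0 -g -fsanitize=thread race.c -lpthread */
static size_t wcnt, rcnt;
static pthread_spinlock_t wlock, rlock;

static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_spin_lock(&wlock);
        wcnt++;                          /* models ofi_rbcommit */
        pthread_spin_unlock(&wlock);
    }
    return NULL;
}

static void *reader(void *arg)
{
    size_t seen = 0;
    (void)arg;
    while (seen < 100000) {
        pthread_spin_lock(&rlock);
        /* pthread_spin_lock(&wlock); */
        if (rcnt != wcnt) {              /* models the ofi_rbempty check */
            rcnt++;
            seen++;
        }
        /* pthread_spin_unlock(&wlock); */
        pthread_spin_unlock(&rlock);
    }
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_spin_init(&wlock, PTHREAD_PROCESS_PRIVATE);
    pthread_spin_init(&rlock, PTHREAD_PROCESS_PRIVATE);
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    printf("done: rcnt=%zu wcnt=%zu\n", rcnt, wcnt);
    return 0;
}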
The wlock provides write protection for wcnt; likewise, rlock provides write protection for rcnt. The assumption is that it's okay to read wcnt or rcnt without holding either lock: the surrounding code must be correct whether it sees the old value or the new value. That is, any race in reading rcnt or wcnt exists regardless of whether locks are taken with respect to a writer; the reader could just as easily have checked the value prior to it being updated as after.
The rbempty() check reads wcnt to see if a command has been written into the ring buffer. The reader will see either the old or the new wcnt value: in the first case, the reader simply continues; in the latter case, the new data is processed by the reader.
Maybe the assumption that a reader can access the value of wcnt or rcnt without serializing with a writer is wrong. But the use still looks okay to me. In any case, I'm posting this thread to the mailing list in order to get input from other developers on whether the code is safe as written or the thread sanitizer is identifying a real problem. Thanks for identifying the cause of the report!
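For context, the rbempty() check under discussion boils down to a comparison of the two counters. The snippet below is a paraphrase, not the exact fi_rbuf.h code; the load of wcnt here is the access that happens without wlock in the progress path.

#include <stddef.h>

/* Minimal paraphrase of the counters involved (hypothetical type). */
struct rb_counters {
    size_t rcnt;   /* advanced by the reader, under rlock */
    size_t wcnt;   /* advanced by the writer's commit, under wlock */
};

/* Empty when the reader has consumed everything that has been committed.
 * Called with only rlock held, so the read of wcnt is not serialized
 * against the writer's update. */
static inline int rb_is_empty(const struct rb_counters *rb)
{
    return rb->rcnt == rb->wcnt;
}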
Yes, I agree with that, and it is probably okay as far as I can tell, although to be honest the cleanest way of doing it would be to use atomics and get rid of the locks :) The annoying thing is that this triggers dozens of sanitizer warnings in my code and I had no idea where they were coming from; see https://cdash.hdfgroup.org/viewDynamicAnalysisFile.php?id=4860 for example.
Anyway, for reference, I just tried the fix on the socket code itself this time, and all the sanitizer warnings I was seeing go away after it, so at least that means there does not seem to be any other place in the code, for now, where this issue shows up.
--- a/prov/sockets/src/sock_progress.c
+++ b/prov/sockets/src/sock_progress.c
@@ -2460,6 +2460,7 @@ int sock_pe_progress_tx_ctx(struct sock_pe *pe, struct sock_tx_ctx *tx_ctx)
int ret = 0;
struct dlist_entry *entry;
struct sock_pe_entry *pe_entry;
+ int rb_empty;
fastlock_acquire(&pe->lock);
@@ -2477,7 +2478,10 @@ int sock_pe_progress_tx_ctx(struct sock_pe *pe, struct sock_tx_ctx *tx_ctx)
}
fastlock_acquire(&tx_ctx->rlock);
- if (!ofi_rbempty(&tx_ctx->rb) && !dlist_empty(&pe->free_list)) {
+ fastlock_acquire(&tx_ctx->wlock);
+ rb_empty = ofi_rbempty(&tx_ctx->rb);
+ fastlock_release(&tx_ctx->wlock);
+ if (!rb_empty && !dlist_empty(&pe->free_list)) {
ret = sock_pe_new_tx_entry(pe, tx_ctx);
}
fastlock_release(&tx_ctx->rlock);
Thanks!
Response from the mailing list:
For reference, the ring buffer is written to allow simultaneous reading and writing. Serialization of readers and writers is left up to the caller.
Reading from memory that is being written by another CPU, without locking, must go through stdatomic.
The problem with the code as written is this:
pthread_spin_lock(&my_ringbuf_ptr->wlock);
ofi_rbwrite(&my_ringbuf_ptr->ringbuf, &data,
sizeof(int));
ofi_rbcommit(&my_ringbuf_ptr->ringbuf);
pthread_spin_unlock(&my_ringbuf_ptr->wlock);
To work properly, the store to rb->wcnt must be observed after the writes in rbwrite.
The spin lock does not guarantee this; it only guarantees that another thread that also grabs the lock will see the above happen atomically.
In other words, on the read side we have:
// pthread_spin_lock(&my_ringbuf_ptr->wlock);
empty = ofi_rbempty(&my_ringbuf_ptr->ringbuf);
// pthread_spin_unlock(&my_ringbuf_ptr->wlock);
And, as written, it is allowed for this CPU to observe rb->wcnt being updated before ofi_rbwrite has happened, and this would be a malfunction.
You must either wrap all reads and writes in a lock to gain serialization, or you must use stdatomic for the store and load sides of the cnt values with an appropriate strength ordering mode.
I'm not sure what your 'fastlock' is these days, but it should also be expressed in terms of stdatomic; see how I reworked the similar code in rdma-core.
Jason
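For illustration, here is a minimal sketch of the stdatomic approach Jason describes, assuming C11 atomics; the names are hypothetical and this is not the fix that was ultimately applied to the provider.

#include <stdatomic.h>
#include <stddef.h>

/* The commit counter is made atomic so the reader can observe it without
 * a lock; release/acquire ordering guarantees that the payload written
 * before the commit is visible once the new count is seen. */
struct rb_atomic {
    _Atomic size_t wcnt;   /* committed write count */
    size_t         rcnt;   /* reader-private, still protected by rlock */
};

/* Writer side, after copying the payload into the buffer: */
static void rb_commit(struct rb_atomic *rb, size_t new_wcnt)
{
    /* release: the payload stores above cannot be reordered past this store */
    atomic_store_explicit(&rb->wcnt, new_wcnt, memory_order_release);
}

/* Reader side, replacing the unlocked emptiness check: */
static int rb_empty(struct rb_atomic *rb)
{
    /* acquire: if we see the new count, we also see the payload it covers */
    return rb->rcnt == atomic_load_explicit(&rb->wcnt, memory_order_acquire);
}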
I agree with Jason's analysis, so I will look at fixing this. The most obvious, straightforward fix would be to just use a single lock in the socket provider.
Thanks for the follow-up! In any case, I think that if you could get rid of the locking in the progress path, it would also just be better performance-wise.
Asking @vkrishna to look at this. He will be taking over maintaining the socket provider over time.
When using the thread sanitizer with the socket provider, we're getting several warnings coming from the libfabric library. Looking at the code, it seems that ofi_rbpeek/ofi_rbwrite is not protected? Note that in this particular example, we have one thread doing progress (i.e. calling fi_cq_read()) and the other thread calling fi_readv(). There are some more warnings, but I'll go one at a time for now. Also, the thread safety requested in our code is FI_THREAD_UNSPEC, which it looks like defaults to FI_THREAD_SAFE anyway. (For reference: the NA OFI code from the stack trace is located here.)