ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/
Other
582 stars 386 forks source link

prov/verbs: Data race in vrb_open_ep function (Libfabric v1.22.0) #10569

Open piotrchmiel opened 4 days ago

piotrchmiel commented 4 days ago

Describe the bug A data race occurs in the vrb_open_ep function in the verbs provider when creating multiple endpoints using multiple threads. The race is detected by Thread Sanitizer and involves concurrent modifications of the global variable vrb_ep_ops. The issue causes unpredictable behavior when using the FI_THREAD_SAFE threading model, which should guarantee safe concurrent operations.

To Reproduce Steps to reproduce the behavior:

  1. Set up an environment with Libfabric 1.22.0, Ubuntu 22.04, and the verbs provider.
  2. Use the FI_THREAD_SAFE threading mode.
  3. Create multiple endpoints (fi_endpoint) from multiple threads simultaneously.
  4. Observe the data race using Thread Sanitizer.

Expected behavior The FI_THREAD_SAFE threading mode should ensure thread safety, allowing multiple threads to create endpoints concurrently without encountering data races.

Output Thread Sanitizer detects the following data race:

WARNING: ThreadSanitizer: data race (pid=76726)
  Write of size 8 at 0x7ff67d69a3b8 by thread T2:
    #0 vrb_open_ep /path/to/libfabric/prov/verbs/src/verbs_ep.c:1397:31 (libfabric.so.1+0x100f27)
    #1 fi_endpoint(fid_domain*, fi_info*, fid_ep**, void*) /path/to/libfabric/include/rdma/fi_endpoint.h:187:9 (application+0x6658b0)

 Previous write of size 8 at 0x7ff67d69a3b8 by thread T4:
    #0 vrb_open_ep /path/to/libfabric/prov/verbs/src/verbs_ep.c:1397:31 (libfabric.so.1+0x100f27)
    #1 fi_endpoint(fid_domain*, fi_info*, fid_ep**, void*) /path/to/libfabric/include/rdma/fi_endpoint.h:187:9 (application+0x6658b0)

The data race occurs at: https://github.com/ofiwg/libfabric/blob/v1.22.0/prov/verbs/src/verbs_ep.c#L1397

This line modifies the global variable vrb_ep_ops:

(*ep_fid)->fid.ops->ops_open = vrb_ep_ops_open;

The global variable vrb_ep_ops is defined here: https://github.com/ofiwg/libfabric/blob/v1.22.0/prov/verbs/src/verbs_ep.c#L1159

The issue was introduced in commit: https://github.com/ofiwg/libfabric/commit/da62d0f79ea7ced3ae81acd90347733c696bf8e8

Environment:

Additional context The issue appears to arise because vrb_ep_ops is a global variable shared across threads, and the modifications are not protected by a mutex or any thread-synchronization mechanism.

This breaks the FI_THREAD_SAFE threading model, where thread safety is expected when using multiple threads concurrently.

Potential Fix: Synchronize access to vrb_ep_ops using a mutex or move to a per-instance structure to avoid shared mutable state.

sydidelot commented 4 days ago

@piotrchmiel Thanks for the bug report! I guess this issue can easily be fixed with the following patch:

diff --git a/prov/verbs/src/verbs_ep.c b/prov/verbs/src/verbs_ep.c
index 63aea8277..ebd7646d9 100644
--- a/prov/verbs/src/verbs_ep.c
+++ b/prov/verbs/src/verbs_ep.c
@@ -1161,7 +1161,7 @@ static struct fi_ops vrb_ep_ops = {
        .close = vrb_ep_close,
        .bind = vrb_ep_bind,
        .control = vrb_ep_control,
-       .ops_open = fi_no_ops_open,
+       .ops_open = vrb_ep_ops_open,
 };

 static struct fi_ops_cm vrb_dgram_cm_ops = {
@@ -1394,7 +1394,6 @@ int vrb_open_ep(struct fid_domain *domain, struct fi_info *info,
        *ep_fid = &ep->util_ep.ep_fid;
        ep->util_ep.ep_fid.fid.ops = &vrb_ep_ops;
        ep->util_ep.ep_fid.ops = &vrb_ep_base_ops;
-       (*ep_fid)->fid.ops->ops_open = vrb_ep_ops_open;

        vrb_prof_func_end("vrb_open_ep");