Open piotrchmiel opened 4 days ago
@piotrchmiel Thanks for the bug report! I guess this issue can easily be fixed with the following patch:
diff --git a/prov/verbs/src/verbs_ep.c b/prov/verbs/src/verbs_ep.c
index 63aea8277..ebd7646d9 100644
--- a/prov/verbs/src/verbs_ep.c
+++ b/prov/verbs/src/verbs_ep.c
@@ -1161,7 +1161,7 @@ static struct fi_ops vrb_ep_ops = {
.close = vrb_ep_close,
.bind = vrb_ep_bind,
.control = vrb_ep_control,
- .ops_open = fi_no_ops_open,
+ .ops_open = vrb_ep_ops_open,
};
static struct fi_ops_cm vrb_dgram_cm_ops = {
@@ -1394,7 +1394,6 @@ int vrb_open_ep(struct fid_domain *domain, struct fi_info *info,
*ep_fid = &ep->util_ep.ep_fid;
ep->util_ep.ep_fid.fid.ops = &vrb_ep_ops;
ep->util_ep.ep_fid.ops = &vrb_ep_base_ops;
- (*ep_fid)->fid.ops->ops_open = vrb_ep_ops_open;
vrb_prof_func_end("vrb_open_ep");
Describe the bug A data race occurs in the vrb_open_ep function in the verbs provider when creating multiple endpoints using multiple threads. The race is detected by Thread Sanitizer and involves concurrent modifications of the global variable vrb_ep_ops. The issue causes unpredictable behavior when using the FI_THREAD_SAFE threading model, which should guarantee safe concurrent operations.
To Reproduce Steps to reproduce the behavior:
Expected behavior The FI_THREAD_SAFE threading mode should ensure thread safety, allowing multiple threads to create endpoints concurrently without encountering data races.
Output Thread Sanitizer detects the following data race:
The data race occurs at: https://github.com/ofiwg/libfabric/blob/v1.22.0/prov/verbs/src/verbs_ep.c#L1397
This line modifies the global variable vrb_ep_ops:
The global variable vrb_ep_ops is defined here: https://github.com/ofiwg/libfabric/blob/v1.22.0/prov/verbs/src/verbs_ep.c#L1159
The issue was introduced in commit: https://github.com/ofiwg/libfabric/commit/da62d0f79ea7ced3ae81acd90347733c696bf8e8
Environment:
Additional context The issue appears to arise because vrb_ep_ops is a global variable shared across threads, and the modifications are not protected by a mutex or any thread-synchronization mechanism.
This breaks the FI_THREAD_SAFE threading model, where thread safety is expected when using multiple threads concurrently.
Potential Fix: Synchronize access to vrb_ep_ops using a mutex or move to a per-instance structure to avoid shared mutable state.