Closed sithhell closed 7 years ago
The psm2 provider uses FI_CONTEXT, which is set as an fi_info mode bit. That requires that the user context passed into any data transfer operation be a pointer to struct fi_context. That context is for use by the provider, and allows the provider to avoid allocating a separate per operation context.
Is the app specifying support for FI_CONTEXT (i.e. setting fi_info->mode |= FI_CONTEXT) and using the API correctly when set?
The psm2 provider uses FI_CONTEXT, which is set as an fi_info mode bit. That requires that the user context passed into any data transfer operation be a pointer to struct fi_context. That context is for use by the provider, and allows the provider to avoid allocating a separate per operation context.
Ok, I was not aware of that. This also means that my application needs to perform a lookup to associate the passed fi_context
to my application provided context, correct?
Is the app specifying support for FI_CONTEXT (i.e. setting fi_info->mode |= FI_CONTEXT) and using the API correctly when set?
No it was not. The number one reason for that is the use case I initially described. IIUC, this would render this usecase unusable and I have to maintain an additional application logic to achieve my initial goal. Is this also compatible with the GNI provider?
Would you accept a patch that extends fi_context to have an additional void *context
(or whichever name you prefer) pointer that would allow to carry additional application specific information?
The patch is fairly simple:
diff --git a/include/rdma/fabric.h b/include/rdma/fabric.h
index 956cd4e..206baf1 100644
--- a/include/rdma/fabric.h
+++ b/include/rdma/fabric.h
@@ -575,10 +575,12 @@ void fi_freeparams(struct fi_param *params);
#ifndef FABRIC_DIRECT_
struct fi_context {
void *internal[4];
+ void *context;
};
struct fi_context2 {
void *internal[8];
+ void *context;
};
#endif
Would you accept a patch that extends fi_context to have an additional void *context (or whichever name you prefer) pointer that would allow to carry additional application specific information?
The way apps typically handle fi_context is to embed it into their own context structures. Then use container_of() to cast back. E.g.
sruct my_context { struct fi_context ofi_context; void my_user_context; void user_buffer; size_t user_transfer_size; ... };
Since many libfabric users require their own context structure, it's usually trivial to include the fi_context structure as well. Doing this eliminates the provider to allocating/returning the data structure from an internal pool.
Am 24.07.2017 6:34 nachm. schrieb "Sean Hefty" notifications@github.com:
Would you accept a patch that extends fi_context to have an additional void *context (or whichever name you prefer) pointer that would allow to carry additional application specific information?
The way apps typically handle fi_context is to embed it into their own context structures. Then use container_of() to cast back. E.g.
I have a problem with container_of not working with pure virtual base classes (c++, due to constraints on offsetof). I use a cast to a base pointer, then call it's virtual function to dispatch to the correct handler. Using container_of in that scenario is undefined. I'll to have to ponder a bit on how I'll make this work if you don't want to include the explicit back pointer.
sruct my_context { struct fi_context ofi_context; void my_user_context; void user_buffer; size_t user_transfer_size; ... };
Since many libfabric users require their own context structure, it's usually trivial to include the fi_context structure as well. Doing this eliminates the provider to allocating/returning the data structure from an internal pool.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ofiwg/libfabric/issues/3150#issuecomment-317479437, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRlwB0V0IbTjDOtCprkVSAKKGR_P7_ks5sRMeggaJpZM4OfW5_ .
The problem with trying to modify struct fi_context is that it breaks applications. We can't make that change.
You don't necessarily need to use container_of(). I just put struct fi_context
at the top of my data structures and then ignore it forever.
You don't necessarily need to use container_of(). I just put struct fi_context at the top of my data structures and then ignore it forever.
This has the same problems (in my case) than offsetof.
What I'll do is to use your suggestion though and just create another layer of indirection. Just for reference, the application context will look like this:
struct receiver : struct some_base
{
receiver() { context_.this_ = this; };
void post_receive()
{
fi_recv(endpoint, buf, len, desc, FI_ADDR_UNSPEC, &context_);
};
void handle_recv();
// members and stuff ...
struct
{
fi_context dummy_;
some_base *this_;
void handle_recv() { this_->handle_recv(); }
} context_;
};
This covers my use case nicely, and hopefully fixes the issue. Thanks for the help!
Just one more note: This is technically still UB in C++ country (due to lifetime issues, if you are interested I can explain further), but since the pointer is passed to a C library, it's just fine.
I am currently running into problems when using the PSM2 provider on an OmniPath interconnect. What I am seeing is memory corruption when doing a call to
fi_recv
using the following scheme:The reason I pass on this as the context is to handle an incoming receive completion local to that class. What I observe now, is that the object's memory is corrupted after fi_recv returned. I identified the following code as the source of error: https://github.com/ofiwg/libfabric/blob/master/prov/psm2/src/psmx2_msg.c#L114 (this sets
fi_context
to the user provided context), which isn't bad per se however, the memory gets overriden eventually: https://github.com/ofiwg/libfabric/blob/master/prov/psm2/src/psmx2_msg.c#L131-L140Any advise how to fix this? From a cursory look, other providers don't seem to have that problem.