ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/
Other
573 stars 382 forks source link

Problems with user provided context in PSM2 provider #3150

Closed sithhell closed 7 years ago

sithhell commented 7 years ago

I am currently running into problems when using the PSM2 provider on an OmniPath interconnect. What I am seeing is memory corruption when doing a call to fi_recv using the following scheme:

struct receiver
{
    void post_receive()
    {
        fi_recv(endpoint, buf, len, desc, FI_ADDR_UNSPEC, this);
    };
    // necessary members go here...
};

The reason I pass on this as the context is to handle an incoming receive completion local to that class. What I observe now, is that the object's memory is corrupted after fi_recv returned. I identified the following code as the source of error: https://github.com/ofiwg/libfabric/blob/master/prov/psm2/src/psmx2_msg.c#L114 (this sets fi_context to the user provided context), which isn't bad per se however, the memory gets overriden eventually: https://github.com/ofiwg/libfabric/blob/master/prov/psm2/src/psmx2_msg.c#L131-L140

Any advise how to fix this? From a cursory look, other providers don't seem to have that problem.

shefty commented 7 years ago

The psm2 provider uses FI_CONTEXT, which is set as an fi_info mode bit. That requires that the user context passed into any data transfer operation be a pointer to struct fi_context. That context is for use by the provider, and allows the provider to avoid allocating a separate per operation context.

Is the app specifying support for FI_CONTEXT (i.e. setting fi_info->mode |= FI_CONTEXT) and using the API correctly when set?

sithhell commented 7 years ago

The psm2 provider uses FI_CONTEXT, which is set as an fi_info mode bit. That requires that the user context passed into any data transfer operation be a pointer to struct fi_context. That context is for use by the provider, and allows the provider to avoid allocating a separate per operation context.

Ok, I was not aware of that. This also means that my application needs to perform a lookup to associate the passed fi_context to my application provided context, correct?

Is the app specifying support for FI_CONTEXT (i.e. setting fi_info->mode |= FI_CONTEXT) and using the API correctly when set?

No it was not. The number one reason for that is the use case I initially described. IIUC, this would render this usecase unusable and I have to maintain an additional application logic to achieve my initial goal. Is this also compatible with the GNI provider?

sithhell commented 7 years ago

Would you accept a patch that extends fi_context to have an additional void *context (or whichever name you prefer) pointer that would allow to carry additional application specific information?

sithhell commented 7 years ago

The patch is fairly simple:

diff --git a/include/rdma/fabric.h b/include/rdma/fabric.h
index 956cd4e..206baf1 100644
--- a/include/rdma/fabric.h
+++ b/include/rdma/fabric.h
@@ -575,10 +575,12 @@ void fi_freeparams(struct fi_param *params);
 #ifndef FABRIC_DIRECT_
 struct fi_context {
        void                    *internal[4];
+    void            *context;
 };

 struct fi_context2 {
        void                    *internal[8];
+    void            *context;
 };
 #endif
shefty commented 7 years ago

Would you accept a patch that extends fi_context to have an additional void *context (or whichever name you prefer) pointer that would allow to carry additional application specific information?

The way apps typically handle fi_context is to embed it into their own context structures. Then use container_of() to cast back. E.g.

sruct my_context { struct fi_context ofi_context; void my_user_context; void user_buffer; size_t user_transfer_size; ... };

Since many libfabric users require their own context structure, it's usually trivial to include the fi_context structure as well. Doing this eliminates the provider to allocating/returning the data structure from an internal pool.

sithhell commented 7 years ago

Am 24.07.2017 6:34 nachm. schrieb "Sean Hefty" notifications@github.com:

Would you accept a patch that extends fi_context to have an additional void *context (or whichever name you prefer) pointer that would allow to carry additional application specific information?

The way apps typically handle fi_context is to embed it into their own context structures. Then use container_of() to cast back. E.g.

I have a problem with container_of not working with pure virtual base classes (c++, due to constraints on offsetof). I use a cast to a base pointer, then call it's virtual function to dispatch to the correct handler. Using container_of in that scenario is undefined. I'll to have to ponder a bit on how I'll make this work if you don't want to include the explicit back pointer.

sruct my_context { struct fi_context ofi_context; void my_user_context; void user_buffer; size_t user_transfer_size; ... };

Since many libfabric users require their own context structure, it's usually trivial to include the fi_context structure as well. Doing this eliminates the provider to allocating/returning the data structure from an internal pool.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ofiwg/libfabric/issues/3150#issuecomment-317479437, or mute the thread https://github.com/notifications/unsubscribe-auth/AADRlwB0V0IbTjDOtCprkVSAKKGR_P7_ks5sRMeggaJpZM4OfW5_ .

shefty commented 7 years ago

The problem with trying to modify struct fi_context is that it breaks applications. We can't make that change.

epaulson10 commented 7 years ago

You don't necessarily need to use container_of(). I just put struct fi_context at the top of my data structures and then ignore it forever.

sithhell commented 7 years ago

You don't necessarily need to use container_of(). I just put struct fi_context at the top of my data structures and then ignore it forever.

This has the same problems (in my case) than offsetof.

What I'll do is to use your suggestion though and just create another layer of indirection. Just for reference, the application context will look like this:

struct receiver : struct some_base
{
    receiver() { context_.this_ = this; };
    void post_receive()
    {
        fi_recv(endpoint, buf, len, desc, FI_ADDR_UNSPEC, &context_);
    };

    void handle_recv();
    // members and stuff ...
    struct
    {
        fi_context dummy_;
        some_base *this_;

        void handle_recv() { this_->handle_recv(); }
    } context_;
};

This covers my use case nicely, and hopefully fixes the issue. Thanks for the help!

sithhell commented 7 years ago

Just one more note: This is technically still UB in C++ country (due to lifetime issues, if you are interested I can explain further), but since the pointer is passed to a C library, it's just fine.