ofi-cray / libfabric-cray

Open Fabric Interfaces
http://ofiwg.github.io/libfabric/
Other
16 stars 9 forks source link

GNI provider returns FI_SUCCESS from fi_sendv in error situation #1390

Closed cgehler closed 6 years ago

cgehler commented 6 years ago

The GNI provider returns early from fi_sendv due to an error situation without first setting the return code to something other than FI_SUCCESS. Thinking the fi_sendv was successful, then, the application hangs waiting for a send completion.

In my program, the cumulative size of the iovec buffer lengths >= GNI_MSG_RENDEZVOUS_THRESHOLD. In error, I provided a single mr_desc, rather than an array of them. This caused the gni code to hit the line outputting the debug message: "invalid memory registration" (~line 3345 of gnix_msg.c). The gni code then goes to an error exit (err_mr_reg), where it does a _gnix_fr_free and then returns. The return code is never set to a value other than its initialized value, FI_SUCCESS.

I can provide a reproducer, if needed.

jswaro commented 6 years ago

Please do. I'll try to reproduce internally.

jswaro commented 6 years ago

Additionally, please attach the following information:

cgehler commented 6 years ago

Am working on getting the reproducer for you.

In the mean time: I was running on Crystal, CLE-6.0UP04, using the CSS CLE-6.x build, most likely 2da1515. (Did not build the library myself.)

chuckfossen commented 6 years ago

I believe we need to ret = -FI_EINVAL before the goto err_mr_reg in _gnix_sendv.

jswaro commented 6 years ago

I was looking at that, but the goto happens later unless I'm reading it incorrectly. https://github.com/ofi-cray/libfabric-cray/blob/master/prov/gni/src/gnix_msg.c#L3348

Also the error is coming from recvv

jswaro commented 6 years ago

Did you have the correct number of memory descriptors for the iov?

cgehler commented 6 years ago

No, when I stumbled upon this, I did not. I provided only one memory descriptor (my error), but there were >1 entries in the iov.

I am pretty sure the error is coming from sendv: https://github.com/ofi-cray/libfabric-cray/blob/master/prov/gni/src/gnix_msg.c#L3583

I am working on pairing down the program to make it simple for you to build. I have a test program that reproduces it on Crystal, located in css, but it pulls in modules from our test library for helper functions. /home/users/cgehler/css/libfabric/libfabric-ostest/RUN/bin/issue1390 (binary) /home/users/cgehler/css/libfabric/libfabric-ostest/datatransfer/msg/issue1390.c (To reproduce, run: issue1390 -n 4 -s 1024 )

Basically, to reproduce it, provide a valid memory descriptor for the first entry, and NULL for the rest of the mr_desc entries.

chuckfossen commented 6 years ago

That's the same place I identified. It's definitely wrong to not set the error code before the goto. I think it may be more expedient to make a test build for @cgehler than for her to trim down her test case.

jswaro commented 6 years ago

This caused the gni code to hit the line outputting the debug message: "invalid memory registration" (~line 3345 of gnix_msg.c). T

Are we sure its the same error? The line you provided was 3345, not 3583.

Regardless, the line at 3583 is incorrect and should be fixed.

cgehler commented 6 years ago

Test with the developer-provided fix was successful. fi_sendv now returns FI_EINVAL, as expected, in this situation.

jswaro commented 6 years ago

Thanks!