Closed cgehler closed 6 years ago
Please do. I'll try to reproduce internally.
Additionally, please attach the following information:
Am working on getting the reproducer for you.
In the mean time: I was running on Crystal, CLE-6.0UP04, using the CSS CLE-6.x build, most likely 2da1515. (Did not build the library myself.)
I believe we need to
ret = -FI_EINVAL
before the
goto err_mr_reg
in _gnix_sendv.
I was looking at that, but the goto happens later unless I'm reading it incorrectly. https://github.com/ofi-cray/libfabric-cray/blob/master/prov/gni/src/gnix_msg.c#L3348
Also the error is coming from recvv
Did you have the correct number of memory descriptors for the iov?
No, when I stumbled upon this, I did not. I provided only one memory descriptor (my error), but there were >1 entries in the iov.
I am pretty sure the error is coming from sendv: https://github.com/ofi-cray/libfabric-cray/blob/master/prov/gni/src/gnix_msg.c#L3583
I am working on pairing down the program to make it simple for you to build. I have a test program that reproduces it on Crystal, located in css, but it pulls in modules from our test library for helper functions. /home/users/cgehler/css/libfabric/libfabric-ostest/RUN/bin/issue1390 (binary) /home/users/cgehler/css/libfabric/libfabric-ostest/datatransfer/msg/issue1390.c (To reproduce, run: issue1390 -n 4 -s 1024 )
Basically, to reproduce it, provide a valid memory descriptor for the first entry, and NULL for the rest of the mr_desc entries.
That's the same place I identified. It's definitely wrong to not set the error code before the goto. I think it may be more expedient to make a test build for @cgehler than for her to trim down her test case.
This caused the gni code to hit the line outputting the debug message: "invalid memory registration" (~line 3345 of gnix_msg.c). T
Are we sure its the same error? The line you provided was 3345, not 3583.
Regardless, the line at 3583 is incorrect and should be fixed.
Test with the developer-provided fix was successful. fi_sendv now returns FI_EINVAL, as expected, in this situation.
Thanks!
The GNI provider returns early from fi_sendv due to an error situation without first setting the return code to something other than FI_SUCCESS. Thinking the fi_sendv was successful, then, the application hangs waiting for a send completion.
In my program, the cumulative size of the iovec buffer lengths >= GNI_MSG_RENDEZVOUS_THRESHOLD. In error, I provided a single mr_desc, rather than an array of them. This caused the gni code to hit the line outputting the debug message: "invalid memory registration" (~line 3345 of gnix_msg.c). The gni code then goes to an error exit (err_mr_reg), where it does a _gnix_fr_free and then returns. The return code is never set to a value other than its initialized value, FI_SUCCESS.
I can provide a reproducer, if needed.