[usnic] Sporadically invalid Rx message size (as reported by the CQ event)

wavesoft commented 8 years ago

Hi all, I am using the master (dfd9f4a) version of the library, testing the usnic provider through a benchmark utility as I explain in #1826 (including guidelines on how to reproduce). I am using EP_MSG endpoints and I am sending messages with different sizes in one direction.

I was expecting that MSG endpoints ensure reliable transmission, therefore I either expect to see a complete message arrived or a reception error. However, what I see is that I receive messages with a size different than the size sent. In addition, the bigger the message size, the more reception errors I see.

The curious part is that the message lengths seem to be some particular multiplicand and not just random values. For example, I am sending 100,000 messages, and I see the following:

No errors on messages < MTU (=9000)
Sending message length = 9000 (+header = just above MTU)
- Received 1 faulty message
- Fauly lengths : 62
Sending message length = 18000
- Received 2154 faulty messages
- Fauly lengths : 9062, 124
Sending message length = 102400
- Received 41960 faulty messages
- Fauly lengths : 13020, 21958, 30896, 4082, 48772, 57710, 66648, 75586, 84524, 93462

Also, note that when a packet of invalid size arrives, I see a spike on network activity on the opposite direction (assuming this is some kind of recovery procedure?)

I am confident it's not the benchmark utility because I don't see the same behavior with the sockets provider.

For convenience I am repeating the process to build the sample and re-produce the error:

# Work in an isolated directory
mkdir bug; cd bug
BASEDIR=$(pwd)
# Prepare nanomsg
git clone https://github.com/wavesoft/nanomsg.git
cd nanomsg; ./autogen.sh; cd ..
# Apply OFI patch
git clone https://github.com/wavesoft/nanomsg-transport-ofi.git
cd nanomsg-transport-ofi; git checkout devel
./patch-nanomsg.sh $BASEDIR/nanomsg
# Comment-out line 32 from src/transports/ofi/ofi.h
# in order to disable debug logging (optional)
cd ../nanomsg
vi src/transports/ofi/ofi.h
# Build and install nanomsg in the local folder
./configure --prefix=$BASEDIR/local
make -j8 && make install
# Build the nanomsg tests
cd ../nanomsg-transport-ofi/test
make nanomsg NANOMSG_DIR=$BASEDIR/local

Then, start one server and one client:

# Start a server (the IP address should be a local usnic address)
# The third argument is the size of the message (defaults to 10240)
./test_nanomsg_timing node0 ofi://192.168.0.1:5050 9000
# Start a client (in another machine)
# The message size must be the same as node0
./test_nanomsg_timing node1 ofi://192.168.0.1:5050 9000

When an error occurs, you will see messages like this on the node0:

!! Received 93462 instead of 102400

goodell commented 8 years ago

@bturrubiates can you take a look here?

bturrubiates commented 8 years ago

@goodell Sure I'll take a look.

However, what I see is that I receive messages with a size different than the size sent. In addition, the bigger the message size, the more reception errors I see.

When you say reception errors, do you mean that the data in the message is invalid?

No errors on messages < MTU (=9000) Sending message length = 9000 (+header = just above MTU) Received 1 faulty message Fauly lengths : 62 Sending message length = 18000 Received 2154 faulty messages Fauly lengths : 9062, 124 Sending message length = 102400 Received 41960 faulty messages Fauly lengths : 13020, 21958, 30896, 4082, 48772, 57710, 66648, 75586, 84524, 93462

So you were expecting a size of 9000 or greater and got a length of 62 in one instance?

@wavesoft I'll try this out soon, thanks for providing something I can reproduce with.

wavesoft commented 8 years ago

Hello @bturrubiates

When you say reception errors, do you mean that the data in the message is invalid?

Right, I didn't clarify this... I am not currently checking the integrity of the message, I am only checking it's size.

So you were expecting a size of 9000 or greater and got a length of 62 in one instance?

I am expecting a message of size 9000 bytes exactly and I get a length of 62 -- as reported by the CQ event always.

bturrubiates commented 8 years ago

I am expecting a message of size 9000 bytes exactly and I get a length of 62 -- as reported by the CQ event always.

Oh that's interesting. So it's consistently wrong in the same way? That should make this easier to debug.

wavesoft commented 8 years ago

Oh that's interesting. So it's consistently wrong in the same way? That should make this easier to debug.

Yes. Consistently wrong in the same way, for the same message size. The message lengths that I mention are the different lengths that I encounter. The probabilities of most of them are (more or less) evenly distributed. If you want more details about their distribution I could collect more detailed information...

Ps. I refer to the consistency of the error. I don't always get the size wrong, but when I do get it the error is consistent.

bturrubiates commented 8 years ago

Yes. Consistently wrong in the same way, for the same message size. The message lengths that I mention are the different lengths that I encounter. The probabilities of most of them are (more or less) evenly distributed. If you want more details about their distribution I could collect more detailed information...

Thanks, I'll see if I can spot what's going on. No need to collect distribution information.

bturrubiates commented 8 years ago

@wavesoft

Having trouble getting this to run. The client side always terminates after receiving an EAGAIN on something. I'm at the same libfabric commit.

[-01426994272] OFI[C]: Creating connected OFI socket
[-01426994272] OFI[C]: Createing socket for (domain=10.5.20.195, service=5050)
OFI[H]: Using fabric=usnic_0, provider=usnic
[-01426994272] OFI[C]: nn_cofi_handler state=1, src=-2, type=-2
[-01426994272] OFI[C]: Creating new SOFI
[-01426994272] OFI[i]: Initializing Input FSM
[-01426994272] OFI[i]: Allocating buffers len=2, size=131072
[-01426994272] OFI[i]: Allocated RxMR chunk=0xb83690, key=61953
[-01426994272] OFI[i]: Allocated RxMR chunk=0xb836d0, key=61954
[-01426994272] OFI[o]: Initializing Output FSM
[-01426994272] OFI[-]: Initializing MRM with len=2, key=62209
[-01426994272] OFI[-]: Allocated MRM chunk=0xba3db0
[-01426994272] OFI[-]: Allocated MRM chunk=0xba3e98
[-01426994272] OFI[S]: Starting FSM
[-01426994272] OFI[S]: nn_sofi_handler state=1001, src=-2, type=-2
[-01426994272] OFI[S]: Performing handshake
[-01426994272] OFI[S]: Posting buffers for receiving handshake
[-01426994272] OFI[i]: Blocking RX Post max_sz=4
[-01426994272] OFI[i] ### POSTING RECEIVE BUFFER len=1024, ctx=0x6431a0
[-01426994272] OFI[S]: Sending handshake
[-01426994272] OFI[o]: Blocking TX max_sz=4, timeout=1000
[-01426994272] OFI[o] ### POSTING SEND BUFFER len=4
[-01426994272] OFI[o]: Blocking TX completed (ctx=0x643450)
[-01426994272] OFI[S]: Receiving handshake
[-01426994272] OFI[i]: Blocking RX Recv max_sz=4, timeout=1000
[-01426994272] OFI[i]: Blocking RX Completed, len=4, ctx=0x6431a0
[-01426994272] OFI[i]: Starting Input FSM
[-01426994272] OFI[i]: nn_sofi_in_handler state=2001, src=-2, type=-2
[-01426994272] OFI[i]: chunk=0xb83690, flags=0
[-01426994272] OFI[i]: chunk=0xb836d0, flags=0
[-01426994272] OFI[i]: >>>>>> Grab MR 0xb83690
[-01426994272] OFI[i]: Posting buffers from RxMR chunk=0xb83690 (ctx=0xb836a8, buf=0xb83748)
[-01426994272] OFI[i] ### POSTING RECEIVE BUFFER len=131072, ctx=0xb836a8
[-01426994272] OFI[i]: Input buffers posted
[-01426994272] OFI[i]: Switching to NN_SOFI_IN_STATE_READY
[-01426994272] OFI[S]: nn_sofi_handler state=1003, src=1101, type=2201
[-01426994272] OFI[S]: Initializing OFI-Output
[-01426994272] OFI[o]: Starting Output FSM
[-01426994272] OFI[o]: nn_sofi_out_handler state=3001, src=-2, type=-2
[-01426994272] OFI[S]: nn_sofi_handler state=1004, src=1102, type=3201
[-01426994272] OFI[S]: Initialized, switching to RUNNING
TIM: I will be sending
-- Sending 0
[-01426994272] OFI[S]: NanoMsg SEND event
[-01426994272] OFI[-]: Picking MRM chunk for ptr=0xba40e8
[-01426994272] OFI[-]: Found MRM free chunk=0xba3db0
[-01426994272] OFI[-]: Managing MRM base=0xba3db0 chunk=0xba3db0 index=1
[-01426994272] OFI[-]: Registering ptr=0xba40e8, len=9000, key=0x0000f302
[-01373559040] OFI[p]: Starting poller thread
[-01426994272] --LOCKING 0xba3db0--
Resource temporarily unavailable [11] (src/utils/efd_eventfd.inc:91)
[1]    11553 abort      ./test_nanomsg_timing node1 ofi://10.5.20.195:5050 9000

wavesoft commented 8 years ago

Ah, there might be a commit missing from my side. Can you try commenting-out this line?: https://github.com/wavesoft/nanomsg-transport-ofi/blob/devel/src/transports/ofi/utils/mrm.c#L188

bturrubiates commented 8 years ago

Ah, there might be a commit missing from my side. Can you try commenting-out this line?: https://github.com/wavesoft/nanomsg-transport-ofi/blob/devel/src/transports/ofi/utils/mrm.c#L188

Thanks, that seems to have worked.

bturrubiates commented 8 years ago

@wavesoft Are both ends of the program supposed to be ending? I shortened the iteration count to 10 to make it a bit more manageable and I noticed that node0 doesn't always complete all of the iterations. Is that also a related problem?

edit: There's definitely something wrong with the reliability management. I haven't quite tracked it down yet.

wavesoft commented 8 years ago

@wavesoft Are both ends of the program supposed to be ending? I shortened the iteration count to 10 to make it a bit more manageable and I noticed that node0 doesn't always complete all of the iterations. Is that also a related problem?

Ideally yes, they should both finish after sending/receiving the same number of messages. I noticed that if the sending end (node1) always finishes, but the receiving end (node0) halts at the moment the node1 is terminated. That's because node0 will infinitely wait for the correct number of messages to arrive. (I have a keepalive timeout mechanism planned but currently disabled -- since I don't receive any FI_SHUTDOWN event on the endpoint).

This raises a question which I don't know if it's a bug: node1 managed to send all of it's messages, but the moment the link was interrupted, node0 stopped receiving messages. Shouldn't the queued messages arrive to the destination, or it makes more sense to discard the queue the moment the connection is interrupted?

For the latter I guess a simple solution would be to wait for all messages to be sent before closing the socket. But how can I implement this? I noticed FI_DELIVERY_COMPLETE on fi_send, but I also need a CQ event the moment I can re-use the buffer. Can I listen for both events in different CQs somehow?

bturrubiates commented 8 years ago

I noticed that if the sending end (node1) always finishes, but the receiving end (node0) halts at the moment the node1 is terminated. That's because node0 will infinitely wait for the correct number of messages to arrive. (I have a keepalive timeout mechanism planned but currently disabled -- since I don't receive any FI_SHUTDOWN event on the endpoint).

This raises a question which I don't know if it's a bug: node1 managed to send all of it's messages, but the moment the link was interrupted, node0 stopped receiving messages. Shouldn't the queued messages arrive to the destination, or it makes more sense to discard the queue the moment the connection is interrupted?

Are you checking send completions? If so, then this is a bug. A send completion for a message shouldn't arrive until it has been acked by the other side.

wavesoft commented 8 years ago

I am binding my Tx CQ with the FI_TRANSMIT flag, because this is the most common set-up that all providers should have implemented. But you might be right, I think I am shutting down the socket as fast as possible, without waiting for all the pending tx CQ events.

However, in order to increase throughput I didn't want to wait for an acknowledgement from the other side before sending the next packet, so my approach was to call fi_send as frequently as possible, and wait for a CQ event to release the associated buffers for re-use later. Now, optimizing this even further, I thought of using the FI_SELECTIVE_COMPLETION flag with the CQ and use the FI_INJECT_COMPLETE on fi_sendmsg in order to re-use the buffers even quicker.

bturrubiates commented 8 years ago

I am binding my Tx CQ with the FI_TRANSMIT flag, because this is the most common set-up that all providers should have implemented. But you might be right, I think I am shutting down the socket as fast as possible, without waiting for all the pending tx CQ events.

Oh, it might be worth seeing if the same behavior (ending early) is experienced if you reap the TX cq events on the transmit side.

However, in order to increase throughput I didn't want to wait for an acknowledgement from the other side before sending the next packet, so my approach was to call fi_send as frequently as possible, and wait for a CQ event to release the associated buffers for re-use later. Now, optimizing this even further, I thought of using the FI_SELECTIVE_COMPLETION flag with the CQ and use the FI_INJECT_COMPLETE on fi_sendmsg in order to re-use the buffers even quicker.

The usNIC provider currently provides support for FI_INJECT_COMPLETE, FI_TRANSMIT_COMPLETE, and FI_DELIVERY_COMPLETE. The behavior of the completion flags should all be identical and should behave according to the description for FI_DELIVERY_COMPLETE. For reliable endpoints, I'm not sure it makes sense for a completion to be generated until it is safely on the other side.

The only one that provides immediate buffer access is using the FI_INJECT flag. Using FI_INJECT places a size restriction per message that is quite small. I believe this is exposed as inject_size through the TX attributes.

wavesoft commented 8 years ago

Oh, it might be worth seeing if the same behavior (ending early) is experienced if you reap the TX cq events on the transmit side.

I am always waiting for the TX cq before freeing a particular buffer, but I also do push more than one fi_send request at the same time. In detail I have the following set-up:

I have a finite number of memory registration "banks". Each bank contains the MR description and some meta-data (namely base address, length and ID).
When a user tries to send a message, I check if the pointer is already registered in one of these banks
- If yes, I use the MR description from that bank
- If no, I pick the oldest, free one, I de-register the previous region and I register the new one
- If there are no free banks, the operation blocks until one becomes available
I then mark the particular bank as 'in-use' and I call fi_send (without waiting for any previous fi_send to complete!)
When the fi_send eventually completes (asynchronously, waiting for a FI_TRANSMIT CQ event), I mark the MR bank as 'free' allowing it to be re-used.

I do so, aiming to saturate the link as much a possible, since waiting for an acknowledgement after every message slows the communication down.

Is such use expected? Did you manage to find anything wrong on the provider or is there something wrong with my approach?

The usNIC provider currently provides support for FI_INJECT_COMPLETE, FI_TRANSMIT_COMPLETE, and FI_DELIVERY_COMPLETE. The behavior of the completion flags should all be identical and should behave according to the description for FI_DELIVERY_COMPLETE. For reliable endpoints, I'm not sure it makes sense for a completion to be generated until it is safely on the other side.

Fair point. Indeed it only makes sense to generate a CQ event when the message is delivered. I was just thinking if it's possible to increase the throughput by stacking more fi_send requests. For example, waiting for a FI_INJECT_COMPLETE event will allow the MR bank to be released quicker, allowing more fi_send requests to be placed.

bturrubiates commented 8 years ago

Is such use expected? Did you manage to find anything wrong on the provider or is there something wrong with my approach?

I can't say for sure right now, but it definitely seems there are a couple of things wrong with the provider code. From what I read, the approach sounds fine. I haven't had much time to look at it. Should have more time next week, sorry for the delay.

Fair point. Indeed it only makes sense to generate a CQ event when the message is delivered. I was just thinking if it's possible to increase the throughput by stacking more fi_send requests. For example, waiting for a FI_INJECT_COMPLETE event will allow the MR bank to be released quicker, allowing more fi_send requests to be placed.

I think in order to support that it would need to introduce an extra alloc or memcpy into the send path which would probably also slow things down significantly. Let me think about this more.

goodell commented 8 years ago

Fair point. Indeed it only makes sense to generate a CQ event when the message is delivered. I was just thinking if it's possible to increase the throughput by stacking more fi_send requests. For example, waiting for a FI_INJECT_COMPLETE event will allow the MR bank to be released quicker, allowing more fi_send requests to be placed.

The message will have to be buffered somewhere though until fully received and acknowledged by the other side, and at some point there would need to be back pressure if you can feed things into the queue faster than it can drain to the network. Unless you're sending only very small messages or using an undersized queue, it makes the most sense to me to use the completions as the back pressure mechanism and keep the queueing fairly explicit.

wavesoft commented 8 years ago

Hello @bturrubiates , @goodell and thanks for your feedback. The optimised version is not urgent, so take your time. It just would be good to know what's wrong with the message sizes...

The message will have to be buffered somewhere though until fully received and acknowledged by the other side, and at some point there would need to be back pressure if you can feed things into the queue faster than it can drain to the network. Unless you're sending only very small messages or using an undersized queue, it makes the most sense to me to use the completions as the back pressure mechanism and keep the queueing fairly explicit.

I already have such mechanism in place. If I don't get Tx CQs on time, the available MR banks will be exhausted and the next Tx operation will block, effectively throttling the speed. However I am still far from saturating the link and I don't think I ever encounter back-pressure from the nic (so far I managed to get 25Gbit on 40Gbit link, with a single-process, single-connection set-up - though I don't trust my numbers until I have fixed all the little issues in my transport)

Now that I think of it, for optimal use of resources I would set the number of my MR banks to be equal with the size of the provider queue. Is there any way to get this number through the libfabric API?

goodell commented 8 years ago

I already have such mechanism in place. If I don't get Tx CQs on time, the available MR banks will be exhausted and the next Tx operation will block, effectively throttling the speed. However I am still far from saturating the link and I don't think I ever encounter back-pressure from the nic (so far I managed to get 25Gbit on 40Gbit link, with a single-process, single-connection set-up - though I don't trust my numbers until I have fixed all the little issues in my transport)

It's possible the reliability protocol has some inefficiency, though I thought we could get full link bandwidth using EP_MSG last time I checked (quite a while ago at this point). If you can't saturate the link after fixing your number of buffers, let us know and we can help you look into it. Feel free to contact Ben, Jeff, and me via email for that sort of thing.

Now that I think of it, for optimal use of resources I would set the number of my MR banks to be equal with the size of the provider queue. Is there any way to get this number through the libfabric API?

Check out info->tx_attr->size (i.e., fi_tx_attr::size). There's a comparable RX attribute as well.

wavesoft commented 8 years ago

Hello @goodell , apparently there must be something wrong with our set-up because we can't seem to get a throughput bigger than 25Gbps, even with TCP+nload+multiple streams. We tried different cables (copper and optical) with the same result. The usd_devinfo utility always reports a bandwidth of 40Gbits... So I am not sure if this is actually the source of the bug I reported.

I also tried what @bturrubiates suggested, and I am now waiting for all the CQs before stopping the socket, but that also doesn't help too.

With the new version of my transport (devel-ofiw branch), the error is more consistent:

On small message sizes (<MTU), the CQ size reported is always the double than the length of the data sent
On larger message sizes I see a similar behavior, like the one I mentioned in the first post, but the reported size is at least the double.

Among with other changes, in the new version I am saturating the provider's tx queue (posting as quickly as possible till I reach fi_tx_attr::size posts, and then back-pressure is applied through the CQ events), which also means that a big number of messages is always in transit.

Likewise, on the receiving end, I am posting the receive buffers as fast as I can (ex. at the time I receive the Rx CQ event), and back-pressure is applied again through the CQ events. In my transport I don't use more than 2 Rx buffers.

I also tried to create the Rx CQ with .size = 1 attribute, but this doesn't seem to affect the outcome. I still see message size reported by CQ to be double.

Finally, you might be able to reproduce the problem consistently. If not, then we can assume that there is something wrong with my hardware set-up, and I am open to suggestions (perhaps over e-mail).

Like before, here is the copy-paste-able version of the steps you need to reproduce the error:

# Work in an isolated directory
mkdir bug; cd bug
BASEDIR=$(pwd)
# Prepare nanomsg
git clone https://github.com/wavesoft/nanomsg.git
cd nanomsg; git checkout pull-nn_allocmsg_ptr; ./autogen.sh; cd ..
# Apply OFI patch
git clone https://github.com/wavesoft/nanomsg-transport-ofi.git
cd nanomsg-transport-ofi; git checkout devel-ofiw
./patch-nanomsg.sh $BASEDIR/nanomsg
# Prepare nanomsg
cd ../nanomsg
# ---
# Comment-out line 32 from src/transports/ofi/ofi.h
# in order to disable debug logging (optional)
#vi src/transports/ofi/ofi.h
# ---
# Build and install nanomsg in the local folder
./configure --prefix=$BASEDIR/local
make -j8 && make install
# Build the nanomsg tests
cd ../nanomsg-transport-ofi/test
make nanomsg NANOMSG_DIR=$BASEDIR/local

# Then start one binding endpoint
./test_nanomsg_timing node0 ofi://<local_ip>:<port>
# And one connecting endpoint
./test_nanomsg_timing node1 ofi://<remote_ip>:<port>

goodell commented 8 years ago

Let's take this discussion to email for now and we'll bring the results of the investigation back to the GH issue after we know more. I'll send you an email soon to get the ball rolling.

goodell commented 8 years ago

To tie this up, there were two issues impacting performance:

The NIC firmware version was several years old. The old firmware incorrectly allocated some internal NIC buffers on 40G cards, making it impossible to hit line rate at 40G. Updating to the latest released firmware fixes this issue.
The cards were installed in the wrong PCI slots (in this case slot 1 in a C240M4 instead of slot 2). This clipped this particular card to PCIe gen2 x8 (32 Gb/s bus BW) instead of the expected gen2 x16 (64 Gb/s), limiting the network bandwidth to ~26 Gb/s.

These have both been resolved, so performance is back into an acceptable range.

I think @wavesoft is still experiencing this issue:

With the new version of my transport (devel-ofiw branch), the error is more consistent:

On small message sizes (<MTU), the CQ size reported is always the double than the length of the data sent

On larger message sizes I see a similar behavior, like the one I mentioned in the first post, but the reported size is at least the double.

@bturrubiates will look into that issue as time allows.

wavesoft commented 8 years ago

Hello guys. As @goodell mentioned indeed the performance penalties are now fixed, but the instability issues are still there, however @bturrubiates I have some feedback for you:

I was baffled why my code isn't working, while fabtests are unaffected by this problem*, so I traced the two projects side-by-side in order to find any difference:

	fabtests	nanomsg-ofi
1	Creates CQs with FI_CQ_FORMAT_CONTEXT	Creates CQs with FI_CQ_FORMAT_MSG
2	Posts only 1 Rx buffer	Can post up to maximum queue size buffers (16 on tests)
3	Blocking-wait for CQ completion after I/O operation (using `fi_cq_sread`)	Asynchronous poll of CQ status in a different thread (using `fi_cq_read`, in spin-loop)

The difference 3 seems to be the most important: What I figured out is that if I exclude the CQ polling from my polling thread (the thread still polls for EQ events) and I try to do fi_cq_sread when possible, the code works perfectly! Effectively for me it looks like there is something wrong when doing I/O operations while polling the bound CQ. Perhaps thread-safety issues?

For the difference 2 I tried various failed attempts:

I tried to post only 1 buffer but this didn't change anything.
I also tried to increase the number of entries the fi_cq_read function returns (just in case my code gets slow) : I actually saw a few calls returning many CQ events, but I still faced the same problem.

For the difference 1 I tried to add an additional custom header in order to know the actual message size and I switched to using context-only CQ events. I saw an interesting behavior:

The first X messages are always correct and then they get corrupted at random interval (wrong header and wrong data checksum), where X is the number of different Rx buffers that I posted at once, when the connection was established.

More or less I pinpointed the problem in threading, but I would definitely like to have a second pair of eyes looking at my code since I am running out of ideas... If you are interesting and have some free time, would you like to have a webex code-review/brainstorming with me?

The problem I refer to is the following : I first start the server (binding-end) then the client (connecting-end) and the client starts sending one-direction messages. After some time I stop receiving Tx CQ operations on the client side, while on the server-side I sporadically see corrupted data (in addition to the invalid size that is now constant no matter what's the message size).

goodell commented 8 years ago

Asynchronous poll of CQ status in a different thread (using fi_cq_read, in spin-loop)

(emphasis mine)

Yes, this will almost certainly lead to corruption in the current implementation because the usnic provider isn't generally thread safe. Right now it looks like we're reporting FI_THREAD_ENDPOINT in the domain attributes for FI_EP_RDM and FI_THREAD_UNSPEC for FI_EP_MSG. We should probably be reporting FI_THREAD_COMPLETION (or possibly even FI_THREAD_DOMAIN if we're mishandling the internal peer table) instead, which would prohibit the sort of multi-threaded accesses you are currently making. Such a change would probably require you to add a lock in your code around all libfabric accesses or otherwise rearrange to not make multithreaded accesses.

@bturrubiates seems like there are a few takeaways from this:

We should update our default fi_domain_attr::threading values to be more accurate when the user specifies FI_THREAD_UNSPEC. To a first order, we should probably be reporting FI_THREAD_DOMAIN for MSG/RDM, and FI_THREAD_COMPLETION for DGRAM (_ENDPOINT might be possible there, but would require an audit).
We should add support for FI_THREAD_SAFE, since we're currently violating this bit of the man pages:

  FI_THREAD_SAFE  :  A  thread  safe   serialization   model   allows   a
  multi-threaded  application  to  access any allocated resources through
  any interface without restriction.  All providers are required to  sup-
  port FI_THREAD_SAFE.

bturrubiates commented 8 years ago

What I figured out is that if I exclude the CQ polling from my polling thread (the thread still polls for EQ events) and I try to do fi_cq_sread when possible, the code works perfectly! Effectively for me it looks like there is something wrong when doing I/O operations while polling the bound CQ. Perhaps thread-safety issues?

There are probably some thread-safety issues lurking about in the provider. Does this fix the completion length problem?

We should update our default fi_domain_attr::threading values to be more accurate when the user specifies FI_THREAD_UNSPEC. To a first order, we should probably be reporting FI_THREAD_DOMAIN for MSG/RDM, and FI_THREAD_COMPLETION for DGRAM (_ENDPOINT might be possible there, but would require an audit). We should add support for FI_THREAD_SAFE, since we're currently violating this bit of the man pages:

Yeah, we should add this for the next release.

The problem I refer to is the following : I first start the server (binding-end) then the client (connecting-end) and the client starts sending one-direction messages. After some time I stop receiving Tx CQ operations on the client side, while on the server-side I sporadically see corrupted data (in addition to the invalid size that is now constant no matter what's the message size).

So the client side is hanging and/or not reporting completions, but the server side is receiving corrupted data?

wavesoft commented 8 years ago

Yes, this will almost certainly lead to corruption in the current implementation because the usnic provider isn't generally thread safe.

Gotcha. I will try locking on endpoint-level operations first and if this doesn't work, on domain-level afterwards.

There are probably some thread-safety issues lurking about in the provider. Does this fix the completion length problem?

Unfortunately not. I still see the invalid message sizes. However the reported size is now constant (using 1.3.0 release). Trying various message sizes I see the following (MTU=9000) numbers (let me know if you want me to test some specific value):

1 b (Sent) --> 4196 b (Reported by the Rx CQ, always the same now)
10 --> 4106
16 --> 4112
20 --> 4116
24 --> 4120
30 --> 4126
32 --> 4128
100 --> 4196
1000 --> 5096
1024 --> 5120
2000 --> 6096
2048 --> 6144
100000 --> 202400
102400 --> 204800
1048576 --> 2097152

So the client side is hanging and/or not reporting completions, but the server side is receiving corrupted data?

Yes, The client (sending) side is hanging at some point (it blocks due to back-pressure since completions are not reported). The server (receiving) also stops receiving data at the same time. However till that point the data arrived are rarely correct. In detail, size reported wrong from CQ and most of the the checksum fails (trimming the data to the expected size). My guess is that due to data corruption the server drops some packets or sends wrong acknowledgement messages? Again this happens only on the threaded version.

wavesoft commented 8 years ago

Hello guys, like @goodell mentioned, I tried locking around all relevant libfabric calls, but I still see a weird behaviour.

Such a change would probably require you to add a lock in your code around all libfabric accesses or otherwise rearrange to not make multithreaded accesses.

I was wondering. If you are available, do you have some free time today or later this week to sit together and have a look on the code, just in case a second pair of eyes can spot some problems I am not aware of?

ofiwg / libfabric

[usnic] Sporadically invalid Rx message size (as reported by the CQ event) #1832