nutanix / libvfio-user

framework for emulating devices in userspace
BSD 3-Clause "New" or "Revised" License

partial socket IO isn't handled #306

Open jlevon opened 3 years ago

jlevon commented 3 years ago

Blocking I/O in API

Note that neither the listen fd nor the conn fd are non-blocking. Instead, the get_request_sock() callback sets non-blocking recvmsg() flags if the LIBVFIO_USER_FLAG_ATTACH_NB flag is set.

If we're supposed to be non-blocking, there are multiple paths whereby we might block still:

vfu_dma_read/write(),vfu_irq_message()
  ->vfu_msg()->vfu_msg_fds()->vfu_msg_iovec()->vfu_send_iovec()->sendmsg()
                                             ->vfu_recv_fds()->get_msg()->recvmsg()
                                                             ->recv_blocking()->recvmsg()
process_request()->vfu_send_iovec()->sendmsg() # replies

NB: the name recv_blocking() is misleading: O_NONBLOCK is never set on the socket anyway, so the function doesn't change anything about the socket's behaviour.

In practice, this is usually not a problem: we're either expecting a quick response (for vfu_dma_read/write()), or we expect to be able to write to the pipe anyway (for process_request() replies).

But it's certainly not good for users such as SPDK.

Short reads/writes

If we ever have a short read or write (indicating that the message couldn't be handled in one go), then we'll break completely (I think), with no way to recover.

All the sock code is at issue here, namely vfu_send_iovec(), get_msg(), and recv_blocking(). If we're in blocking mode, we could at least loop until we've received a full message.

In non-blocking mode, we could still loop. In the worst case, we'd sit at 100% CPU and hang the runtime (SPDK). The alternative is for us to somehow save the state and return from vfu_run_ctx() / vfu_dma_read/write() / vfu_irq_message(), so that on the next call we can pick it up and try to finish the read/write. This has a bunch of problems, though. For starters, the library is not re-entrant, so other contexts generally can't come in and call the library again. (And another SPDK thread could call e.g. vfu_dma_read().) To do this we'd effectively have to rewrite the entire library to be properly async event-loop based.

In practice, again, we're generally OK; we've not yet seen this issue: the pipe buffer (65K) is big in comparison to message size, so as long as the other side of the socket is active, we're generally fine. For reads, the messages are also likely to be smaller than PIPE_BUF.

tmakatos commented 3 years ago

If we're supposed to be non-blocking, there are multiple paths whereby we might block still:

I think it's wrong for vfu_dma_read/write()/vfu_irq_message() to wait for a quick response: there's no guarantee that the response will be quick, and it doesn't work anyway, since the client might send a request which we'll mistake for the response we're waiting for (see https://github.com/nutanix/libvfio-user/issues/279 for more details).

Assuming we fix the above, I think we should reconsider fixing this properly: can we really rely on it working the way it's implemented now if the host is under stress (fragmented memory, lots of context switches, lots of open files)?

jlevon commented 3 years ago

It's definitely brittle right now.

There are really only two options to fix:

  1. we go to sleep on a CV until we get a reply matching our ID. This is basically the way the qemu client code works.
  2. we return immediately from the API call, queue up the message+reply, then call a user-provided callback when we're done.

Only 2. can match the SPDK use case, I think, and it's probably not worse in terms of implementation than 1.

It's probably not too painful to do, but I think a lot might depend on callback expectations from SPDK.

swapnili commented 3 years ago

At least for now, SPDK is not using vfu_dma_read/write()/vfu_irq_message(), and I'm not sure it ever will. Until there's a real use case for handling the reply asynchronously, maybe we keep the behaviour simple: vfu_dma_read/write()/vfu_irq_message() calls block until the reply is received from the client.

jlevon commented 3 years ago

We could reject those calls if ATTACH_NB is set, maybe.

We still have the reply path of process_request() though.

tmakatos commented 3 years ago

I think #296 is related to this in the sense that we have to store information in order to be able to send a response from a different context. The thing that makes it easy is that we don't have to support processing more requests while the single request on the migration region is pending. I think this solves the problem that the library is non-reentrant.