I think the existing live migration support can't work with interrupt mode, because we don't restore interrupt-related settings in the destination VM.
I've edited the log excerpt above. The first line ("transition from state resuming to state running") shows that the controller is in the running state, however it hasn't set up the queues etc. I've added additional debugging and I can see nvmf_vfio_user_poll_group_poll getting called, but it doesn't do anything because of this:
if (spdk_unlikely(sq->sq_state != VFIO_USER_SQ_ACTIVE || !sq->size)) {
	continue;
}
Ideally we should not return from the migration callback until the controller is up and running, however that's not currently possible because the callbacks cannot return -EBUSY. Alternatively, we could leave it as is and re-check the doorbells once the controller is up and running, in case we missed anything, but I'm worried we could miss other events. @jlevon thoughts?
I'm looking at the code, specifically at vfio_user_migr_ctrlr_enable_sqs:
if (nvmf_qpair_is_admin_queue(&sq->qpair)) {
	/* ADMIN queue pair is always in the poll group, just enable it */
	sq->sq_state = VFIO_USER_SQ_ACTIVE;
} else {
	spdk_nvmf_tgt_new_qpair(vu_ctrlr->transport->transport.tgt, &sq->qpair);
}
It doesn't look like we're going to miss events for the admin queue, so if we make sure that the last I/O queue to connect checks the doorbells we might avoid this problem. It still feels like a hack, though.
Taking a step back, I think expecting the device to switch instantly from the resuming state to the running state is a bit too much to ask. Maybe we could allow the transition callback to return -EBUSY and introduce a transition_done callback? This way nvmf/vfio-user can properly resume and avoid all sorts of hacks. The infrastructure is already there for the quiesce callback, we might as well reuse it.
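For reference, the quiesce infrastructure being referred to works roughly like the sketch below (from memory, so treat the EBUSY convention as an assumption; start_async_quiesce() is a placeholder for device-specific work, and a migration-side counterpart of vfu_device_quiesced() is exactly what doesn't exist today):

#include <errno.h>
#include "libvfio-user.h"

/* Placeholder for whatever asynchronous work the device needs to do. */
static void start_async_quiesce(vfu_ctx_t *vfu_ctx);

static int
device_quiesce_cb(vfu_ctx_t *vfu_ctx)
{
	/* Can't quiesce synchronously: kick off the work and tell
	 * libvfio-user that we'll report back later. */
	start_async_quiesce(vfu_ctx);
	errno = EBUSY;
	return -1;
}

/* Called from the device's own completion path once the asynchronous
 * work has finished. */
static void
quiesce_finished(vfu_ctx_t *vfu_ctx)
{
	vfu_device_quiesced(vfu_ctx, 0);
}

The proposal above is essentially to mirror this pattern for the migration state transition callback, so resuming doesn't have to pretend to be instantaneous.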
This hack seems to fix the problem:
diff --git a/lib/nvmf/vfio_user.c b/lib/nvmf/vfio_user.c
index b0339174b..2cd526097 100644
--- a/lib/nvmf/vfio_user.c
+++ b/lib/nvmf/vfio_user.c
@@ -378,6 +378,8 @@ struct nvmf_vfio_user_ctrlr {
 	/* internal CSTS.CFS register for vfio-user fatal errors */
 	uint32_t cfs : 1;
+
+	int qpairs;
 };
 
 struct nvmf_vfio_user_endpoint {
@@ -3264,6 +3267,10 @@ vfio_user_migr_ctrlr_construct_qps(struct nvmf_vfio_user_ctrlr *vu_ctrlr,
 			return -EFAULT;
 		}
 		cqs_ref[sq->cqid]++;
+
+		if (sqid != 0) {
+			vu_ctrlr->qpairs++;
+		}
 	}
 }
@@ -4503,6 +4510,7 @@ handle_queue_connect_rsp(struct nvmf_vfio_user_req *req, void *cb_arg)
 	struct nvmf_vfio_user_cq *cq;
 	struct nvmf_vfio_user_ctrlr *vu_ctrlr;
 	struct nvmf_vfio_user_endpoint *endpoint;
+	bool check_doorbells = false;
 
 	assert(sq != NULL);
 	assert(req != NULL);
@@ -4578,6 +4586,10 @@ handle_queue_connect_rsp(struct nvmf_vfio_user_req *req, void *cb_arg)
 					     sq->create_io_sq_cmd.cid, SPDK_NVME_SC_SUCCESS, SPDK_NVME_SCT_GENERIC);
 			}
 			sq->post_create_io_sq_completion = false;
+		} else {
+			if (--vu_ctrlr->qpairs == 0) {
+				check_doorbells = true;
+			}
 		}
 		sq->sq_state = VFIO_USER_SQ_ACTIVE;
 	}
@@ -4588,6 +4600,10 @@ handle_queue_connect_rsp(struct nvmf_vfio_user_req *req, void *cb_arg)
 	free(req->req.data);
 	req->req.data = NULL;
 
+	if (check_doorbells) {
+		vfio_user_handle_intr(vu_ctrlr);
+	}
+
 	return 0;
 }
I actually don't think it's a hack; it's the right thing to do: we have an async state, and we need to "catch up" on the transition.
But I'd pull in self_kick() from the shadow doorbell patch for this, then do:
@@ -4578,6 +4586,10 @@ handle_queue_connect_rsp(struct nvmf_vfio_user_req *req, void *cb_arg)
 					     sq->create_io_sq_cmd.cid, SPDK_NVME_SC_SUCCESS, SPDK_NVME_SCT_GENERIC);
 			}
 			sq->post_create_io_sq_completion = false;
+		} else {
+			/* ctrlr was already running, might have got BAR0
+			 * doorbell writes, need to catch up */
+			self_kick(vu_ctrlr);
 		}
 		sq->sq_state = VFIO_USER_SQ_ACTIVE;
 	}
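The kick itself only needs to make the controller's interrupt handler run again on the next reactor iteration. A minimal sketch of what that could look like, assuming the controller keeps an eventfd its interrupt handler is registered on (the intr_fd field name is an assumption here, and the actual shadow doorbell patch may do this differently):

#include <sys/eventfd.h>

/* Sketch only: wake up our own interrupt handler so that it re-reads the
 * doorbells. Assumes vu_ctrlr->intr_fd is the eventfd this controller's
 * interrupt handler polls (field name is an assumption). */
static void
self_kick(struct nvmf_vfio_user_ctrlr *vu_ctrlr)
{
	/* The handler won't run until the reactor polls the fd again, i.e.
	 * only after the current callback has returned. */
	eventfd_write(vu_ctrlr->intr_fd, 1);
}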
We're calling self_kick() before sq->sq_state = VFIO_USER_SQ_ACTIVE; and before TAILQ_INSERT_TAIL(&vu_ctrlr->connected_sqs, sq, tailq). Is vfio_user_handle_intr_wrapper() guaranteed to execute after handle_queue_connect_rsp() has returned? I did test your patch and it seems to work.
Also, will this work with multiple threads/reactors in SPDK?
> guaranteed to execute after handle_queue_connect_rsp() has returned

Sure, we're not going to get pre-empted :)
> Also, will this work with multiple threads/reactors in SPDK?

Right now, yes, since we keep all of a controller's queues on the same poll group in interrupt mode. Later on, we will need to steer the message to the reactor that has the particular queue, I suppose.
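Something like SPDK's thread messaging could do that steering later on. A rough sketch, with the owner-thread lookup deliberately left open since how a queue maps to its poll group's thread isn't decided in this discussion:

#include "spdk/thread.h"

/* Runs on the thread that owns the queue's poll group. */
static void
doorbell_catchup_msg(void *ctx)
{
	struct nvmf_vfio_user_ctrlr *vu_ctrlr = ctx;

	vfio_user_handle_intr(vu_ctrlr);
}

/* 'owner' would be the spdk_thread of the poll group hosting the queue;
 * how to look it up is left open here. */
static void
steer_doorbell_catchup(struct spdk_thread *owner, struct nvmf_vfio_user_ctrlr *vu_ctrlr)
{
	spdk_thread_send_msg(owner, doorbell_catchup_msg, vu_ctrlr);
}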
If vfio_user_handle_intr(vu_ctrlr) is required in handle_queue_connect_rsp, I think you can just call it whenever interrupt mode is enabled, regardless of the migration status.
@changpe1 while trying to fix live migration with shadow doorbells I ran into a problem where a request gets stuck. I used dd(1) with one outstanding I/O and then migrated the VM; the request timed out at the destination. It turns out that this bug is unrelated to shadow doorbells: it happens because interrupt mode is enabled, which I confirmed by reverting John's patch. Here's an example:
Looks like a race condition where the BAR0 doorbell is written before the queue is set up?