spdk / spdk

Storage Performance Development Kit
https://spdk.io/

[nvmf/vfio-user] request gets stuck when live migrating in interrupt mode #2410

Closed tmakatos closed 2 years ago

tmakatos commented 2 years ago

@changpe1 while trying to fix live migration with shadow doorbells I ran into a problem where a request gets stuck. I used dd(1) with one outstanding I/O and then migrated the VM; the request timed out at the destination. It turns out that this bug is unrelated to shadow doorbells: it happens because interrupt mode is enabled, which I confirmed by reverting John's patch. Here's an example:

[2022-03-02 17:54:51.328995] vfio_user.c:2155:vfio_user_log: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: migration: transition from state resuming to state running
...
[2022-03-02 17:54:51.329428] vfio_user.c:3932:nvmf_vfio_user_poll_group_add: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: add QP5=0x7f9e883dcc20(0x7f9e883dcc20) to poll_group=0x7f9e88014330
[2022-03-02 17:54:51.329440] vfio_user.c:3966:nvmf_vfio_user_poll_group_add: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: sending connect fabrics command for QID=0x5 cntlid=0x1
[2022-03-02 17:54:51.329452] ctrlr.c: 699:_nvmf_ctrlr_connect: *DEBUG*: recfmt 0x0 qid 5 sqsize 255
...
# advance CQ5
[2022-03-02 17:54:51.333645] vfio_user.c:2155:vfio_user_log: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: W 0 0x102c-0x1030
[2022-03-02 17:54:51.333660] vfio_user.c:2100:access_bar0_fn: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: bar0 write ctrlr: 0x7f9e88022b30, count=4, pos=102C
...
# submit to SQ5
[2022-03-02 17:54:51.335325] vfio_user.c:2155:vfio_user_log: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: W 0 0x1028-0x102c
[2022-03-02 17:54:51.335343] vfio_user.c:2100:access_bar0_fn: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: bar0 write ctrlr: 0x7f9e88022b30, count=4, pos=1028
...
[2022-03-02 17:54:51.340111] nvme_qpair.c: 454:spdk_nvme_print_completion: *NOTICE*: SUCCESS (00/00) qid:5 cid:0 cdw0:1 sqhd:0000 p:0 m:0 dnr:0
...
# read CSTS
[2022-03-02 17:55:22.161272] vfio_user.c:2155:vfio_user_log: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: R 0 0x1c-0x20
[2022-03-02 17:55:22.161304] vfio_user.c:2100:access_bar0_fn: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: bar0 read ctrlr: 0x7f9e88022b30, count=4, pos=1C
[2022-03-02 17:55:22.161318] ctrlr.c:1328:nvmf_property_get: *DEBUG*: size 0, offset 0x1c
[2022-03-02 17:55:22.161328] ctrlr.c:1352:nvmf_property_get: *DEBUG*: name: csts
[2022-03-02 17:55:22.161335] ctrlr.c:1370:nvmf_property_get: *DEBUG*: response value: 0x1
[2022-03-02 17:55:22.161371] nvme_qpair.c: 454:spdk_nvme_print_completion: *NOTICE*: SUCCESS (00/00) qid:0 cid:0 cdw0:1 sqhd:0000 p:0 m:0 dnr:0
# handle cid 97 after 30 secs
[2022-03-02 17:55:22.161564] vfio_user.c:4250:handle_cmd_req: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: handle qid5, req opc=0x2 cid=97
[2022-03-02 17:55:22.161583] vfio_user.c:2155:vfio_user_log: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: map 0x21b805000-0x21b806000
[2022-03-02 17:55:22.161599] nvme_qpair.c: 270:nvme_io_qpair_print_command: *NOTICE*: READ sqid:5 cid:97 nsid:1 lba:160352 len:8 PRP1 0x21b805000 PRP2 0x0
[2022-03-02 17:55:22.161692] vfio_user.c:2155:vfio_user_log: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: W 0 0x1000-0x1004
# attempt to abort cid 97
[2022-03-02 17:55:22.161707] vfio_user.c:2100:access_bar0_fn: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: bar0 write ctrlr: 0x7f9e88022b30, count=4, pos=1000
[2022-03-02 17:55:22.161739] vfio_user.c:4250:handle_cmd_req: *DEBUG*: /var/tmp/vfio-user//3803b229-7cad-4624-9e35-cc890023deae/nvme/3803b229-7cad-4624-9e35-cc890023deae: handle qid0, req opc=0x8 cid=23
[2022-03-02 17:55:22.161754] nvme_qpair.c: 250:nvme_admin_qpair_print_command: *NOTICE*: ABORT (08) qid:0 cid:23 nsid:0 cdw10:00610005 cdw11:00000000

Looks like a race condition where the BAR0 doorbell is written before the queue is set up?

changpe1 commented 2 years ago

I think existing live migration can't work in interrupt mode, because we don't restore interrupt-related settings in the destination VM.

tmakatos commented 2 years ago

I've edited the log excerpt above. The first line ("transition from state resuming to state running") shows that the controller is in running state; however, it hasn't set up the queues etc. yet. I've added additional debugging and I can see nvmf_vfio_user_poll_group_poll getting called, but it doesn't do anything because of this:

/* SQs that are not yet active (or have zero size) are skipped here,
 * so any doorbell writes they received are not processed */
if (spdk_unlikely(sq->sq_state != VFIO_USER_SQ_ACTIVE || !sq->size)) {
    continue;
}

Ideally we should not return from the migration callback until the controller is up and running; however, that's not currently possible because the callbacks cannot return -EBUSY. Alternatively, we could leave it as is and, once the controller is up and running, re-check the doorbells in case we missed anything, but I'm worried we could miss other events. @jlevon thoughts?

tmakatos commented 2 years ago

I'm looking at the code, specifically at vfio_user_migr_ctrlr_enable_sqs:

if (nvmf_qpair_is_admin_queue(&sq->qpair)) {
    /* ADMIN queue pair is always in the poll group, just enable it */
    sq->sq_state = VFIO_USER_SQ_ACTIVE;
} else {
    /* I/O SQs reconnect asynchronously; handle_queue_connect_rsp() runs on completion */
    spdk_nvmf_tgt_new_qpair(vu_ctrlr->transport->transport.tgt, &sq->qpair);
}

It doesn't look like we're going to miss events for the admin queue, so if we make sure that the last I/O queue to connect checks the doorbells, we might avoid this problem. It still feels like a hack, though.

Taking a step back, I think expecting the device to switch instantly from resuming state to running state is a bit too much to ask. Maybe we could make the transition callback return -EBUSY and introduce a transition_done callback? This way nvmf/vfio-user can properly resume and avoid all sorts of hacks. The infrastructure is already there for the quiesce callback, so we might as well reuse it.
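
For illustration, a rough sketch of how that could look, modeled on the existing quiesce flow (where the quiesce callback may return -EBUSY and the device later calls vfu_device_quiesced()). vfu_migr_transition_done() and vfio_user_migr_resume_done() are hypothetical names that don't exist today:

/*
 * Hypothetical sketch only: the transition callback returns -EBUSY for the
 * resuming -> running switch and the device signals completion later, the
 * same way the quiesce callback pairs with vfu_device_quiesced().
 */
static int
vfio_user_migration_device_state_transition(vfu_ctx_t *vfu_ctx, vfu_migr_state_t state)
{
    struct nvmf_vfio_user_endpoint *endpoint = vfu_get_private(vfu_ctx);

    if (state == VFU_MIGR_STATE_RUNNING) {
        /* kick off qpair reconnection; completion is asynchronous */
        vfio_user_migr_ctrlr_enable_sqs(endpoint->ctrlr);
        return -EBUSY; /* transition still in flight */
    }

    /* handle the remaining states as the code does today */
    return 0;
}

/* called once the last I/O qpair has connected and the doorbells were re-checked */
static void
vfio_user_migr_resume_done(struct nvmf_vfio_user_endpoint *endpoint)
{
    vfu_migr_transition_done(endpoint->vfu_ctx, 0); /* hypothetical API */
}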

tmakatos commented 2 years ago

This hack seems to fix the problem:

diff --git a/lib/nvmf/vfio_user.c b/lib/nvmf/vfio_user.c
index b0339174b..2cd526097 100644
--- a/lib/nvmf/vfio_user.c
+++ b/lib/nvmf/vfio_user.c
@@ -378,6 +378,8 @@ struct nvmf_vfio_user_ctrlr {

        /* internal CSTS.CFS register for vfio-user fatal errors */
        uint32_t                                cfs : 1;
+
+       int qpairs;
 };

 struct nvmf_vfio_user_endpoint {
@@ -3264,6 +3267,10 @@ vfio_user_migr_ctrlr_construct_qps(struct nvmf_vfio_user_ctrlr *vu_ctrlr,
                                return -EFAULT;
                        }
                        cqs_ref[sq->cqid]++;
+
+                       if (sqid != 0) {
+                               vu_ctrlr->qpairs++;
+                       }
                }
        }

@@ -4503,6 +4510,7 @@ handle_queue_connect_rsp(struct nvmf_vfio_user_req *req, void *cb_arg)
        struct nvmf_vfio_user_cq *cq;
        struct nvmf_vfio_user_ctrlr *vu_ctrlr;
        struct nvmf_vfio_user_endpoint *endpoint;
+       bool check_doorbells = false;

        assert(sq != NULL);
        assert(req != NULL);
@@ -4578,6 +4586,10 @@ handle_queue_connect_rsp(struct nvmf_vfio_user_req *req, void *cb_arg)
                                                sq->create_io_sq_cmd.cid, SPDK_NVME_SC_SUCCESS, SPDK_NVME_SCT_GENERIC);
                        }
                        sq->post_create_io_sq_completion = false;
+               } else {
+                       if (--vu_ctrlr->qpairs == 0) {
+                               check_doorbells = true;
+                       }
                }
                sq->sq_state = VFIO_USER_SQ_ACTIVE;
        }
@@ -4588,6 +4600,10 @@ handle_queue_connect_rsp(struct nvmf_vfio_user_req *req, void *cb_arg)
        free(req->req.data);
        req->req.data = NULL;

+       if (check_doorbells) {
+               vfio_user_handle_intr(vu_ctrlr);
+       }
+
        return 0;
 }

jlevon commented 2 years ago

I actually don't think it's a hack; it's the right thing to do: we have an async state, and we need to "catch up" on the transition.

But I'd pull in self_kick() from the shadow doorbell patch for this, then do:

@@ -4578,6 +4586,10 @@ handle_queue_connect_rsp(struct nvmf_vfio_user_req *req, void *cb_arg)
                                                sq->create_io_sq_cmd.cid, SPDK_NVME_SC_SUCCESS, SPDK_NVME_SCT_GENERIC);
                        }
                        sq->post_create_io_sq_completion = false;
+               } else {
+                       /* ctrlr was already running: it might have got
+                        * BAR0 doorbell writes we missed, so catch up */
+                       self_kick(vu_ctrlr);
                }
                sq->sq_state = VFIO_USER_SQ_ACTIVE;
        }
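
A minimal sketch of what self_kick() could look like, assuming the shadow doorbell patch implements it by writing the eventfd that backs the controller's interrupt-mode poller (vu_ctrlr->intr_fd is an assumed field name):

#include <sys/eventfd.h>

/*
 * Sketch only: writing the interrupt eventfd makes the interrupt-mode
 * poller run on a later reactor pass, i.e. strictly after the current
 * callback has returned.
 */
static void
self_kick(struct nvmf_vfio_user_ctrlr *vu_ctrlr)
{
    eventfd_write(vu_ctrlr->intr_fd, 1);
}
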
tmakatos commented 2 years ago

We're calling self_kick() before sq->sq_state = VFIO_USER_SQ_ACTIVE; and TAILQ_INSERT_TAIL(&vu_ctrlr->connected_sqs, sq, tailq);. Is vfio_user_handle_intr_wrapper() guaranteed to execute after handle_queue_connect_rsp() has returned? I did test your patch and it seems to work.

Also, will this work with multiple threads/reactors in SPDK?

jlevon commented 2 years ago

guaranteed to execute after handle_queue_connect_rsp() has returned

Sure, we're not going to get pre-empted :)

Also, will this work with multiple threads/reactors in SPDK?

Right now, yes, since we keep all of a controller's queues on the same poll group in interrupt mode. Later on, we will need to steer the message to the reactor that has the particular queue, I suppose.
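
A sketch of what that steering could look like; spdk_thread_send_msg() is the standard SPDK primitive for running a function on another thread, while the sq->qpair.group->thread lookup is an assumption for illustration:

/* sketch: run vfio_user_handle_intr() on the thread that owns the queue's poll group */
static void
_kick_cb(void *ctx)
{
    vfio_user_handle_intr(ctx);
}

static void
kick_queue_thread(struct nvmf_vfio_user_sq *sq, struct nvmf_vfio_user_ctrlr *vu_ctrlr)
{
    /* the group->thread lookup path is assumed for this sketch */
    spdk_thread_send_msg(sq->qpair.group->thread, _kick_cb, vu_ctrlr);
}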

changpe1 commented 2 years ago

If vfio_user_handle_intr(vu_ctrlr) is required in handle_queue_connect_rsp, I think you can just call it whenever interrupt mode is enabled, regardless of the migration status.
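
That is, something along these lines at the end of handle_queue_connect_rsp(), where in_interrupt_mode() stands in for whatever helper checks the transport's interrupt mode (an assumed name):

/* sketch: catch up on doorbells after any queue connects, whenever
 * interrupt mode is on, not only on the resume path */
if (in_interrupt_mode(vu_ctrlr->transport)) {
    vfio_user_handle_intr(vu_ctrlr);
}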