spdk / spdk

Storage Performance Development Kit
https://spdk.io/

bdev_nvme/pcie: Access to SQ/CQ during nvme ctrlr reset can brick certain drives #3088

Open mhae opened 1 year ago

mhae commented 1 year ago

Sighting report

Certain Samsung NVMe drives can brick if the SQ/CQ are still being accessed during a controller reset (EN 1->0).
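For reference, "EN 1->0" means clearing the Enable bit in the controller's CC register, which per the NVMe spec initiates a controller reset. A conceptual sketch using the register types from spdk/nvme_spec.h (illustration only, not the actual SPDK reset path):

#include "spdk/nvme_spec.h"

/* Illustration only: the EN 1->0 / 0->1 transitions behind a controller
 * reset. While this is in flight, the host is expected to leave the SQ/CQ
 * doorbells alone -- which is exactly what the affected drives assume. */
static void
controller_reset_concept(volatile union spdk_nvme_cc_register *cc,
                         volatile union spdk_nvme_csts_register *csts)
{
    cc->bits.en = 0;                  /* EN 1->0: controller starts its reset */
    while (csts->bits.rdy != 0) { }   /* wait for CSTS.RDY to clear */

    cc->bits.en = 1;                  /* EN 0->1: re-enable the controller */
    while (csts->bits.rdy != 1) { }   /* wait for CSTS.RDY to be set */
}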

In bdev_nvme.c we first disconnect the controller and then the IO queues:

static void
_bdev_nvme_reset_ctrlr(void *ctx)
{
    struct nvme_ctrlr *nvme_ctrlr = ctx;

    assert(nvme_ctrlr->resetting == true);
    assert(nvme_ctrlr->thread == spdk_get_thread());

    if (!spdk_nvme_ctrlr_is_fabrics(nvme_ctrlr->ctrlr)) {
        nvme_ctrlr_disconnect(nvme_ctrlr, bdev_nvme_reset_destroy_qpairs);
    } else {
        bdev_nvme_reset_destroy_qpairs(nvme_ctrlr);
    }
}

This means that during the controller reset (EN 1->0 and 0->1) the IO queues are still being polled. Some drives apparently have problems with this condition (and brick): they assume that the IO queues are stopped before a ctrlr reset.

It would be much safer to destroy the IO qpairs first and only then initiate the controller reset. Note that we also don't want to send the NVMe SQ/CQ deletion commands while a reset is in progress: these may time out if the drive is in trouble (so would the controller reset, but we don't want to run into additional timeouts for the SQ/CQ deletion commands).

Previously (around the 21.x releases), we used prepare_for_reset to deal with this behavior, but the setter for this field has since been removed.

Further, with the failover/multipath work the code underwent significant changes, so it's unclear how to fit the above into the current bdev_nvme flow.

Expected Behavior

When we initiate a controller reset in bdev_nvme, first stop IOs (destroy the IO qpairs without executing SQ/CQ deletion commands), then continue with the controller reset.

Possible Solution

Steps to Reproduce

Context (Environment including OS version, SPDK version, etc.)

SPDK 23.01

shuhei-matsumoto commented 1 year ago

@mhae

You mainly drove the recent changes for the PCIe transport :) Your new proposal is welcome. Will you explain a little more about how it differs from the following, i.e., just removing the if-else?

static void
_bdev_nvme_reset_ctrlr(void *ctx)
{
    struct nvme_ctrlr *nvme_ctrlr = ctx;

    assert(nvme_ctrlr->resetting == true);
    assert(nvme_ctrlr->thread == spdk_get_thread());

    bdev_nvme_reset_destroy_qpairs(nvme_ctrlr);
}
mhae commented 1 year ago

@shuhei-matsumoto , thanks for the quick reply and your insight!

I think your proposal will work (at least for 23.01.1 - I'd need your guidance for top of tree). I had to make these changes:

static void
bdev_nvme_reset_ctrlr(struct spdk_io_channel_iter *i, int status)
{
    struct nvme_ctrlr *nvme_ctrlr = spdk_io_channel_iter_get_io_device(i);

    SPDK_DTRACE_PROBE1(bdev_nvme_ctrlr_reset, nvme_ctrlr->nbdev_ctrlr->name);
    assert(status == 0);

    // Old
    // if (!spdk_nvme_ctrlr_is_fabrics(nvme_ctrlr->ctrlr)) {
    //  bdev_nvme_reconnect_ctrlr(nvme_ctrlr);
    // } else {
    //  nvme_ctrlr_disconnect(nvme_ctrlr, bdev_nvme_reconnect_ctrlr);
    // }
    // New
    nvme_ctrlr_disconnect(nvme_ctrlr, bdev_nvme_reconnect_ctrlr);
}

static void
_bdev_nvme_reset(void *ctx)
{
    struct nvme_ctrlr *nvme_ctrlr = ctx;

    SPDK_NOTICELOG("??? enter\n");

    assert(nvme_ctrlr->resetting == true);
    assert(nvme_ctrlr->thread == spdk_get_thread());

    // Old
    // if (!spdk_nvme_ctrlr_is_fabrics(nvme_ctrlr->ctrlr)) {
    //  nvme_ctrlr_disconnect(nvme_ctrlr, bdev_nvme_reset_destroy_qpairs);
    // } else {
    //  bdev_nvme_reset_destroy_qpairs(nvme_ctrlr);
    // }
    // New
    spdk_nvme_ctrlr_prepare_for_reset(nvme_ctrlr->ctrlr);
    bdev_nvme_reset_destroy_qpairs(nvme_ctrlr);
}

I had to bring back the deprecated call spdk_nvme_ctrlr_prepare_for_reset to set ctrlr->prepare_for_reset. Without it, we issue SQ/CQ deletion NVMe commands to the device, which may block if the drive is unresponsive. With prepare_for_reset set, that step is skipped; it isn't needed anyway, because we are resetting the controller and therefore don't have to delete the queues explicitly.
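For context, this is roughly what that flag short-circuits inside the PCIe transport. The following is a paraphrase (from memory of the v22.01-era nvme_pcie code), not a verbatim excerpt:

/* Approximate paraphrase of the v22.01-era PCIe transport teardown path,
 * showing what prepare_for_reset skips; not a verbatim excerpt. */
static void
pcie_disconnect_qpair_sketch(struct spdk_nvme_ctrlr *ctrlr,
                             struct spdk_nvme_qpair *qpair)
{
    if (ctrlr->prepare_for_reset) {
        /* A controller reset is imminent and will destroy all queues
         * anyway, so skip the Delete I/O SQ/CQ admin commands, which
         * can block forever on an unresponsive drive. */
        return;
    }

    /* Normal path: issue the Delete I/O CQ and Delete I/O SQ admin
     * commands and wait for their completions before freeing the qpair. */
}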

The nvme_ctrlr_disconnect call is needed as well so that a full ctrlr init is performed. Without it, we run into an issue where we manually abort commands but later, in the controller init phase, call spdk_nvme_qpair_process_completions, which processes the previously aborted commands.

I want to do some more testing with these changes on 23.01 but so far so good.

Now, the bigger question is how to fix it for top of tree. I saw that there were quite a few changes compared to 23.01. We'd also have to bring back the prepare_for_reset call...

mhae commented 1 year ago

@shuhei-matsumoto , another question... Could it be that we are not manually completing the PCIe trackers after the ctrlr is disabled (via nvme_transport_qpair_abort_reqs)? I can't seem to find a place where this happens for the adminq.

shuhei-matsumoto commented 1 year ago

@mhae The change is only about function names, so fixing it on master is as easy as on 23.01. But your proposal means that we revert to v22.01:

static void
_bdev_nvme_reset(void *ctx)
{
    struct nvme_ctrlr *nvme_ctrlr = ctx;

    assert(nvme_ctrlr->resetting == true);
    assert(nvme_ctrlr->thread == spdk_get_thread());

    spdk_nvme_ctrlr_prepare_for_reset(nvme_ctrlr->ctrlr);

    /* First, delete all NVMe I/O queue pairs. */
    spdk_for_each_channel(nvme_ctrlr,
                  bdev_nvme_reset_destroy_qpair,
                  NULL,
                  bdev_nvme_reset_ctrlr);
}

I have not investigated in detail, but is this reversion safe? Do we lose any bug fixes by reverting?

mhae commented 1 year ago

@shuhei-matsumoto, I'm not aware of any bug fixes we'd be losing. I've done quite a bit of testing and didn't see any issues, except for that problem of not aborting all outstanding requests when we're in the disabled state. I was mainly worried about multipathing, but the changes I'm proposing should not impact it (hopefully we have functional tests that would catch issues).

Regarding top of tree: I see 4 instances where we have this sequence:

if (!spdk_nvme_ctrlr_is_fabrics(nvme_ctrlr->ctrlr)) {
    bdev_nvme_reconnect_ctrlr(nvme_ctrlr);
} else {
    nvme_ctrlr_disconnect(nvme_ctrlr, bdev_nvme_reconnect_ctrlr);
}

bdev_nvme_disable_destroy_qpairs_done, _bdev_nvme_disconnect_and_disable_ctrlr, bdev_nvme_reset_destroy_qpair_done, _bdev_nvme_reset_ctrlr.

I'm not 100% sure about _bdev_nvme_disconnect_and_disable_ctrlr, but it seems that removing the spdk_nvme_ctrlr_is_fabrics check should be fine there as well.

Would it make sense to raise a PR?

shuhei-matsumoto commented 1 year ago

@mhae, thank you for your testing and code inspection.

https://github.com/spdk/spdk/issues/2360 was the reason for distinguishing fabrics from non-fabrics, via https://github.com/spdk/spdk/commit/fcf52fbff58345f52665dfd627488836df4ef755

Will you check these? If they look OK, it is enough to push patches to Gerrit.

Thanks.

mhae commented 1 year ago

@shuhei-matsumoto , sorry for the late reply. Thanks for reminding me of that original issue... There were actually two problems back then: (1) nvme_transport_qpair_abort_reqs was called too early in nvme_qpair_abort_requests (21.04); (2) the SQ/CQ delete was done first, but that can wait forever when the drive is having issues, so this step was skipped via prepare_for_reset.

Now, the latest issue with the current code (disable the ctrlr, then destroy/reconnect the qpairs) is that we touch the SQ/CQ during the ctrlr reset, which apparently some drives don't like. I agree that changing the code back to what I proposed is not ideal, given (2).

I wonder if we could somehow pause the SQ/CQ processing before we start the ctrlr disconnect. Any suggestions?
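To make the idea concrete, something along these lines might work; note that "paused" is hypothetical, not an existing SPDK or bdev_nvme field:

#include "spdk/nvme.h"

/* Hypothetical sketch -- the "paused" flag does not exist in SPDK today.
 * While a qpair is paused, the poller neither submits I/O nor polls for
 * completions, so no SQ/CQ access can race with the controller reset. */
struct example_qpair_state {
    struct spdk_nvme_qpair *qpair;
    bool paused;
};

static int
example_poll_qpair(struct example_qpair_state *state)
{
    if (state->paused) {
        return 0;    /* skip all SQ/CQ access during the reset */
    }

    return spdk_nvme_qpair_process_completions(state->qpair, 0);
}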

shuhei-matsumoto commented 1 year ago

Hi @mhae,

Thank you for your kind and detailed reply. Then, would the following be better, if possible? 1) delete the SQ/CQ for the qpairs; 2) disable the controller; 3) abort the I/Os for the qpairs.

However, is #2360 only a potential issue? How likely is #2360 to occur? Can we revert to v22.01?

Please share which change is better for you.

mhae commented 1 year ago

Hi @shuhei-matsumoto, thanks for continuing to brainstorm.

I'm not sure about reverting.

The sequence that would be best for dealing with #2360 and a potentially unresponsive drive would be something like this (see the sketch below):

1. Pause the IO qpairs. No submit/completion processing is done in this state.
2. Reset the ctrlr (disable/enable).
3. Manually abort the IOs on the paused IO qpairs and reconnect them if the ctrlr reset was successful.
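A hedged outline of that flow; pause_io_qpairs() and example_reset_done() are hypothetical helpers, not existing bdev_nvme functions:

/* Hedged outline only: pause_io_qpairs() and example_reset_done() are
 * hypothetical helpers, not existing bdev_nvme functions. */
static void
example_reset_sequence(struct nvme_ctrlr *nvme_ctrlr)
{
    /* 1. Pause the IO qpairs: no submission or completion processing,
     *    so nothing touches the SQ/CQ while the reset is in flight. */
    pause_io_qpairs(nvme_ctrlr);

    /* 2. Reset the ctrlr (EN 1->0, wait for RDY=0; EN 0->1, wait for
     *    RDY=1) via the existing disconnect/reconnect machinery. */
    nvme_ctrlr_disconnect(nvme_ctrlr, example_reset_done);

    /* 3. In example_reset_done(): manually abort the outstanding requests
     *    on the paused qpairs, then reconnect them if the reset succeeded. */
}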

Going through the SQ/CQ deletion process does not work when the drive is having issues (today we have a blocking wait, which would tie up the core; this was addressed with prepare_for_reset).

One of the issues is that when we disconnect a qpair, we issue the SQ/CQ delete commands and manually complete the trackers at the end. As mentioned, prepare_for_reset skips the first part.

I was wondering if we could remove the qpair from the poll group processing in bdev_nvme?

shuhei-matsumoto commented 1 year ago

Hi @mhae,

Thank you for sharing your additional thought and idea.

I was wondering if we could remove the qpair from the poll group processing in bdev_nvme?

Pausing the I/O qpair in the poll group would be better, because we need the poll group to disconnect and delete the I/O qpair.

Then, I think asynchronous SQ/CQ deletion with a timeout would be better than prepare_for_reset, but async SQ/CQ deletion is not easy; see https://review.spdk.io/gerrit/c/spdk/spdk/+/18509
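To illustrate the "with a timeout" part using the public raw-admin API (sketch only; the transport itself uses internal helpers, and a real fix would be fully event-driven rather than polling inline):

#include <errno.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

/* Sketch only: bound the wait for a Delete I/O SQ command instead of
 * blocking forever on an unresponsive drive. The flag is file-scope so a
 * completion arriving after the timeout can't write to a dead stack frame. */
static bool g_delete_done;

static void
delete_sq_done(void *cb_arg, const struct spdk_nvme_cpl *cpl)
{
    g_delete_done = true;
}

static int
delete_io_sq_with_timeout(struct spdk_nvme_ctrlr *ctrlr, uint16_t qid,
                          uint64_t timeout_ticks)
{
    struct spdk_nvme_cmd cmd = {0};
    uint64_t deadline = spdk_get_ticks() + timeout_ticks;

    g_delete_done = false;
    cmd.opc = SPDK_NVME_OPC_DELETE_IO_SQ;
    cmd.cdw10 = qid;    /* QID of the submission queue to delete */

    if (spdk_nvme_ctrlr_cmd_admin_raw(ctrlr, &cmd, NULL, 0,
                                      delete_sq_done, NULL) != 0) {
        return -EIO;
    }

    while (!g_delete_done) {
        spdk_nvme_ctrlr_process_admin_completions(ctrlr);
        if (spdk_get_ticks() > deadline) {
            /* Give up; the controller reset destroys the queues
             * anyway, so fall through to the reset path. */
            return -ETIMEDOUT;
        }
    }

    return 0;
}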

Hence, I agree with reverting to prepare_for_reset for now.

I will try creating and sharing a patch to pause an I/O qpair in a poll group. But it's fine if you'd rather do it yourself.

shuhei-matsumoto commented 1 year ago

Hi @mhae, I uploaded a patch to pause a qpair in a poll group: https://review.spdk.io/gerrit/c/spdk/spdk/+/19465
Will you check the patch and try your idea?

mhae commented 1 year ago

@shuhei-matsumoto , thanks so much for that patch. It seems to pause the queues nicely.

I haven't had a chance to do stress testing yet, but I will do that tomorrow.

With your change and some changes to the reset logic, I now have the following reset sequence:

I need to run some stress tests to see what shakes out. Thanks again for the patch.

tomzawadzki commented 1 year ago

[Bug scrub] We are gearing up for the SPDK 23.09 release. Is the patch proposed by @shuhei-matsumoto the right solution for you, and is it required to be included in the next release?

mhae commented 1 year ago

Hi Tomasz, no, it's not ready yet. This patch doesn't need to be included in the next release.
