spdk / spdk

Storage Performance Development Kit
https://spdk.io/

NVMe-oF shutdown test latent failures: I/O connect with duplicate QID #1096

Closed: Seth5141 closed this issue 4 years ago

Seth5141 commented 4 years ago

Current Behavior

I hit this failure on one of my patches today (December 5, 2019).

Context (Environment including OS version, SPDK version, etc.)

nvmf_tcp_autotest

yidong0635 commented 4 years ago

@Seth5141, https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_51590.html seems to be a check_format failure, not a hang issue.

spdkci commented 4 years ago

Another instance of this failure. Reported by @Seth5141. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_51761.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @ksztyber. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_52005.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_52332.html

jimharris commented 4 years ago

Bug scrub: assigned to @Seth5141

Seth5141 commented 4 years ago

I triaged the three confirmed cases of this failure today. It looks like they actually fall into three different categories:

  1. NVMe-oF target failures related to asynchronous IB message handling. This class of errors is handled by the solution to GitHub issue #1075. The failure at 51761 falls under this category.
  2. NVMe-oF target failures related to asynchronous CM event handling. This class of errors is closely related to #1075, but since it involves CM events, it isn't actually resolved by that solution. I have opened a new issue at #1110. The failure at 52005 falls under this category.
  3. Bdevperf failure due to a failed assert during shutdown. I am not sure exactly what is causing this error yet, so I am leaving it on this issue for the time being. The failure at 52332 falls under this category.

shuhei-matsumoto commented 4 years ago

@Seth5141

I'm working on refining bdevperf now, and I'm seeing the following intermittent issues.

https://dqtibwqq6s6ux.cloudfront.net/results/autotest-per-patch/builds/54280/archive/nvmf_phy_autotest/index.html

https://dqtibwqq6s6ux.cloudfront.net/results/autotest-per-patch/builds/54328/archive/nvmf_phy_autotest/index.html

Could you take a look at the logs and share your idea of what's wrong? I suspect my patch caused this issue, though.

Starting SPDK v20.01-pre git sha1 7463a5b518 / DPDK 19.08.0 initialization...
[ DPDK EAL parameters: bdevperf --no-shconf -c 0x1 --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid3795757 ]
EAL: No available hugepages reported in hugepages-1048576kB
EAL: VFIO support initialized
app.c: 642:spdk_app_start: *NOTICE*: Total cores available: 1
reactor.c: 346:_spdk_reactor_run: *NOTICE*: Reactor started on core 0
Running I/O for 10 seconds...
ctrlr.c: 199:ctrlr_add_qpair_and_update_rsp: *ERROR*: Got I/O connect with duplicate QID 1
nvme_fabric.c: 395:nvme_fabric_qpair_connect: *ERROR*: Connect command failed
nvme_rdma.c:1072:nvme_rdma_qpair_connect: *ERROR*: Failed to send an NVMe-oF Fabric CONNECT command
nvme_ctrlr.c: 363:spdk_nvme_ctrlr_alloc_io_qpair: *ERROR*: nvme_transport_ctrlr_create_io_qpair() failed
shuhei-matsumoto commented 4 years ago

Hi @Seth5141

I changed only spdk_event_call to spdk_for_each_channel (spdk_thread_send_msg) in bdevperf, but I found that this change caused frequent failures for some unknown reason. So please ignore my last comment. It was not related to this issue.

ak-gh commented 4 years ago

On Fri, Dec 27, 2019 at 8:07 AM shuhei-matsumoto wrote:

Hi @Seth5141

I changed only spdk_event_call to spdk_for_each_channel (spdk_thread_send_msg) in bdevperf but I found that this change caused frequent failures for some unknown reason.

I believe it might be related, so please review https://review.gerrithub.io/c/spdk/spdk/+/478903.

Regards, Andrey

shuhei-matsumoto commented 4 years ago

Hi @ak-gh

Thank you for the patch. I was wrong: I thought the issue I was seeing already existed, but I learned that I had created it with my own patches. So I think we should move this discussion from here to GerritHub.

I think your idea is similar to Jin's https://github.com/spdk/spdk/tree/master/examples/nvmf/nvmf and the patch series starting at https://review.gerrithub.io/c/spdk/spdk/+/468657/21.

On the other hand, what I want to pursue for bdevperf now is to use the existing reactor threads and the I/O device framework, as the SPDK NVMe-oF target and SPDK iSCSI target do.

As long as the SPDK NVMe-oF target and iSCSI target use reactor threads and the I/O device framework, it will be helpful for us to have a handy tool that uses the same framework. My idea is an alternative to Jin's and yours.

I'm fine if we have two types of bdevperf: one that creates new lightweight bdevperf threads, and one that uses the reactor threads.

I will be out of the office most of next week, so my response will be delayed, but your patch contains an important idea and I will definitely review it.

Thanks, Shuhei

shuhei-matsumoto commented 4 years ago

On the other hand, the issue I had seen went away after inserting a 1-second sleep after bdevperf ran. I will ask the experts for their ideas.

shuhei-matsumoto commented 4 years ago

Hi @ak-gh

I believe your patch and my patches will be able to work well together. I feel your patch matches https://trello.com/c/SeKQ0IFU/145-remove-assumption-from-applications-that-spdkthreads-pre-exist-maciek-tomek-vitaliy . On the other hand, my patches will try to use more thread APIs for bdevperf. I want to correct my previous comment.

-Shuhei

ak-gh commented 4 years ago

Hi Shuhei,

I have no problem with merging the patches if you wish. Please feel free to combine mine and yours as you see fit.

Regards, Andrey

Seth5141 commented 4 years ago

@shuhei-matsumoto I believe that your changes are fine. After looking through them, I believe they simply changed the timing on the initiator side. The qpair cleanup on the target side is asynchronous, so we have always been racing to clean up the old qpairs before initiating new connections from the second instance of bdevperf. The only reason the tests fail is that bdevperf doesn't delay and attempt to reconnect later when it gets a connect reject event. I think that your current solution (adding a sleep before TC2) is the right one.
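
As an illustration of the delay-and-reconnect behavior described above, here is a minimal C sketch against the public NVMe driver API; the helper name, retry count, and delay are all hypothetical, and this is not the actual bdevperf code or the fix that was chosen:

    #include <unistd.h>
    #include "spdk/nvme.h"

    /* Hypothetical helper: instead of failing the run on the first connect
     * reject, back off and retry, giving the target time to finish its
     * asynchronous cleanup of the previous qpair with the same QID. */
    static struct spdk_nvme_qpair *
    alloc_io_qpair_with_retry(struct spdk_nvme_ctrlr *ctrlr)
    {
        const int max_retries = 5;                    /* assumed limit */
        const useconds_t retry_delay_us = 100 * 1000; /* assumed backoff */
        struct spdk_nvme_qpair *qpair;

        for (int i = 0; i < max_retries; i++) {
            /* NULL/0 requests the default I/O qpair options. */
            qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
            if (qpair != NULL) {
                return qpair;
            }
            usleep(retry_delay_us);
        }
        return NULL;
    }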

shuhei-matsumoto commented 4 years ago

I saw a case of category #3 with the TCP transport.

https://dqtibwqq6s6ux.cloudfront.net/results/autotest-per-patch/builds/54821/archive/nvmf_tcp_autotest/build.log

in test/nvmf/target/shutdown.sh:89 -> nvmf_shutdown_tc3()
...
nvme_qpair.c: 475:spdk_nvme_qpair_process_completions: ERROR: CQ error, abort requests after transport retry counter exceeded
nvme_ctrlr.c: 677:nvme_ctrlr_fail: ERROR: ctrlr 127.0.0.1 in failed state.
nvme_ctrlr.c:1072:spdk_nvme_ctrlr_reset: NOTICE: resetting controller
posix.c: 338:spdk_posix_sock_create: ERROR: connect() failed, errno = 111
nvme_tcp.c:1618:nvme_tcp_qpair_connect: ERROR: sock connection error of tqpair=0x14a0090 with addr=127.0.0.1, port=4420
nvme_ctrlr.c:1094:spdk_nvme_ctrlr_reset: ERROR: Controller reinitialization failed.
nvme_ctrlr.c: 677:nvme_ctrlr_fail: ERROR: ctrlr 127.0.0.1 in failed state.
bdev_nvme.c: 305:_bdev_nvme_reset_complete: ERROR: Resetting controller failed.
84  trap 'process_shm --id $NVMF_APP_SHM_ID; kill -9 $perfpid || true; nvmftestfini; exit 1' SIGINT SIGTERM EXIT

spdkci commented 4 years ago

Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_54821.html

shuhei-matsumoto commented 4 years ago

Thank you @Seth5141 for your analysis!

spdkci commented 4 years ago

Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_55102.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @jimharris. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_55191.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_55352.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_55438.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @peluse. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_56307.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58568.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58564.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58566.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58686.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58765.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58715.html

Donaim commented 4 years ago

Instance with RDMA on vs-dpdk job: https://ci.spdk.io/results/autotest-spdk-19-10-vs-dpdk-master/builds/1256/archive/nvmf_autotest_nightly/build.log

spdkci commented 4 years ago

Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_59934.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @tomzawadzki. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_60387.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @benlwalker. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_60323.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @AlekseyMarchuk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_60113.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_3281.html

jimharris commented 4 years ago

All of the most recent failures have this signature:

app.c: 642:spdk_app_start: NOTICE: Total cores available: 1
reactor.c: 346:_spdk_reactor_run: NOTICE: Reactor started on core 0
Running I/O for 10 seconds...
ctrlr.c: 199:ctrlr_add_qpair_and_update_rsp: ERROR: Got I/O connect with duplicate QID 1
nvme_fabric.c: 395:nvme_fabric_qpair_connect: ERROR: Connect command failed
nvme_rdma.c:1072:nvme_rdma_qpair_connect: ERROR: Failed to send an NVMe-oF Fabric CONNECT command
nvme_ctrlr.c: 363:spdk_nvme_ctrlr_alloc_io_qpair: ERROR: nvme_transport_ctrlr_create_io_qpair() failed

We already have some random sleeps in the test script which I think are trying to work around this problem; see the two sleeps in nvmf_shutdown_tc1. I'm thinking that if we remove those sleeps in a test patch, we'll see this issue occur much more frequently, and it will be easier to reproduce locally and fix.

shuhei-matsumoto commented 4 years ago

Hi @jimharris @Seth5141 and all,

If I remember correctly, removing the delay after nvmf_shutdown_tc1 reproduced this issue almost every time.

Another solution we discussed before might be to fix the bdevperf tool. However, we have seen this issue in other I/O tools too, so we would need to add the fix at another layer.

So Jim's proposal sounds reasonable, but it may be better to discuss the direction of the fix before removing the delay.

jimharris commented 4 years ago

Chatted with @benlwalker about this issue offline. We think there could be an issue with cntlid generation that would cause the new process to connect to the old process's controller before the target fully cleaned it up. I've added some instrumentation to my test patch and will run it through the test pool a few times to hopefully reproduce the failure. (The test patch passed the first time after removing the sleeps.)

shuhei-matsumoto commented 4 years ago

Hi @jimharris Thanks for sharing your current thought and work.

May I confirm my understanding? The SPDK NVMe-oF target has used the dynamic controller model, but there may be cases where an available controller is not allocated. So the idea is to ensure that an available controller is allocated to the host?

jimharris commented 4 years ago

We always specify cntlid = 0xFFFF in the connect capsule for the admin queue. This tells the target to allocate a new cntlid.

So assume the first process starts:

admin queue connect (qid = 0, cntlid = 0xFFFF) - target allocates cntlid = 1
io queue connect (qid = 1, cntlid = 1) - success

Then the first process exits and a second process starts. The sequence should be:

admin queue connect (cntlid = 0xFFFF) - target allocates cntlid = 2
io queue connect (qid = 1, cntlid = 2) - success

But sometimes this second io queue connect fails - it says that QID 1 is already allocated. The suspicion is that when this happens, an incorrect cntlid was returned for the second admin queue connect, but it's not clear yet. The instrumentation will give us more insight into whether the failure is due to a corrupted cntlid and indicate what to pursue next.
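
For context, here is a sketch of the dynamic allocation scheme described above; this is illustrative only, not the actual ctrlr.c code, and allocate_cntlid, cntlid_in_use, and the bounds are assumed names:

    #include <stdbool.h>
    #include <stdint.h>

    #define MIN_CNTLID 1
    #define MAX_CNTLID 0xFFF0  /* assumed; 0xFFFF means "allocate for me" */

    /* Hypothetical lookup: is this cntlid already taken in the subsystem? */
    extern bool cntlid_in_use(uint16_t cntlid);

    /* The host sends cntlid = 0xFFFF in the admin connect capsule, and the
     * target hands back the next unused controller ID, wrapping around. */
    static uint16_t
    allocate_cntlid(uint16_t *next_cntlid)
    {
        for (uint32_t i = 0; i < MAX_CNTLID; i++) {
            uint16_t candidate = *next_cntlid;

            /* Advance, wrapping back to MIN_CNTLID and skipping 0. */
            *next_cntlid = (*next_cntlid % MAX_CNTLID) + 1;
            if (!cntlid_in_use(candidate)) {
                return candidate;
            }
        }
        return 0xFFFF; /* no free cntlid */
    }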

spdkci commented 4 years ago

Another instance of this failure. Reported by @benlwalker. log: https://ci.spdk.io/public_build/autotest-per-patch_5145.html

optimistyzy commented 4 years ago

Hi @jimharris,

Your consideration of cntlid may be a possibility, but in the code the cntlid is generated by logic on the same thread (and the cntlid is increased linearly in a loop). I think it may instead be related to qpair_mask, which is not handled on a single thread: if you search for "qpair_mask" in the code base, you'll find that it is not always handled by the controller's thread (ctrlr->thread). The following patch may guarantee this: https://review.spdk.io/gerrit/c/spdk/spdk/+/1226
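
A minimal sketch of the single-thread discipline that patch aims for, assuming a controller structure with a thread and qpair_mask (the struct, field, and function names here are illustrative, not the exact SPDK internals):

    #include "spdk/stdinc.h"
    #include "spdk/thread.h"
    #include "spdk/bit_array.h"

    /* Illustrative stand-in for the target-side controller state. */
    struct ctrlr_ctx {
        struct spdk_thread    *thread;     /* thread that owns qpair_mask */
        struct spdk_bit_array *qpair_mask; /* one bit per connected QID */
        uint16_t               qid;        /* QID being torn down */
    };

    /* Runs on ctrlr->thread, so this read-modify-write of qpair_mask can
     * never race with a concurrent connect handled on another thread. */
    static void
    clear_qid_on_ctrlr_thread(void *arg)
    {
        struct ctrlr_ctx *ctrlr = arg;

        spdk_bit_array_clear(ctrlr->qpair_mask, ctrlr->qid);
    }

    /* A thread tearing down a qpair forwards the mask update to the
     * controller's thread instead of touching the mask directly. */
    static void
    qpair_disconnected(struct ctrlr_ctx *ctrlr)
    {
        spdk_thread_send_msg(ctrlr->thread, clear_qid_on_ctrlr_thread, ctrlr);
    }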

spdkci commented 4 years ago

Another instance of this failure. Reported by @glodfish-zg. log: https://ci.spdk.io/public_build/autotest-per-patch_6054.html

jimharris commented 4 years ago

Thanks Ziye. I'm still trying to reproduce the original failure with this patch:

https://review.spdk.io/gerrit/c/spdk/spdk/+/1046

It removes the sleeps between test cases, which we think should make the problem occur more frequently. The point is that when the second process connects, it should get a new cntlid, and therefore a new qpair_mask as well. The attached patch also adds some more debug prints around the generated cntlids - hopefully we can get that patch to fail, observe the extra logging, and then see whether your patch is something that might fix it.
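
The instrumentation being described might look something like this (a guess at its shape, not the actual contents of the test patch; the variable names are illustrative):

    #include "spdk/stdinc.h"
    #include "spdk/log.h"

    /* Hypothetical debug print at the point where the target answers an
     * admin queue connect, so a failing CI run shows exactly which cntlid
     * was handed back to the host. */
    static void
    log_cntlid_decision(uint16_t requested_cntlid, uint16_t allocated_cntlid)
    {
        SPDK_ERRLOG("admin connect: host requested cntlid 0x%x, "
                    "target allocated cntlid %u\n",
                    requested_cntlid, allocated_cntlid);
    }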

spdkci commented 4 years ago

Another instance of this failure. Reported by @glodfish-zg. log: https://ci.spdk.io/public_build/autotest-per-patch_6544.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @glodfish-zg. log: https://ci.spdk.io/public_build/autotest-per-patch_7082.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @tomzawadzki. log: https://ci.spdk.io/public_build/autotest-per-patch_10223.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @Seth5141. log: https://ci.spdk.io/public_build/autotest-per-patch_10545.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @Seth5141. log: https://ci.spdk.io/public_build/autotest-per-patch_10551.html

spdkci commented 4 years ago

Another instance of this failure. Reported by @Seth5141. log: https://ci.spdk.io/public_build/autotest-per-patch_10667.html

Seth5141 commented 4 years ago

I have root-caused this, and a patch series is out to address the issue.

The problem is this. Historically, when we disconnected a qpair from the initiator, we simply freed all of the initiator-side memory and called it a day. In the case where we were freeing a qpair from the bdev layer, we went one step further and just called spdk_nvme_ctrlr_free_io_qpair directly, which calls disconnect and then immediately destroys the qpair. It also clears the QID bit on the controller side.

This takes us to the target side. When the initiator frees all of its memory, we get an RDMA_CM_EVENT_DISCONNECTED message at some point (probably pretty soon after the initiator "disconnect"). We then proceed to tear down the qpair on the target side, eventually clearing the QID on the target side as well.

However, the target may not get around to cleaning up its QID list until after the initiator has already tried to connect another qpair with the same QID it just disconnected. This race has always been there, but I think as bdevperf has been getting more efficient, it's been popping up more and more.

The fix is actually pretty simple. The rdma core library provides a function, rdma_disconnect, which can be called from either side of the connection and causes an RDMA_CM_EVENT_DISCONNECTED message to be generated on the other side. So when disconnecting on the initiator, we can call this function and then wait for the target to disconnect. The target will also call rdma_disconnect, notifying us that it's done, and we can feel safe that we are not racing with the target to clean up the QID lists.
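
For reference, here is a minimal sketch of that handshake on the initiator side using the librdmacm calls named above; error handling and integration with SPDK's asynchronous CM event polling are omitted, and the function is a hypothetical simplification, not the actual patch:

    #include <rdma/rdma_cma.h>

    /* Instead of freeing the qpair immediately, send our half of the
     * disconnect and wait until the target's own rdma_disconnect() produces
     * a DISCONNECTED event, so both sides have observed the teardown before
     * the initiator can reuse the QID. */
    static int
    disconnect_and_wait(struct rdma_cm_id *cm_id, struct rdma_event_channel *channel)
    {
        struct rdma_cm_event *event;

        if (rdma_disconnect(cm_id) != 0) {
            return -1;
        }

        /* Blocking wait shown for simplicity; the real transport polls the
         * event channel from its poller instead of blocking. */
        while (rdma_get_cm_event(channel, &event) == 0) {
            enum rdma_cm_event_type type = event->event;

            rdma_ack_cm_event(event);
            if (type == RDMA_CM_EVENT_DISCONNECTED) {
                return 0;
            }
        }
        return -1;
    }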

Patch series here: https://review.spdk.io/gerrit/c/spdk/spdk/+/1879