Seth5141 closed this issue 4 years ago.
@Seth5141, https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_51590.html looks like a check_format failure, not a hang issue.
Another instance of this failure. Reported by @Seth5141. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_51761.html
Another instance of this failure. Reported by @ksztyber. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_52005.html
Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_52332.html
Bug scrub: assigned to @Seth5141
I triaged the three confirmed cases of this failure today. It looks like they actually fall into three different categories.
@Seth5141
I'm working on refining bdevperf now, and I'm seeing the following intermittent issues.
Could you take a look at the log and share your thoughts on what's wrong? I suspect my own patch caused this issue, though.
Starting SPDK v20.01-pre git sha1 7463a5b518 / DPDK 19.08.0 initialization...
[ DPDK EAL parameters: bdevperf --no-shconf -c 0x1 --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid3795757 ]
EAL: No available hugepages reported in hugepages-1048576kB
EAL: VFIO support initialized
app.c: 642:spdk_app_start: *NOTICE*: Total cores available: 1
reactor.c: 346:_spdk_reactor_run: *NOTICE*: Reactor started on core 0
Running I/O for 10 seconds...
ctrlr.c: 199:ctrlr_add_qpair_and_update_rsp: *ERROR*: Got I/O connect with duplicate QID 1
nvme_fabric.c: 395:nvme_fabric_qpair_connect: *ERROR*: Connect command failed
nvme_rdma.c:1072:nvme_rdma_qpair_connect: *ERROR*: Failed to send an NVMe-oF Fabric CONNECT command
nvme_ctrlr.c: 363:spdk_nvme_ctrlr_alloc_io_qpair: *ERROR*: nvme_transport_ctrlr_create_io_qpair() failed
Hi @Seth5141
I changed only spdk_event_call to spdk_for_each_channel (spdk_thread_send_msg) in bdevperf, but I found that this change caused frequent failures for some unknown reason. So please ignore my last comment; it was not related to this issue.
I believe it might be related, so please review https://review.gerrithub.io/c/spdk/spdk/+/478903.
Regards, Andrey
Hi @ak-gh
Thank you for the patch. I was wrong: I had thought the issue I was seeing already existed, but it turned out I had created it with my own patches. So I think we should move this discussion from here to GerritHub.
I think your idea is similar to Jin's https://github.com/spdk/spdk/tree/master/examples/nvmf/nvmf and the patch series starting at https://review.gerrithub.io/c/spdk/spdk/+/468657/21.
On the other hand, what I want to pursue for bdevperf now is to use the existing reactor threads and I/O device framework, as the SPDK NVMe-oF target and SPDK iSCSI target do.
As long as the SPDK NVMe-oF target and iSCSI target use reactor threads and the I/O device framework, it will be helpful for us to have a handy tool that uses the same framework. My idea is an alternative to Jin's and yours.
I'm fine with having two flavors of bdevperf: one that creates new lightweight bdevperf threads, and one that uses the reactor threads.
I will be out of the office for most of next week, so my responses will be delayed, but your patch has an important idea and I will definitely review it.
Thanks, Shuhei
On the other hand, the issue I had seen went away after inserting a one-second sleep after bdevperf ran. I will ask the experts for their ideas.
Hi @ak-gh
I believe your patch and my patches will be able to work well together. I feel your patch matches https://trello.com/c/SeKQ0IFU/145-remove-assumption-from-applications-that-spdkthreads-pre-exist-maciek-tomek-vitaliy. On the other hand, my patches will try to use more of the thread APIs in bdevperf. I want to correct my previous comment.
-Shuhei
Hi Shuhei,
I have no problem with merging patches if you wish. Please feel free to combine mine and yours as you see fit.
Regards, Andrey
@shuhei-matsumoto I believe that your changes are fine. After looking through them, I believe that your changes simply changed the timing on the initiator side. The qpair cleanup on the target side is asynchronous so we have always been racing to clean up the old qpairs before initiating new connections from the second instance of bdevperf. The only reason the tests fail is because bdevperf doesn't delay and attempt to reconnect later if we get a connect reject event. I think that your current solution (adding a sleep before TC2) is the right one.
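To picture the delay-and-retry Seth describes, a minimal initiator-side sketch could look like the following. Only spdk_nvme_ctrlr_alloc_io_qpair is an actual SPDK call; the helper, the retry count, and the delay are made up for illustration:

```c
#include <stddef.h>
#include <unistd.h>

#include "spdk/nvme.h"

/* Hypothetical helper: instead of failing bdevperf immediately when the
 * target rejects the connect (e.g. because the old qpair with the same QID
 * has not been torn down yet), back off briefly and try again a few times. */
static struct spdk_nvme_qpair *
alloc_io_qpair_with_retry(struct spdk_nvme_ctrlr *ctrlr)
{
	struct spdk_nvme_qpair *qpair = NULL;
	int retries = 5;

	while (retries-- > 0) {
		/* NULL/0 means "use the default I/O qpair options". */
		qpair = spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
		if (qpair != NULL) {
			break;
		}
		/* Give the target time to finish cleaning up the old qpair. */
		usleep(100 * 1000);
	}
	return qpair;
}
```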
I saw the #3 case with the TCP transport.
in test/nvmf/target/shutdown.sh:89 -> nvmf_shutdown_tc3()
...
nvme_qpair.c: 475:spdk_nvme_qpair_process_completions: ERROR: CQ error, abort requests after transport retry counter exceeded
nvme_ctrlr.c: 677:nvme_ctrlr_fail: ERROR: ctrlr 127.0.0.1 in failed state.
nvme_ctrlr.c:1072:spdk_nvme_ctrlr_reset: NOTICE: resetting controller
posix.c: 338:spdk_posix_sock_create: ERROR: connect() failed, errno = 111
nvme_tcp.c:1618:nvme_tcp_qpair_connect: ERROR: sock connection error of tqpair=0x14a0090 with addr=127.0.0.1, port=4420
nvme_ctrlr.c:1094:spdk_nvme_ctrlr_reset: ERROR: Controller reinitialization failed.
nvme_ctrlr.c: 677:nvme_ctrlr_fail: ERROR: ctrlr 127.0.0.1 in failed state.
bdev_nvme.c: 305:_bdev_nvme_reset_complete: ERROR: Resetting controller failed.
84 trap 'process_shm --id $NVMF_APP_SHM_ID; kill -9 $perfpid || true; nvmftestfini; exit 1' SIGINT SIGTERM EXIT
Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_54821.html
Thank you @Seth5141 for your analysis!
Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_55102.html
Another instance of this failure. Reported by @jimharris. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_55191.html
Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_55352.html
Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_55438.html
Another instance of this failure. Reported by @peluse. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_56307.html
Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58568.html
Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58564.html
Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58566.html
Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58686.html
Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58765.html
Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_58715.html
Instance with RDMA on vs-dpdk job: https://ci.spdk.io/results/autotest-spdk-19-10-vs-dpdk-master/builds/1256/archive/nvmf_autotest_nightly/build.log
Another instance of this failure. Reported by @shuhei-matsumoto. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_59934.html
Another instance of this failure. Reported by @tomzawadzki. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_60387.html
Another instance of this failure. Reported by @benlwalker. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_60323.html
Another instance of this failure. Reported by @AlekseyMarchuk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_60113.html
Another instance of this failure. Reported by @wawryk. log: https://dqtibwqq6s6ux.cloudfront.net/public_build/autotest-per-patch_3281.html
All of the most recent failures are due to this signature:
app.c: 642:spdk_app_start: NOTICE: Total cores available: 1
reactor.c: 346:_spdk_reactor_run: NOTICE: Reactor started on core 0
Running I/O for 10 seconds...
ctrlr.c: 199:ctrlr_add_qpair_and_update_rsp: ERROR: Got I/O connect with duplicate QID 1
nvme_fabric.c: 395:nvme_fabric_qpair_connect: ERROR: Connect command failed
nvme_rdma.c:1072:nvme_rdma_qpair_connect: ERROR: Failed to send an NVMe-oF Fabric CONNECT command
nvme_ctrlr.c: 363:spdk_nvme_ctrlr_alloc_io_qpair: ERROR: nvme_transport_ctrlr_create_io_qpair() failed
We already have some random sleeps in the test script which I think are trying to work around this problem. See the two sleeps in nvmf_shutdown_tc1. I'm thinking that if we remove those sleeps in a test patch, we'll see this issue occur much more frequently, and it will be easier to reproduce locally and fix.
Hi @jimharris @Seth5141 and all,
If I remember correctly, removing the delay after nvmf_shutdown_tc1 reproduced this issue almost every time.
Another solution we discussed before would be to fix the bdevperf tool. However, we have seen this issue with other I/O tools too, so we would need to add the fix at another layer as well.
So Jim's proposal sounds reasonable, but it may be better to discuss the direction of the fix before removing the delay.
Chatted with @benlwalker on this issue offline. We think there could be an issue with cntlid generation which would cause the new process to connect to the controller from the old process before the target fully cleaned it up. I've added some instrumentation to my test patch and will run it through the test pool a few times to hopefully reproduce the failure. (Test patch passed first time after removing the sleeps.)
Hi @jimharris Thanks for sharing your current thought and work.
May I confirm my understanding? The SPDK NVMe-oF target uses the dynamic controller model, but there may be cases where an available controller is not allocated. So the idea is to ensure for the host that an available controller is allocated?
We always specify cntlid = 0xFFFF in the connect capsule for the admin queue. This tells the target to allocate a new cntlid.
So assume the first process starts:
admin queue connect (qid = 0, cntlid = 0xFFFF) - target allocates cntlid = 1
io queue connect (qid = 1, cntlid = 1) - success
Then the first process exits and a second process starts. The sequence should be:
admin queue connect (cntlid = 0xFFFF) - target allocates cntlid = 2
io queue connect (qid = 1, cntlid = 2) - success
but sometimes this second io queue connect fails - it says that QID 1 is already allocated. The suspicion is that when this fails, it is because an incorrect cntlid was returned for the second admin queue connect. But it's not clear yet. So the instrumentation will give us more insight into whether the failure is due to a corrupted cntlid and indicate what to pursue next.
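For readers less familiar with the fabrics connect flow, the difference between the two connect capsules in the sequence above roughly comes down to the qid in the connect command and the cntlid in the connect data. The structs below come from spdk/nvmf_spec.h; the two helper functions are only a sketch for illustration:

```c
#include <stdint.h>

#include "spdk/nvmf_spec.h"

/* Admin queue connect: qid 0 and cntlid 0xFFFF ask the target to allocate a
 * new (dynamic) controller and return its cntlid in the response. */
static void
fill_admin_connect(struct spdk_nvmf_fabric_connect_cmd *cmd,
		   struct spdk_nvmf_fabric_connect_data *data)
{
	cmd->qid = 0;
	data->cntlid = 0xFFFF;
}

/* I/O queue connect: qid 1 plus the cntlid that came back from the admin
 * connect, so the target attaches the new queue to the right controller. */
static void
fill_io_connect(struct spdk_nvmf_fabric_connect_cmd *cmd,
		struct spdk_nvmf_fabric_connect_data *data,
		uint16_t cntlid_from_admin_rsp)
{
	cmd->qid = 1;
	data->cntlid = cntlid_from_admin_rsp;
}
```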
Another instance of this failure. Reported by @benlwalker. log: https://ci.spdk.io/public_build/autotest-per-patch_5145.html
Hi @jimharris,
For your consideration: cntlid may be a possibility, but in the code cntlid is generated by logic on a single thread (and cntlid is increased linearly in a loop). I think it may instead be related to qpair_mask, which is not handled on a single thread. If you search for "qpair_mask" in the code base, it seems that not all updates happen on the same thread, "ctrlr->thread". The following patch may guarantee this: https://review.spdk.io/gerrit/c/spdk/spdk/+/1226
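To make the suggestion concrete, the gist of serializing qpair_mask updates onto ctrlr->thread could be sketched as below. qpair_mask and ctrlr->thread are the fields named above (struct spdk_nvmf_ctrlr is the target's internal controller struct); the helper itself is hypothetical and is not the actual patch:

```c
#include <stdint.h>
#include <stdlib.h>

#include "spdk/bit_array.h"
#include "spdk/thread.h"

struct clear_qid_ctx {
	struct spdk_nvmf_ctrlr *ctrlr;	/* target-side controller */
	uint16_t qid;			/* queue ID to release */
};

static void
_clear_qid_on_ctrlr_thread(void *arg)
{
	struct clear_qid_ctx *ctx = arg;

	/* Runs on ctrlr->thread, so it is the only writer of qpair_mask. */
	spdk_bit_array_clear(ctx->ctrlr->qpair_mask, ctx->qid);
	free(ctx);
}

static void
clear_qid(struct spdk_nvmf_ctrlr *ctrlr, uint16_t qid)
{
	struct clear_qid_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx == NULL) {
		return;
	}
	ctx->ctrlr = ctrlr;
	ctx->qid = qid;
	/* Defer the bit-array update to the controller's owning thread. */
	spdk_thread_send_msg(ctrlr->thread, _clear_qid_on_ctrlr_thread, ctx);
}
```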
Another instance of this failure. Reported by @glodfish-zg. log: https://ci.spdk.io/public_build/autotest-per-patch_6054.html
Thanks Ziye. I'm still trying to reproduce the original failure with this patch:
https://review.spdk.io/gerrit/c/spdk/spdk/+/1046
It removes the sleeps between test cases, which we think should make the problem occur more frequently. The problem is that when the second process connects, it should get a new cntlid, which means a new qpair_mask as well. The attached patch also adds some more debug prints around the generated cntlids - hopefully we can get that patch to fail, observe the extra logging, and then hopefully your patch is something that might fix it.
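The instrumentation would presumably boil down to logging both sides of the handshake, along these lines (hypothetical helpers, not the actual test patch; SPDK_NOTICELOG is the standard SPDK logging macro):

```c
#include <stdint.h>

#include "spdk/log.h"

/* Log the cntlid the target hands out on the admin connect... */
static void
log_admin_connect(uint16_t allocated_cntlid)
{
	SPDK_NOTICELOG("admin connect: allocated cntlid %u\n", allocated_cntlid);
}

/* ...and the qid/cntlid pair seen on each I/O connect, so a stale or
 * corrupted cntlid would show up directly in the autotest output. */
static void
log_io_connect(uint16_t qid, uint16_t cntlid)
{
	SPDK_NOTICELOG("I/O connect: qid %u, cntlid %u\n", qid, cntlid);
}
```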
Another instance of this failure. Reported by @glodfish-zg. log: https://ci.spdk.io/public_build/autotest-per-patch_6544.html
Another instance of this failure. Reported by @glodfish-zg. log: https://ci.spdk.io/public_build/autotest-per-patch_7082.html
Another instance of this failure. Reported by @tomzawadzki. log: https://ci.spdk.io/public_build/autotest-per-patch_10223.html
Another instance of this failure. Reported by @Seth5141. log: https://ci.spdk.io/public_build/autotest-per-patch_10545.html
Another instance of this failure. Reported by @Seth5141. log: https://ci.spdk.io/public_build/autotest-per-patch_10551.html
Another instance of this failure. Reported by @Seth5141. log: https://ci.spdk.io/public_build/autotest-per-patch_10667.html
I have this root caused and a patch series out to address the issue.
The problem is this. Historically, when we disconnected a qpair from the initiator, we simply freed all of the initiator-side memory and called it a day. In the case where we were freeing a qpair from the bdev layer, we went one step further and just called spdk_nvme_ctrlr_free_io_qpair directly, which calls disconnect and then immediately destroys the qpair. It also clears the qid bit in the controller.
This takes us to the target side. When the initiator clears all of its memory, we get an RDMA_CM_EVENT_DISCONNECTED message at some point (probably pretty soon after the initiator "disconnect"). We then proceed to tear down the qpair on the target side, eventually clearing the qid on the target side as well.
However, the target may only finish cleaning up its qid list after the initiator has already tried to connect another qpair with the same qid it just tore down. This race has always been there, but I think as bdevperf has been getting more efficient, it's been popping up more and more.
The fix is actually pretty simple. The rdma core library provides a function, rdma_disconnect, which can be called from both sides of the connection and causes an RDMA_CM_EVENT_DISCONNECTED message to be generated on the other side. So when disconnecting on the initiator, we can call this function and then wait for the target to disconnect. It will also call rdma_disconnect, notifying us that it's done, and we can feel safe that we are not racing with the target to clean up the qid lists.
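For reference, the shape of that handshake in terms of plain librdmacm calls is roughly the following. This is a simplified, blocking sketch rather than the actual (asynchronous) SPDK transport code; see the patch series below for the real implementation:

```c
#include <rdma/rdma_cma.h>

/* Initiator-side sketch: announce the disconnect to the target and wait for
 * the disconnect event to come back before tearing anything else down. */
static int
initiator_disconnect_and_wait(struct rdma_cm_id *cm_id,
			      struct rdma_event_channel *channel)
{
	struct rdma_cm_event *event;

	/* Tell the target we are going away instead of silently freeing our
	 * memory; this makes RDMA_CM_EVENT_DISCONNECTED show up on its side. */
	if (rdma_disconnect(cm_id) != 0) {
		return -1;
	}

	/* Wait until we see RDMA_CM_EVENT_DISCONNECTED ourselves, i.e. the
	 * target has started tearing down the qpair (and its QID entry), so a
	 * new connect with the same QID no longer races with that cleanup. */
	while (rdma_get_cm_event(channel, &event) == 0) {
		enum rdma_cm_event_type type = event->event;

		rdma_ack_cm_event(event);
		if (type == RDMA_CM_EVENT_DISCONNECTED) {
			break;
		}
	}

	return 0;
}
```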
Patch series here: https://review.spdk.io/gerrit/c/spdk/spdk/+/1879
Current Behavior
I hit this failure on one of my patches today (December 5, 2019).
Context (Environment including OS version, SPDK version, etc.)
nvmf_tcp_autotest