spdk / spdk

Storage Performance Development Kit
https://spdk.io/
Other
3.06k stars 1.19k forks source link

[bdev_bounds] bdev_nvme_failover_ctrlr_unsafe: Reset is already in progress #3206

Closed tomzawadzki closed 7 months ago

tomzawadzki commented 10 months ago

CI Intermittent Failure

This failure occurs intermittently once ASAN is enabled on nvme-phy-autotest job.

ASAN points to accel_sequence_complete_tasks(), but the issue might lie with failover being triggered as well:

bdev_nvme.c:2814:bdev_nvme_failover_ctrlr_unsafe: *NOTICE*: Reset is already in progress. Defer failover until reset completes

Another log from crypto-phy-autotest would confirm that the issue might the due to failover.

Wasn't able to reproduce it on my own system and after adding more logging (here) the issue did not yet reproduce.

Link to the failed CI build

https://ci.spdk.io/results/autotest-per-patch/builds/113176/archive/nvme-phy-autotest_57397/build.log https://ci.spdk.io/results/autotest-per-patch/builds/113351/archive/crypto-phy-autotest_11625/build.log

Execution failed at

00:13:19.764    Test: blockdev write zeroes read split partial ...passed
00:13:19.764    Test: blockdev reset ...[2023-12-01 17:21:41.097329] nvme_ctrlr.c:1638:nvme_ctrlr_disconnect: *NOTICE*: [0000:d9:00.0] resetting controller
00:13:19.764  [2023-12-01 17:21:41.107805] bdev_nvme.c:2814:bdev_nvme_failover_ctrlr_unsafe: *NOTICE*: Reset is already in progress. Defer failover until reset completes.
00:13:23.047  [2023-12-01 17:21:44.543051] bdev_nvme.c:2026:_bdev_nvme_reset_ctrlr_complete: *NOTICE*: Resetting controller successful.
00:13:23.047  [2023-12-01 17:21:44.543103] nvme_ctrlr.c:1638:nvme_ctrlr_disconnect: *NOTICE*: [0000:d9:00.0] resetting controller
00:13:23.047  passed
00:13:23.047    Test: blockdev write read 8 blocks ...FAILED
00:13:23.047      1. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047    Test: blockdev write read size > 128k ...FAILED
00:13:23.047      1. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047    Test: blockdev write read invalid size ...passed
00:13:23.047    Test: blockdev write read offset + nbytes == size of blockdev ...FAILED
00:13:23.047      1. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047    Test: blockdev write read offset + nbytes > size of blockdev ...passed
00:13:23.047    Test: blockdev write read max offset ...passed
00:13:23.047    Test: blockdev write read 2 blocks on overlapped address offset ...FAILED
00:13:23.047      1. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      3. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      4. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047    Test: blockdev writev readv 8 blocks ...FAILED
00:13:23.047      1. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047    Test: blockdev writev readv 30 x 1block ...FAILED
00:13:23.047      1. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047    Test: blockdev writev readv block ...FAILED
00:13:23.047      1. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047    Test: blockdev writev readv size > 128k ...FAILED
00:13:23.047      1. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047    Test: blockdev writev readv size > 128k in two iovs ...FAILED
00:13:23.047      1. bdevio.c:505  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:512  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047    Test: blockdev comparev and writev ...FAILED
00:13:23.047      1. bdevio.c:544  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      2. bdevio.c:547  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      3. bdevio.c:553  - CU_ASSERT_EQUAL(g_completion_success,true)
00:13:23.047      4. bdevio.c:557  - CU_ASSERT_EQUAL(rc,0)
00:13:23.047    Test: blockdev nvme passthru rw ...AddressSanitizer:DEADLYSIGNAL
00:13:23.047  =================================================================
00:13:23.047  ==608422==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000200 (pc 0x7feebab723fb bp 0x7feeb5af82a0 sp 0x7feeb5af8260 T3)
00:13:23.047  ==608422==The signal is caused by a READ memory access.
00:13:23.047  ==608422==Hint: address points to the zero page.
00:13:23.047      #0 0x7feebab723fb in accel_sequence_complete_tasks /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/accel/accel.c:1204
00:13:23.047      #1 0x7feebab828c2 in spdk_accel_sequence_abort /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/accel/accel.c:1947
00:13:23.047      #2 0x7feebcea858a in bdev_nvme_io_complete /var/jenkins/workspace/nvme-phy-autotest/spdk/module/bdev/nvme/bdev_nvme.c:1420
00:13:23.047      #3 0x7feebcebb7a9 in bdev_nvme_submit_request /var/jenkins/workspace/nvme-phy-autotest/spdk/module/bdev/nvme/bdev_nvme.c:3077
00:13:23.047      #4 0x7feebae60d92 in bdev_submit_request /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/bdev/bdev.c:1430
00:13:23.047      #5 0x7feebae71832 in bdev_io_do_submit /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/bdev/bdev.c:2747
00:13:23.047      #6 0x7feebae7c46b in _bdev_io_submit /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/bdev/bdev.c:3442
00:13:23.047      #7 0x7feebae7e9c7 in bdev_io_submit /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/bdev/bdev.c:3551
00:13:23.047      #8 0x7feebaeaa2ee in spdk_bdev_nvme_io_passthru /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/bdev/bdev.c:6647
00:13:23.047      #9 0x40da01 in __blockdev_nvme_passthru /var/jenkins/workspace/nvme-phy-autotest/spdk/test/bdev/bdevio/bdevio.c:1084
00:13:23.047      #10 0x7feeba8618fa in msg_queue_run_batch /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/thread/thread.c:848
00:13:23.047      #11 0x7feeba866107 in thread_poll /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/thread/thread.c:1070
00:13:23.047      #12 0x7feeba8673e2 in spdk_thread_poll /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/thread/thread.c:1163
00:13:23.047      #13 0x7feebb1003e9 in _reactor_run /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/event/reactor.c:914
00:13:23.047      #14 0x7feebb100d21 in reactor_run /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/event/reactor.c:952
00:13:23.047      #15 0x7feeba302022 in eal_thread_loop ../lib/eal/common/eal_common_thread.c:212
00:13:23.047      #16 0x7feeba33d3f1 in eal_worker_thread_loop ../lib/eal/linux/eal.c:916
00:13:23.047      #17 0x7feeb93c212c in start_thread (/usr/lib64/libc.so.6+0x8b12c)
00:13:23.047      #18 0x7feeb9443bbf in __clone3 (/usr/lib64/libc.so.6+0x10cbbf)
00:13:23.047  
00:13:23.047  AddressSanitizer can not provide additional info.
00:13:23.047  SUMMARY: AddressSanitizer: SEGV /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/accel/accel.c:1204 in accel_sequence_complete_tasks
00:13:23.047  Thread T3 (reactor_2) created by T0 (reactor_0) here:
00:13:23.047      #0 0x7feebd1943e6 in __interceptor_pthread_create (/usr/lib64/libasan.so.8+0x4b3e6)
00:13:23.047      #1 0x7feeba33d621 in eal_worker_thread_create ../lib/eal/linux/eal.c:953
00:13:23.047      #2 0x7feeba33e689 in rte_eal_init ../lib/eal/linux/eal.c:1269
00:13:23.047      #3 0x7feebb2e782c in spdk_env_init /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/env_dpdk/init.c:610
00:13:23.047      #4 0x7feebb0e68dd in app_setup_env /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/event/app.c:385
00:13:23.047      #5 0x7feebb0efe0a in spdk_app_start /var/jenkins/workspace/nvme-phy-autotest/spdk/lib/event/app.c:770
00:13:23.047      #6 0x411310 in main /var/jenkins/workspace/nvme-phy-autotest/spdk/test/bdev/bdevio/bdevio.c:1562
00:13:23.047      #7 0x7feeb935e50f in __libc_start_call_main (/usr/lib64/libc.so.6+0x2750f)
00:13:23.047  
00:13:23.047  ==608422==ABORTING
spdkci commented 8 months ago

Another instance of this failure. Reported by @mmkayPL. log: https://ci.spdk.io/public_build/autotest-per-patch_116029.html

sberbz commented 8 months ago

Hi, here is a quick update on the issue I'm able to reproduce very similar behavior on my setup just by running test/bdev/blockdev.sh nvme and, most of the time, when failover happens, some of the further tests will fail. The probable reason for failure is a second reset that occurs during failover, and it's not recognized by the caller (the reset test ends with a callback sent after the first reset is done), so tests are continued even if the second reset is still in progress. I was not able to reproduce access to null ptr in accel_sequence_complete_tasks, but I have observed other issues reported by asan probably due to releasing qpair while resetting.

sberbz commented 8 months ago

Patch ready for review: https://review.spdk.io/gerrit/c/spdk/spdk/+/21901