open-iscsi / tcmu-runner

A daemon that handles the userspace side of the LIO TCM-User backstore.
Apache License 2.0
189 stars 149 forks source link

LIO/kernel crash when aborting commands #586

Open Talangor opened 5 years ago

Talangor commented 5 years ago

Hi

I think its due to my legacy hardware(hp dl380 g6 bare metal environment) but I felt to share it

I have luminous ceph deployed and my rbd pool has tiering enabled

When I try to test iscsi with Vmware I get these errors and lost connection to datastore when it reconnects io drops dramatically

I suspect hardware like CPU and Ram Problem or motherboard and firmware

And going to change to g7 or g8 servers but I like to hear your thoughts about it

uname -a Linux ceph-client01 5.2.4-1.el7.elrepo.x86_64 #1 SMP Sun Jul 28 08:10:57 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux

/etc/redhat-release CentOS Linux release 7.6.1810 (Core)

o- / ......................................................................................................................... [...] o- cluster ......................................................................................................... [Clusters: 1] | o- ceph ............................................................................................................ [HEALTH_OK] | o- pools .......................................................................................................... [Pools: 8] | | o- .rgw.root ............................................................. [(x3), Commit: 0.00Y/615960064M (0%), Used: 768K] | | o- default.rgw.buckets.index ............................................ [(x3), Commit: 0.00Y/615960064M (0%), Used: 0.00Y] | | o- default.rgw.control .................................................. [(x3), Commit: 0.00Y/615960064M (0%), Used: 0.00Y] | | o- default.rgw.log ...................................................... [(x3), Commit: 0.00Y/615960064M (0%), Used: 0.00Y] | | o- default.rgw.meta ..................................................... [(x3), Commit: 0.00Y/615960064M (0%), Used: 1920K] | | o- nfs-ganesha ........................................................... [(x3), Commit: 0.00Y/615960064M (0%), Used: 192K] | | o- rbd ............................................................... [(x3), Commit: 4.0T/615960064M (0%), Used: 42070464K] | | o- rbd-cache .................................................... [(x3), Commit: 0.00Y/7085489664K (0%), Used: 98739535950b] | o- topology .............................................................................................. [OSDs: 216,MONs: 3] o- disks ........................................................................................................ [4.0T, Disks: 1] | o- rbd ............................................................................................................ [rbd (4.0T)] | o- disk_1 ................................................................................................ [rbd/disk_1 (4.0T)] o- iscsi-targets ............................................................................... [DiscoveryAuth: None, Targets: 1] o- iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw ..................................................................... [Gateways: 2] o- disks .......................................................................................................... [Disks: 1] | o- rbd/disk_1 ....................................................................................... [Owner: ceph-client01] o- gateways ............................................................................................ [Up: 2/2, Portals: 2] | o- ceph-client01 ....................................................................................... [10.47.84.185 (UP)] | o- ceph-client02 ....................................................................................... [10.47.84.186 (UP)] o- host-groups .................................................................................................. [Groups : 0] o- hosts .............................................................................................. [Hosts: 1: Auth: CHAP] o- iqn.1994-05.com.redhat:rh7-client ............................................... [LOGGED-IN, Auth: CHAP, Disks: 1(4.0T)] o- lun 0 ........................................................................ [rbd/disk_1(4.0T), Owner: ceph-client01]

Aug 6 16:08:34 ceph-client01 kernel: perf: interrupt took too long (2549 > 2500), lowering kernel.perf_event_max_sample_rate to 78000

Aug 6 16:10:53 ceph-client01 yum[12476]: Installed: htop-2.2.0-3.el7.x86_64

Aug 6 16:11:16 ceph-client01 kernel: perf: interrupt took too long (3740 > 3186), lowering kernel.perf_event_max_sample_rate to 53000

Aug 6 16:12:08 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:08 ceph-client01 tcmu-runner: tcmu_notify_conn_lost:201 rbd/rbd.disk_1: Handler connection lost (lock state 1)

Aug 6 16:12:09 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:09 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:11 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:11 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:12 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:12 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:14 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:15 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:15 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:15 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:15 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:16 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:16 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:16 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:17 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:17 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:18 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:18 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:18 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:19 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:19 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:12:19 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

And

Aug 6 16:21:38 ceph-client01 tcmu-runner: alua_implicit_transition:566 rbd/rbd.disk_1: Starting lock acquisition operation.

Aug 6 16:21:38 ceph-client01 tcmu-runner: tcmu_rbd_lock:762 rbd/rbd.disk_1: Acquired exclusive lock.

Aug 6 16:21:43 ceph-client01 systemd: Stopping LIO Userspace-passthrough daemon...

Aug 6 16:21:43 ceph-client01 tcmu-runner: main:1390: Exiting...

Aug 6 16:21:52 ceph-client01 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 282676

Aug 6 16:22:02 ceph-client01 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 282798

Aug 6 16:22:08 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:22:08 ceph-client01 tcmu-runner: tcmu_notify_conn_lost:201 rbd/rbd.disk_1: Handler connection lost (lock state 1)

Aug 6 16:22:08 ceph-client01 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 282676

Aug 6 16:22:08 ceph-client01 tcmu-runner: tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

Aug 6 16:22:08 ceph-client01 kernel: WARNING: CPU: 4 PID: 11043 at drivers/target/target_core_transport.c:761 target_handle_abort+0x193/0x1b0 [target_core_mod]

Aug 6 16:22:08 ceph-client01 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 282798

Aug 6 16:22:08 ceph-client01 kernel: Modules linked in: target_core_pscsi target_core_file target_core_iblock target_core_user uio bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_umad rdma_cm ib_cm iw_cm gpio_ich iTCO_wdt iTCO_vendor_support sunrpc intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate input_leds pcspkr hpwdt hpilo ocrdma ib_uverbs lpc_ich mfd_core ib_core i7core_edac ipmi_si ipmi_devintf ipmi_msghandler sg acpi_power_meter pcc_cpufreq ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi sd_mod crc32c_intel serio_raw radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix bnx2 libata lpfc nvmet_fc nvmet nvme_fc nvme_fabrics scsi_transport_fc be2net hpsa scsi_transport_sas

Aug 6 16:22:08 ceph-client01 kernel: CPU: 4 PID: 11043 Comm: kworker/4:1 Tainted: G I 5.2.4-1.el7.elrepo.x86_64 #1

Aug 6 16:22:08 ceph-client01 kernel: Hardware name: HP ProLiant DL380 G6, BIOS P62 08/16/2015

Aug 6 16:22:08 ceph-client01 kernel: Workqueue: target_completion target_abort_work [target_core_mod]

Aug 6 16:22:08 ceph-client01 kernel: RIP: 0010:target_handle_abort+0x193/0x1b0 [target_core_mod]

Aug 6 16:22:08 ceph-client01 kernel: Code: c8 00 00 00 e8 ee 9c 39 e1 e9 ed fe ff ff 48 8b 73 70 48 89 df 31 c9 89 c2 e8 69 ee ff ff 5b 41 5c 41 5d 5d c3 e8 3d f3 ff ff <0f> 0b e9 ca fe ff ff 0f 0b 0f 1f 40 00 e9 cd fe ff ff 90 66 2e 0f

Aug 6 16:22:08 ceph-client01 kernel: RSP: 0018:ffffc9000e7b3e30 EFLAGS: 00010282

Aug 6 16:22:08 ceph-client01 kernel: RAX: 0000606020407d68 RBX: ffff889fb03c4770 RCX: 0000000000000000

Aug 6 16:22:08 ceph-client01 kernel: RDX: ffff889fdf62bf20 RSI: 0000000000000282 RDI: 0000000000000282

Aug 6 16:22:08 ceph-client01 kernel: RBP: ffffc9000e7b3e48 R08: 0000000000000001 R09: ffffffc3fa805e79

Aug 6 16:22:08 ceph-client01 kernel: R10: ffff889fd3ecd9c0 R11: 0000000000000018 R12: 0000000000000000

Aug 6 16:22:08 ceph-client01 kernel: R13: 0000000000400000 R14: ffffe8efffc86800 R15: ffff889fb03c48a8

Aug 6 16:22:08 ceph-client01 kernel: FS: 0000000000000000(0000) GS:ffff888fdf880000(0000) knlGS:0000000000000000

Aug 6 16:22:08 ceph-client01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Aug 6 16:22:08 ceph-client01 kernel: CR2: 00007fa2818d2f98 CR3: 0000000fd6be4000 CR4: 00000000000006e0

Aug 6 16:22:08 ceph-client01 kernel: Call Trace:

Aug 6 16:22:08 ceph-client01 kernel: target_abort_work+0x15/0x20 [target_core_mod]

Aug 6 16:22:08 ceph-client01 kernel: process_one_work+0x179/0x3a0

Aug 6 16:22:08 ceph-client01 kernel: worker_thread+0x4f/0x3e0

Aug 6 16:22:08 ceph-client01 kernel: kthread+0x105/0x140

Aug 6 16:22:08 ceph-client01 kernel: ? max_active_store+0x80/0x80

Aug 6 16:22:08 ceph-client01 kernel: ? kthread_bind+0x20/0x20

Aug 6 16:22:08 ceph-client01 kernel: ret_from_fork+0x35/0x40

Aug 6 16:22:08 ceph-client01 kernel: ---[ end trace 64125901f2b9de84 ]---

Aug 6 16:22:08 ceph-client01 kernel: WARNING: CPU: 4 PID: 11043 at drivers/target/target_core_transport.c:769 target_handle_abort+0x19a/0x1b0 [target_core_mod]

Aug 6 16:22:08 ceph-client01 kernel: Modules linked in: target_core_pscsi target_core_file target_core_iblock target_core_user uio bonding rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_umad rdma_cm ib_cm iw_cm gpio_ich iTCO_wdt iTCO_vendor_support sunrpc intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate input_leds pcspkr hpwdt hpilo ocrdma ib_uverbs lpc_ich mfd_core ib_core i7core_edac ipmi_si ipmi_devintf ipmi_msghandler sg acpi_power_meter pcc_cpufreq ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi sd_mod crc32c_intel serio_raw radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix bnx2 libata lpfc nvmet_fc nvmet nvme_fc nvme_fabrics scsi_transport_fc be2net hpsa scsi_transport_sas

Aug 6 16:22:08 ceph-client01 kernel: CPU: 4 PID: 11043 Comm: kworker/4:1 Tainted: G W I 5.2.4-1.el7.elrepo.x86_64 #1

Aug 6 16:22:08 ceph-client01 kernel: Hardware name: HP ProLiant DL380 G6, BIOS P62 08/16/2015

Aug 6 16:22:08 ceph-client01 kernel: Workqueue: target_completion target_abort_work [target_core_mod]

Aug 6 16:22:08 ceph-client01 kernel: RIP: 0010:target_handle_abort+0x19a/0x1b0 [target_core_mod]

Aug 6 16:22:08 ceph-client01 kernel: Code: 39 e1 e9 ed fe ff ff 48 8b 73 70 48 89 df 31 c9 89 c2 e8 69 ee ff ff 5b 41 5c 41 5d 5d c3 e8 3d f3 ff ff 0f 0b e9 ca fe ff ff <0f> 0b 0f 1f 40 00 e9 cd fe ff ff 90 66 2e 0f 1f 84 00 00 00 00 00

Aug 6 16:22:08 ceph-client01 kernel: RSP: 0018:ffffc9000e7b3e30 EFLAGS: 00010246

Aug 6 16:22:08 ceph-client01 kernel: RAX: 0000000000000000 RBX: ffff889fb03c4770 RCX: 0000000000000000

Aug 6 16:22:08 ceph-client01 kernel: RDX: ffff889fdf62bf20 RSI: 0000000000000282 RDI: 0000000000000282

Aug 6 16:22:08 ceph-client01 kernel: RBP: ffffc9000e7b3e48 R08: 0000000000000001 R09: ffffffc3fa805e79

Aug 6 16:22:08 ceph-client01 kernel: R10: ffff889fd3ecd9c0 R11: 0000000000000018 R12: 0000000000000000

Aug 6 16:22:08 ceph-client01 kernel: R13: 0000000000400000 R14: ffffe8efffc86800 R15: ffff889fb03c48a8

Aug 6 16:22:08 ceph-client01 kernel: FS: 0000000000000000(0000) GS:ffff888fdf880000(0000) knlGS:0000000000000000

Aug 6 16:22:08 ceph-client01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Aug 6 16:22:08 ceph-client01 kernel: CR2: 00007fa2818d2f98 CR3: 0000000fd6be4000 CR4: 00000000000006e0

Aug 6 16:22:08 ceph-client01 kernel: Call Trace:

Aug 6 16:22:08 ceph-client01 kernel: target_abort_work+0x15/0x20 [target_core_mod]

Aug 6 16:22:08 ceph-client01 kernel: process_one_work+0x179/0x3a0

Aug 6 16:22:08 ceph-client01 kernel: worker_thread+0x4f/0x3e0

Aug 6 16:22:08 ceph-client01 kernel: kthread+0x105/0x140

Aug 6 16:22:08 ceph-client01 kernel: ? max_active_store+0x80/0x80

Aug 6 16:22:08 ceph-client01 kernel: ? kthread_bind+0x20/0x20

Aug 6 16:22:08 ceph-client01 kernel: ret_from_fork+0x35/0x40

Aug 6 16:22:08 ceph-client01 kernel: ---[ end trace 64125901f2b9de85 ]---

Aug 6 16:22:08 ceph-client01 tcmu-runner: tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

Aug 6 16:22:08 ceph-client01 tcmu-runner: tgt_port_grp_recovery_thread_fn:257 rbd/rbd.disk_1: Could not reinitialize device. (err -5).

Aug 6 16:22:08 ceph-client01 tcmu-runner: tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

Aug 6 16:22:09 ceph-client01 systemd: Stopped LIO Userspace-passthrough daemon.

Aug 6 16:22:09 ceph-client01 systemd: Starting LIO Userspace-passthrough daemon...

Aug 6 16:22:09 ceph-client01 tcmu-runner: log file path now is '/var/log/tcmu-runner.log'

Tcmu-runner

2019-08-06 16:02:00.316 9092 [INFO] alua_implicit_transition:566 rbd/rbd.disk_1: Starting lock acquisition operation.

2019-08-06 16:02:00.361 9092 [WARN] tcmu_rbd_lock:762 rbd/rbd.disk_1: Acquired exclusive lock.

2019-08-06 16:12:08.933 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:08.933 9092 [ERROR] tcmu_notify_conn_lost:201 rbd/rbd.disk_1: Handler connection lost (lock state 1)

2019-08-06 16:12:09.143 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:09.746 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:11.680 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:11.909 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:12.544 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:12.812 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:14.444 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:15.061 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:15.168 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:15.540 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:15.552 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:16.488 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:16.873 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:16.948 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:17.064 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:17.676 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:18.026 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:18.865 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:18.874 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:19.362 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:19.421 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:19.775 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:19.776 9092 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:12:19.997 9092 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:12:24.569 9092 [INFO] alua_implicit_transition:566 rbd/rbd.disk_1: Starting lock acquisition operation.

2019-08-06 16:12:24.581 9092 [WARN] tcmu_rbd_lock:762 rbd/rbd.disk_1: Acquired exclusive lock.

2019-08-06 16:12:54.887 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:12:54.887 9092 [ERROR] tcmu_notify_conn_lost:201 rbd/rbd.disk_1: Handler connection lost (lock state 1)

2019-08-06 16:12:54.889 9092 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:12:55.060 9092 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:12:58.674 9092 [INFO] alua_implicit_transition:566 rbd/rbd.disk_1: Starting lock acquisition operation.

2019-08-06 16:12:58.686 9092 [WARN] tcmu_rbd_lock:762 rbd/rbd.disk_1: Acquired exclusive lock.

2019-08-06 16:18:19.045 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:19.045 9092 [ERROR] tcmu_notify_conn_lost:201 rbd/rbd.disk_1: Handler connection lost (lock state 1)

2019-08-06 16:18:19.100 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:21.046 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:21.407 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:22.123 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:22.415 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:22.891 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:24.561 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:25.296 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:25.833 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:28.740 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:29.387 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:30.617 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:31.206 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:31.820 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:32.131 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:32.857 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:33.265 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:33.283 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:34.230 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:35.101 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:35.211 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:36.461 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:37.695 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:37.714 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:38.219 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:38.634 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:40.622 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:41.685 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:42.547 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:42.922 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:44.712 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:18:44.713 9092 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:18:44.913 9092 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:18:49.742 9092 [INFO] alua_implicit_transition:566 rbd/rbd.disk_1: Starting lock acquisition operation.

2019-08-06 16:18:49.754 9092 [WARN] tcmu_rbd_lock:762 rbd/rbd.disk_1: Acquired exclusive lock.

2019-08-06 16:21:20.370 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:20.370 9092 [ERROR] tcmu_notify_conn_lost:201 rbd/rbd.disk_1: Handler connection lost (lock state 1)

2019-08-06 16:21:25.464 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:26.530 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:27.181 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:27.199 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:29.023 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:30.348 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:32.126 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:32.569 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:33.767 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:34.276 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:21:34.277 9092 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:21:34.458 9092 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:21:38.219 9092 [INFO] alua_implicit_transition:566 rbd/rbd.disk_1: Starting lock acquisition operation.

2019-08-06 16:21:38.230 9092 [WARN] tcmu_rbd_lock:762 rbd/rbd.disk_1: Acquired exclusive lock.

2019-08-06 16:21:43.645 9092 [CRIT] main:1390: Exiting...

2019-08-06 16:22:08.404 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:22:08.404 9092 [ERROR] tcmu_notify_conn_lost:201 rbd/rbd.disk_1: Handler connection lost (lock state 1)

2019-08-06 16:22:08.547 9092 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:22:08.547 9092 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:22:08.563 9092 [ERROR] tgt_port_grp_recovery_thread_fn:257 rbd/rbd.disk_1: Could not reinitialize device. (err -5).

2019-08-06 16:22:08.563 9092 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.

2019-08-06 16:22:12.337 12661 [INFO] alua_implicit_transition:566 rbd/rbd.disk_1: Starting lock acquisition operation.

2019-08-06 16:22:12.349 12661 [WARN] tcmu_rbd_lock:762 rbd/rbd.disk_1: Acquired exclusive lock.

2019-08-06 16:28:16.401 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:16.401 12661 [ERROR] tcmu_notify_conn_lost:201 rbd/rbd.disk_1: Handler connection lost (lock state 1)

2019-08-06 16:28:16.736 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:19.104 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:19.220 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:19.663 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:19.930 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:19.997 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:20.725 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:20.961 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:22.151 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:22.831 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:23.834 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:23.903 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:24.537 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:24.548 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:26.148 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:27.468 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:28.146 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:29.179 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:30.325 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:32.878 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:33.305 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:33.539 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:35.441 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:35.546 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:35.797 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:36.033 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:37.553 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:40.703 12661 [ERROR] tcmu_rbd_handle_timedout_cmd:995 rbd/rbd.disk_1: Timing out cmd.

2019-08-06 16:28:40.704 12661 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt

mikechristie commented 5 years ago

It looks like multiple issues.

  1. The ceph/rbd timeout and lio abort messages indicate that commands are taking too long. If you are using default values then it would be 30 seconds.

You might have bad hw that could be causing commands to just take a long time. For example a bad disk or bad NIC might cause issues on the ceph side.

To double check it is the hw you would want to do the normal debugging and look in the /var/log/ceph logs for messages about timeout, commands taking too long, etc. And check the Ceph OSD and Mon nodes for kernel/hw errors, check the network, disks, etc.

  1. When commands take too long, the initiator will send TMFs. This is causing you to hit a possible bug in the linux kernel target layer:

Aug 6 16:22:08 ceph-client01 kernel: WARNING: CPU: 4 PID: 11043 at drivers/target/target_core_transport.c:761 target_handle_abort+0x193/0x1b0 [target_core_mod]

I have not seen this before. I would bet it's a regression caused by the patches that brought in:

commit 2c9fa49e100f962af988f1c0529231bf14905cda (HEAD) Author: Bart Van Assche bvanassche@acm.org Date: Tue Nov 27 15:52:03 2018 -0800

scsi: target/core: Make ABORT and LUN RESET handling synchronous

I am going to try and do some testing and force the issue.

Talangor commented 5 years ago

hi @mikechristie and thank you for your fast and informative reply I thought it might be because of perf and the sample rate drop would this help echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent as I told I'm going to change servers and will report if the same thing happens with g7 or it solves it I forgot to tell you NICs are 2 10Gb/s FC and load-balanced Mode 4 (802.3ad) with MTU 1500

mikechristie commented 5 years ago

hi @mikechristie and thank you for your fast and informative reply I thought it might be because of perf and the sample rate drop would this help echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent

I do not think so. Commands on the OSDs are just taking a really long time. Something on those systems like the hardware or the network between the OSDs and iscsi gateways is not right.

as I told I'm going to change servers and will report if the same thing happens with g7 or it solves it I forgot to tell you NICs are 2 10Gb/s FC and load-balanced Mode 4 (802.3ad) with MTU 1500

It won't solve the issue, but you should use jumbo frames if you can.