open-iscsi / tcmu-runner

A daemon that handles the userspace side of the LIO TCM-User backstore.
Apache License 2.0
189 stars 149 forks source link

Received CmdSN is greater than MaxCmdSN #697

Closed discostur closed 1 year ago

discostur commented 1 year ago

Hi,

i'm running tcmu-runner 1.6.0 with ceph 16.x and since a few hours im getting strange error messages:

[Tue Aug  1 18:45:16 2023] Received CmdSN: 0x8d9f9eb3 is greater than MaxCmdSN: 0x8d9f9df3, ignoring.
[Tue Aug  1 18:45:16 2023] Received CmdSN: 0x8d9f9eb4 is greater than MaxCmdSN: 0x8d9f9df3, ignoring.
[Tue Aug  1 18:45:22 2023] Received CmdSN: 0x8d9f9eb5 is greater than MaxCmdSN: 0x8d9f9df3, ignoring.
[Tue Aug  1 18:45:25 2023] Received CmdSN: 0x8d9f9eb6 is greater than MaxCmdSN: 0x8d9f9df3, ignoring.
[Tue Aug  1 18:45:27 2023] Received CmdSN: 0x8d9f9eb7 is greater than MaxCmdSN: 0x8d9f9df3, ignoring.
[Tue Aug  1 18:45:28 2023] Received CmdSN: 0x8d9f9eb8 is greater than MaxCmdSN: 0x8d9f9df3, ignoring.
[Tue Aug  1 18:45:28 2023] Received CmdSN: 0x8d9f9eb9 is greater than MaxCmdSN: 0x8d9f9df3, ignoring.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x80000060 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x80000060 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x800001b1 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x800001b1 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x80000127 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x80000127 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x8000011e on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x8000011e on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x800001b3 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x800001b3 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x800000a3 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x800000a3 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x80000142 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x80000142 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x800000ac on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x800000ac on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x8000005d on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x8000005d on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x80000110 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x80000110 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x80000061 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x80000061 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x80000095 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x80000095 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x800000fc on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x800000fc on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x80000145 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x80000145 on CID: 0.
[Tue Aug  1 18:45:31 2023] Unable to locate ITT: 0x80000149 on CID: 0
[Tue Aug  1 18:45:31 2023] Unable to locate RefTaskTag: 0x80000149 on CID: 0.
[Tue Aug  1 18:45:31 2023] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
[Tue Aug  1 18:45:31 2023] ABORT_TASK: Found referenced iSCSI task_tag: 2147483804
[Tue Aug  1 18:46:20 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:46:20 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:46:20 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:46:38 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:46:38 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:46:38 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:46:55 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:46:55 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:46:55 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:47:12 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:47:12 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:47:12 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:47:29 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:47:29 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:47:29 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:47:45 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:47:45 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:47:45 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:48:03 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:48:03 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:48:03 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:48:20 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:48:20 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:48:20 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:48:37 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:48:37 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:48:37 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:48:54 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:48:54 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:48:54 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:49:10 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:49:10 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:49:10 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:49:28 2023] iSCSI Login timeout on Network Portal 192.168.26.42:3260
[Tue Aug  1 18:49:28 2023] tx_data returned -104, expecting 48.
[Tue Aug  1 18:49:28 2023] iSCSI Login negotiation failed.
[Tue Aug  1 18:49:30 2023] INFO: task kworker/3:1:147 blocked for more than 120 seconds.
[Tue Aug  1 18:49:30 2023]       Tainted: G           OE    --------- -  - 4.18.0-477.15.1.el8_8.x86_64 #1
[Tue Aug  1 18:49:30 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Aug  1 18:49:30 2023] task:kworker/3:1     state:D stack:    0 pid:  147 ppid:     2 flags:0x80004000
[Tue Aug  1 18:49:30 2023] Workqueue: events target_tmr_work [target_core_mod]
[Tue Aug  1 18:49:30 2023] Call Trace:
[Tue Aug  1 18:49:30 2023]  __schedule+0x2d1/0x870
[Tue Aug  1 18:49:30 2023]  schedule+0x55/0xf0
[Tue Aug  1 18:49:30 2023]  schedule_timeout+0x278/0x300
[Tue Aug  1 18:49:30 2023]  ? irq_work_queue+0x9/0x30
[Tue Aug  1 18:49:30 2023]  ? wake_up_klogd.part.30+0x30/0x40
[Tue Aug  1 18:49:30 2023]  ? vprintk_emit+0x125/0x250
[Tue Aug  1 18:49:30 2023]  wait_for_completion+0x96/0x100
[Tue Aug  1 18:49:30 2023]  target_put_cmd_and_wait+0x6e/0x90 [target_core_mod]
[Tue Aug  1 18:49:30 2023]  core_tmr_abort_task.cold.7+0x19b/0x216 [target_core_mod]
[Tue Aug  1 18:49:30 2023]  target_tmr_work+0x96/0x110 [target_core_mod]
[Tue Aug  1 18:49:30 2023]  process_one_work+0x1a7/0x360
[Tue Aug  1 18:49:30 2023]  ? create_worker+0x1a0/0x1a0
[Tue Aug  1 18:49:30 2023]  worker_thread+0x30/0x390
[Tue Aug  1 18:49:30 2023]  ? create_worker+0x1a0/0x1a0
[Tue Aug  1 18:49:30 2023]  kthread+0x134/0x150
[Tue Aug  1 18:49:30 2023]  ? set_kthread_struct+0x50/0x50
[Tue Aug  1 18:49:30 2023]  ret_from_fork+0x35/0x40
[Tue Aug  1 18:49:30 2023] INFO: task kworker/12:1:221 blocked for more than 120 seconds.
[Tue Aug  1 18:49:30 2023]       Tainted: G           OE    --------- -  - 4.18.0-477.15.1.el8_8.x86_64 #1
[Tue Aug  1 18:49:30 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Aug  1 18:49:30 2023] task:kworker/12:1    state:D stack:    0 pid:  221 ppid:     2 flags:0x80004000
[Tue Aug  1 18:49:30 2023] Workqueue: events iscsi_target_do_login_rx [iscsi_target_mod]
[Tue Aug  1 18:49:30 2023] Call Trace:
[Tue Aug  1 18:49:30 2023]  __schedule+0x2d1/0x870
[Tue Aug  1 18:49:30 2023]  ? sched_clock_cpu+0xc/0xb0
[Tue Aug  1 18:49:30 2023]  ? __smp_call_single_queue+0x23/0x40
[Tue Aug  1 18:49:30 2023]  schedule+0x55/0xf0
[Tue Aug  1 18:49:30 2023]  schedule_timeout+0x278/0x300
[Tue Aug  1 18:49:30 2023]  ? signal_wake_up_state+0x15/0x30
[Tue Aug  1 18:49:30 2023]  ? __send_signal+0x359/0x4b0
[Tue Aug  1 18:49:30 2023]  wait_for_completion+0x96/0x100
[Tue Aug  1 18:49:30 2023]  iscsit_cause_connection_reinstatement+0x95/0xf0 [iscsi_target_mod]
[Tue Aug  1 18:49:30 2023]  iscsit_stop_session+0xff/0x190 [iscsi_target_mod]
[Tue Aug  1 18:49:30 2023]  iscsi_check_for_session_reinstatement+0x1e8/0x280 [iscsi_target_mod]
[Tue Aug  1 18:49:30 2023]  iscsi_target_do_login+0x17e/0x570 [iscsi_target_mod]
[Tue Aug  1 18:49:30 2023]  iscsi_target_do_login_rx+0x28c/0x370 [iscsi_target_mod]
[Tue Aug  1 18:49:30 2023]  ? iscsi_target_restore_sock_callbacks+0xc0/0xc0 [iscsi_target_mod]
[Tue Aug  1 18:49:30 2023]  process_one_work+0x1a7/0x360
[Tue Aug  1 18:49:30 2023]  ? create_worker+0x1a0/0x1a0
[Tue Aug  1 18:49:30 2023]  worker_thread+0x30/0x390
[Tue Aug  1 18:49:30 2023]  ? create_worker+0x1a0/0x1a0
[Tue Aug  1 18:49:30 2023]  kthread+0x134/0x150
[Tue Aug  1 18:49:30 2023]  ? set_kthread_struct+0x50/0x50
[Tue Aug  1 18:49:30 2023]  ret_from_fork+0x35/0x40

Any idea what to do?

Thanks, Regards Kilian

maurizio-lombardi commented 1 year ago

Is it possible for you to download and test this kernel? kernel-4.18.0-477.16.1.el8_8 I remember there was a fix to prevent aborted commands from hanging the kernel and block re-logins

discostur commented 1 year ago

@maurizio-lombardi just checked i'm on el 8.8 with the latest kernel 4.18.0-477.15.1.el8_8.x86_64 ... the 16.1 is not yet available. Do you have any sources where i can download it and try?

maurizio-lombardi commented 1 year ago

@maurizio-lombardi just checked i'm on el 8.8 with the latest kernel 4.18.0-477.15.1.el8_8.x86_64 ... the 16.1 is not yet available. Do you have any sources where i can download it and try?

Unfortunately I don't have a link, can you contact the Red Hat support? They might be able to help with this. I can't give you a kernel compiled by me if you are going to run it on production servers.

discostur commented 1 year ago

I don't have access to the newer kernel so i just tried out a mainline 6.4.7-1.el8.elrepo.x86_64 kernel from el-repo ... seems to be better, but still after some time i got errors and iscsi sessions were failing:

[Wed Aug  2 16:20:30 2023] Unable to locate RefTaskTag: 0x200000c5 on CID: 0.
[Wed Aug  2 16:20:30 2023] Unable to locate ITT: 0x20000048 on CID: 0
[Wed Aug  2 16:20:30 2023] Unable to locate RefTaskTag: 0x20000048 on CID: 0.
[Wed Aug  2 16:20:30 2023] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
[Wed Aug  2 16:20:30 2023] Received CmdSN: 0x8da2b653 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug  2 16:20:40 2023] Unable to locate ITT: 0x2000013a on CID: 0
[Wed Aug  2 16:20:40 2023] Unable to locate RefTaskTag: 0x2000013a on CID: 0.
[Wed Aug  2 16:20:40 2023] Received CmdSN: 0x8da2b654 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug  2 16:20:50 2023] Unable to locate ITT: 0x20000037 on CID: 0
[Wed Aug  2 16:20:50 2023] Unable to locate RefTaskTag: 0x20000037 on CID: 0.
[Wed Aug  2 16:20:50 2023] Received CmdSN: 0x8da2b655 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug  2 16:21:00 2023] Unable to locate ITT: 0x2000000f on CID: 0
[Wed Aug  2 16:21:00 2023] Unable to locate RefTaskTag: 0x2000000f on CID: 0.
[Wed Aug  2 16:21:00 2023] Received CmdSN: 0x8da2b656 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug  2 16:21:10 2023] Unable to locate ITT: 0x20000004 on CID: 0
[Wed Aug  2 16:21:10 2023] Unable to locate RefTaskTag: 0x20000004 on CID: 0.
[Wed Aug  2 16:21:10 2023] TMR Opcode TARGET_WARM_RESET authorization failed for Initiator Node: iqn.1994-05.com.redhat:ovirt04
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b739 is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73a is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73b is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73c is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73d is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73e is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73f is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b740 is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b741 is greater than MaxCmdSN: 0x8da2b738, ignoring.

Didnt' see the INFO: task kworker/3:1:147 blocked for more than 120 seconds. error. But it ends up the same ... the lock on the target LUNs is playing flip flop between my two iscsi target nodes:

 2023-08-02 16:18:30.017 81 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun2: Async lock drop. Old state 5
2023-08-02 16:18:30.023 81 [INFO] alua_implicit_transition:592 rbd/ovirt.lun2: Starting write lock acquisition operation.
2023-08-02 16:18:30.025 81 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun2: appended blocklist entry: {192.168.26.42:0/2358469675}
2023-08-02 16:18:30.094 81 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun2: removing blocklist entry: {192.168.26.42:0/2358469675}
2023-08-02 16:18:31.962 81 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun2: Acquired exclusive lock.
2023-08-02 16:18:31.962 81 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun2: Write lock acquisition successful
2023-08-02 16:18:53.035 81 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun2: Async lock drop. Old state 5
2023-08-02 16:18:53.042 81 [INFO] alua_implicit_transition:592 rbd/ovirt.lun2: Starting write lock acquisition operation.
2023-08-02 16:18:53.085 81 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun2: appended blocklist entry: {192.168.26.42:0/1681666617}
2023-08-02 16:18:53.219 81 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun2: removing blocklist entry: {192.168.26.42:0/1681666617}
2023-08-02 16:18:55.109 81 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun2: Acquired exclusive lock.
2023-08-02 16:18:55.110 81 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun2: Write lock acquisition successful
discostur commented 1 year ago

I still have another tcmu runner with 1.5.2 version which runs on 4.18.0-425.3.1.el8.x86_64 - without any problems.

Tried already to boot the old kernel on the second host but still get the errors. The only difference to me is

As long as i use the same ceph container version and the host has booted the same kernel it shouldn't make any difference, or? But i still cannot get it running after the el8.8 update ... even with booting the same kernel + running exactly the same ceph container / tcmu runner version.

strange

maurizio-lombardi commented 1 year ago

I still have another tcmu runner with 1.5.2 version which runs on 4.18.0-425.3.1.el8.x86_64 - without any problems.

Tried already to boot the old kernel on the second host but still get the errors. The only difference to me is

* working host: el 8.7

* broken host: el 8.8

As long as i use the same ceph container version and the host has booted the same kernel it shouldn't make any difference, or? But i still cannot get it running after the el8.8 update ... even with booting the same kernel + running exactly the same ceph container / tcmu runner version.

strange

It all depends on a particular type of scsi command getting aborted so you may see this kind of unexpected behaviour. Strangely, this bug went unnoticed for years.

I think that the patch that fixes your issue is upstream commit ea87981a0ee8fb8ced1c87d004a541b60623ff97

maurizio-lombardi commented 1 year ago

I don't have access to the newer kernel so i just tried out a mainline 6.4.7-1.el8.elrepo.x86_64 kernel from el-repo ... seems to be better, but still after some time i got errors and iscsi sessions were failing:

Didnt' see the INFO: task kworker/3:1:147 blocked for more than 120 seconds. error. But it ends up the same ... the lock on the target LUNs is playing flip flop between my two iscsi target nodes:

Ok, at least the kernel isn't hanging anymore. Maybe someone that knows tcmu_runner better can help with the debug messages you posted

lxbsz commented 1 year ago

I don't have access to the newer kernel so i just tried out a mainline 6.4.7-1.el8.elrepo.x86_64 kernel from el-repo ... seems to be better, but still after some time i got errors and iscsi sessions were failing:

[Wed Aug  2 16:20:30 2023] Unable to locate RefTaskTag: 0x200000c5 on CID: 0.
[Wed Aug  2 16:20:30 2023] Unable to locate ITT: 0x20000048 on CID: 0
[Wed Aug  2 16:20:30 2023] Unable to locate RefTaskTag: 0x20000048 on CID: 0.
[Wed Aug  2 16:20:30 2023] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
[Wed Aug  2 16:20:30 2023] Received CmdSN: 0x8da2b653 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug  2 16:20:40 2023] Unable to locate ITT: 0x2000013a on CID: 0
[Wed Aug  2 16:20:40 2023] Unable to locate RefTaskTag: 0x2000013a on CID: 0.
[Wed Aug  2 16:20:40 2023] Received CmdSN: 0x8da2b654 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug  2 16:20:50 2023] Unable to locate ITT: 0x20000037 on CID: 0
[Wed Aug  2 16:20:50 2023] Unable to locate RefTaskTag: 0x20000037 on CID: 0.
[Wed Aug  2 16:20:50 2023] Received CmdSN: 0x8da2b655 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug  2 16:21:00 2023] Unable to locate ITT: 0x2000000f on CID: 0
[Wed Aug  2 16:21:00 2023] Unable to locate RefTaskTag: 0x2000000f on CID: 0.
[Wed Aug  2 16:21:00 2023] Received CmdSN: 0x8da2b656 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug  2 16:21:10 2023] Unable to locate ITT: 0x20000004 on CID: 0
[Wed Aug  2 16:21:10 2023] Unable to locate RefTaskTag: 0x20000004 on CID: 0.
[Wed Aug  2 16:21:10 2023] TMR Opcode TARGET_WARM_RESET authorization failed for Initiator Node: iqn.1994-05.com.redhat:ovirt04
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b739 is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73a is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73b is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73c is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73d is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73e is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b73f is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b740 is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug  2 16:21:12 2023] Received CmdSN: 0x8da2b741 is greater than MaxCmdSN: 0x8da2b738, ignoring.

Didnt' see the INFO: task kworker/3:1:147 blocked for more than 120 seconds. error. But it ends up the same ... the lock on the target LUNs is playing flip flop between my two iscsi target nodes:

This will happen when client tried to send requests to 2 paths alternately.

Could enable the highest debug level, let's see what commands are sent ?

discostur commented 1 year ago

@lxbsz i was able to catch some debug logs from tcmu-runner for the following kernel log (so that you can compare the timestamps):

[Fri Aug  4 11:45:59 2023] tcmu daemon: command reply support 1.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e42 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e43 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e44 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e45 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e46 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e47 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e48 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e49 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4a is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4b is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4c is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4d is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4e is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4f is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e50 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e51 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e52 is greater than MaxCmdSN: 0x8daa5e47, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e7c is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e7d is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e7e is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e7f is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e80 is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e81 is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e82 is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.

Debug logs are attached als txt file:

tcmu-runner.log

discostur commented 1 year ago

@lxbsz Just gather another log / debug output:

Aug  5 17:01:55 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000115 on CID: 0.
Aug  5 17:01:55 ceph-iscsi02 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
Aug  5 17:01:55 ceph-iscsi02 kernel: Unable to locate ITT: 0x30000110 on CID: 0
Aug  5 17:01:55 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000110 on CID: 0.
Aug  5 17:01:55 ceph-iscsi02 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
Aug  5 17:02:31 ceph-iscsi02 kernel: Unable to locate ITT: 0x3000001a on CID: 0
Aug  5 17:02:31 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x3000001a on CID: 0.
Aug  5 17:02:31 ceph-iscsi02 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
Aug  5 17:02:32 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ec6 is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug  5 17:02:42 ceph-iscsi02 kernel: Unable to locate ITT: 0x300000e1 on CID: 0
Aug  5 17:02:42 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x300000e1 on CID: 0.
Aug  5 17:02:43 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ec7 is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug  5 17:02:53 ceph-iscsi02 kernel: Unable to locate ITT: 0x30000123 on CID: 0
Aug  5 17:02:53 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000123 on CID: 0.
Aug  5 17:02:53 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ec8 is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug  5 17:03:03 ceph-iscsi02 kernel: Unable to locate ITT: 0x30000120 on CID: 0
Aug  5 17:03:03 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000120 on CID: 0.
Aug  5 17:03:03 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ec9 is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug  5 17:03:13 ceph-iscsi02 kernel: Unable to locate ITT: 0x30000112 on CID: 0
Aug  5 17:03:13 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000112 on CID: 0.
Aug  5 17:03:13 ceph-iscsi02 kernel: Received CmdSN: 0x8db45eca is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug  5 17:03:23 ceph-iscsi02 kernel: Unable to locate ITT: 0x300000c6 on CID: 0
Aug  5 17:03:23 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x300000c6 on CID: 0.
Aug  5 17:03:23 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ecb is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug  5 17:03:33 ceph-iscsi02 kernel: Unable to locate ITT: 0x3000007b on CID: 0
Aug  5 17:03:33 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x3000007b on CID: 0.
Aug  5 17:03:33 ceph-iscsi02 kernel: TMR Opcode TARGET_WARM_RESET authorization failed for Initiator Node: iqn.1994-05.com.redhat:ovirt04
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f98 is greater than MaxCmdSN: 0x8db45f97, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f99 is greater than MaxCmdSN: 0x8db45f97, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f9a is greater than MaxCmdSN: 0x8db45f98, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f9b is greater than MaxCmdSN: 0x8db45f98, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f9e is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f9f is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa0 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa1 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa2 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa3 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa4 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa5 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa6 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa7 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa8 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa9 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45faa is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fab is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fac is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fad is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fae is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45faf is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fb0 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug  5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fb1 is greater than MaxCmdSN: 0x8db45f9d, ignoring.

Debug log attached. As you can see it runs without any problems for about 18 - 20 minutes an then the trouble begins:

tcmu-runner.log.zip

lxbsz commented 1 year ago

@lxbsz i was able to catch some debug logs from tcmu-runner for the following kernel log (so that you can compare the timestamps):

[Fri Aug  4 11:45:59 2023] tcmu daemon: command reply support 1.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e42 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e43 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e44 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e45 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e46 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e47 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e48 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e49 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4a is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4b is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4c is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4d is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4e is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e4f is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e50 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e51 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug  4 11:47:58 2023] Received CmdSN: 0x8daa5e52 is greater than MaxCmdSN: 0x8daa5e47, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e7c is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e7d is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e7e is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e7f is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e80 is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e81 is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:47:59 2023] Received CmdSN: 0x8daa5e82 is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug  4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug  4 11:48:07 2023] iSCSI Login negotiation failed.

Debug logs are attached als txt file:

tcmu-runner.log

From this logs, it seems in the client side you were issuing the IOs to two gateways alternatively. The configure seems incorrect. BTW, were you trying to do AA for the two paths ?

discostur commented 1 year ago

@lxbsz correct, that was a ghosted LUN on one of the clients. The active path was ghosted and the client always sent IO to the backup path ... then the tcmu-runner nodes played ping pong / failovered the LUN every second.

I fixed that and the second log output should be better with all hosts showing correct path:

36001405ad900b299e6741439e4414ab4 dm-0 LIO-ORG ,TCMU device     
size=5.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='queue-length 0' prio=50 status=active
| `- 15:0:0:0 sdb 8:16  active ready running
`-+- policy='queue-length 0' prio=10 status=enabled
  `- 16:0:0:0 sdh 8:112 active ready running
discostur commented 1 year ago

It seems like after a complete reboot of the whole infrastructure (including oVirt cluster / iscsi clients) the error does not appear anymore ...

Strange.