Closed discostur closed 1 year ago
Is it possible for you to download and test this kernel? kernel-4.18.0-477.16.1.el8_8 I remember there was a fix to prevent aborted commands from hanging the kernel and block re-logins
@maurizio-lombardi just checked i'm on el 8.8 with the latest kernel 4.18.0-477.15.1.el8_8.x86_64
... the 16.1
is not yet available. Do you have any sources where i can download it and try?
@maurizio-lombardi just checked i'm on el 8.8 with the latest kernel
4.18.0-477.15.1.el8_8.x86_64
... the16.1
is not yet available. Do you have any sources where i can download it and try?
Unfortunately I don't have a link, can you contact the Red Hat support? They might be able to help with this. I can't give you a kernel compiled by me if you are going to run it on production servers.
I don't have access to the newer kernel so i just tried out a mainline 6.4.7-1.el8.elrepo.x86_64
kernel from el-repo ... seems to be better, but still after some time i got errors and iscsi sessions were failing:
[Wed Aug 2 16:20:30 2023] Unable to locate RefTaskTag: 0x200000c5 on CID: 0.
[Wed Aug 2 16:20:30 2023] Unable to locate ITT: 0x20000048 on CID: 0
[Wed Aug 2 16:20:30 2023] Unable to locate RefTaskTag: 0x20000048 on CID: 0.
[Wed Aug 2 16:20:30 2023] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
[Wed Aug 2 16:20:30 2023] Received CmdSN: 0x8da2b653 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug 2 16:20:40 2023] Unable to locate ITT: 0x2000013a on CID: 0
[Wed Aug 2 16:20:40 2023] Unable to locate RefTaskTag: 0x2000013a on CID: 0.
[Wed Aug 2 16:20:40 2023] Received CmdSN: 0x8da2b654 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug 2 16:20:50 2023] Unable to locate ITT: 0x20000037 on CID: 0
[Wed Aug 2 16:20:50 2023] Unable to locate RefTaskTag: 0x20000037 on CID: 0.
[Wed Aug 2 16:20:50 2023] Received CmdSN: 0x8da2b655 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug 2 16:21:00 2023] Unable to locate ITT: 0x2000000f on CID: 0
[Wed Aug 2 16:21:00 2023] Unable to locate RefTaskTag: 0x2000000f on CID: 0.
[Wed Aug 2 16:21:00 2023] Received CmdSN: 0x8da2b656 is greater than MaxCmdSN: 0x8da2b58c, ignoring.
[Wed Aug 2 16:21:10 2023] Unable to locate ITT: 0x20000004 on CID: 0
[Wed Aug 2 16:21:10 2023] Unable to locate RefTaskTag: 0x20000004 on CID: 0.
[Wed Aug 2 16:21:10 2023] TMR Opcode TARGET_WARM_RESET authorization failed for Initiator Node: iqn.1994-05.com.redhat:ovirt04
[Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b739 is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73a is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73b is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73c is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73d is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73e is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73f is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b740 is greater than MaxCmdSN: 0x8da2b738, ignoring.
[Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b741 is greater than MaxCmdSN: 0x8da2b738, ignoring.
Didnt' see the INFO: task kworker/3:1:147 blocked for more than 120 seconds.
error. But it ends up the same ... the lock on the target LUNs is playing flip flop between my two iscsi target nodes:
2023-08-02 16:18:30.017 81 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun2: Async lock drop. Old state 5
2023-08-02 16:18:30.023 81 [INFO] alua_implicit_transition:592 rbd/ovirt.lun2: Starting write lock acquisition operation.
2023-08-02 16:18:30.025 81 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun2: appended blocklist entry: {192.168.26.42:0/2358469675}
2023-08-02 16:18:30.094 81 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun2: removing blocklist entry: {192.168.26.42:0/2358469675}
2023-08-02 16:18:31.962 81 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun2: Acquired exclusive lock.
2023-08-02 16:18:31.962 81 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun2: Write lock acquisition successful
2023-08-02 16:18:53.035 81 [WARN] tcmu_notify_lock_lost:271 rbd/ovirt.lun2: Async lock drop. Old state 5
2023-08-02 16:18:53.042 81 [INFO] alua_implicit_transition:592 rbd/ovirt.lun2: Starting write lock acquisition operation.
2023-08-02 16:18:53.085 81 [INFO] tcmu_rbd_close:1199 rbd/ovirt.lun2: appended blocklist entry: {192.168.26.42:0/1681666617}
2023-08-02 16:18:53.219 81 [INFO] tcmu_rbd_rm_stale_entries_from_blacklist:341 rbd/ovirt.lun2: removing blocklist entry: {192.168.26.42:0/1681666617}
2023-08-02 16:18:55.109 81 [INFO] tcmu_rbd_lock:948 rbd/ovirt.lun2: Acquired exclusive lock.
2023-08-02 16:18:55.110 81 [INFO] tcmu_acquire_dev_lock:489 rbd/ovirt.lun2: Write lock acquisition successful
I still have another tcmu runner with 1.5.2
version which runs on 4.18.0-425.3.1.el8.x86_64
- without any problems.
Tried already to boot the old kernel on the second host but still get the errors. The only difference to me is
As long as i use the same ceph container version and the host has booted the same kernel it shouldn't make any difference, or? But i still cannot get it running after the el8.8 update ... even with booting the same kernel + running exactly the same ceph container / tcmu runner version.
strange
I still have another tcmu runner with
1.5.2
version which runs on4.18.0-425.3.1.el8.x86_64
- without any problems.Tried already to boot the old kernel on the second host but still get the errors. The only difference to me is
* working host: el 8.7 * broken host: el 8.8
As long as i use the same ceph container version and the host has booted the same kernel it shouldn't make any difference, or? But i still cannot get it running after the el8.8 update ... even with booting the same kernel + running exactly the same ceph container / tcmu runner version.
strange
It all depends on a particular type of scsi command getting aborted so you may see this kind of unexpected behaviour. Strangely, this bug went unnoticed for years.
I think that the patch that fixes your issue is upstream commit ea87981a0ee8fb8ced1c87d004a541b60623ff97
I don't have access to the newer kernel so i just tried out a mainline
6.4.7-1.el8.elrepo.x86_64
kernel from el-repo ... seems to be better, but still after some time i got errors and iscsi sessions were failing:Didnt' see the
INFO: task kworker/3:1:147 blocked for more than 120 seconds.
error. But it ends up the same ... the lock on the target LUNs is playing flip flop between my two iscsi target nodes:
Ok, at least the kernel isn't hanging anymore. Maybe someone that knows tcmu_runner better can help with the debug messages you posted
I don't have access to the newer kernel so i just tried out a mainline
6.4.7-1.el8.elrepo.x86_64
kernel from el-repo ... seems to be better, but still after some time i got errors and iscsi sessions were failing:[Wed Aug 2 16:20:30 2023] Unable to locate RefTaskTag: 0x200000c5 on CID: 0. [Wed Aug 2 16:20:30 2023] Unable to locate ITT: 0x20000048 on CID: 0 [Wed Aug 2 16:20:30 2023] Unable to locate RefTaskTag: 0x20000048 on CID: 0. [Wed Aug 2 16:20:30 2023] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0 [Wed Aug 2 16:20:30 2023] Received CmdSN: 0x8da2b653 is greater than MaxCmdSN: 0x8da2b58c, ignoring. [Wed Aug 2 16:20:40 2023] Unable to locate ITT: 0x2000013a on CID: 0 [Wed Aug 2 16:20:40 2023] Unable to locate RefTaskTag: 0x2000013a on CID: 0. [Wed Aug 2 16:20:40 2023] Received CmdSN: 0x8da2b654 is greater than MaxCmdSN: 0x8da2b58c, ignoring. [Wed Aug 2 16:20:50 2023] Unable to locate ITT: 0x20000037 on CID: 0 [Wed Aug 2 16:20:50 2023] Unable to locate RefTaskTag: 0x20000037 on CID: 0. [Wed Aug 2 16:20:50 2023] Received CmdSN: 0x8da2b655 is greater than MaxCmdSN: 0x8da2b58c, ignoring. [Wed Aug 2 16:21:00 2023] Unable to locate ITT: 0x2000000f on CID: 0 [Wed Aug 2 16:21:00 2023] Unable to locate RefTaskTag: 0x2000000f on CID: 0. [Wed Aug 2 16:21:00 2023] Received CmdSN: 0x8da2b656 is greater than MaxCmdSN: 0x8da2b58c, ignoring. [Wed Aug 2 16:21:10 2023] Unable to locate ITT: 0x20000004 on CID: 0 [Wed Aug 2 16:21:10 2023] Unable to locate RefTaskTag: 0x20000004 on CID: 0. [Wed Aug 2 16:21:10 2023] TMR Opcode TARGET_WARM_RESET authorization failed for Initiator Node: iqn.1994-05.com.redhat:ovirt04 [Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b739 is greater than MaxCmdSN: 0x8da2b738, ignoring. [Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73a is greater than MaxCmdSN: 0x8da2b738, ignoring. [Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73b is greater than MaxCmdSN: 0x8da2b738, ignoring. [Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73c is greater than MaxCmdSN: 0x8da2b738, ignoring. [Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73d is greater than MaxCmdSN: 0x8da2b738, ignoring. [Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73e is greater than MaxCmdSN: 0x8da2b738, ignoring. [Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b73f is greater than MaxCmdSN: 0x8da2b738, ignoring. [Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b740 is greater than MaxCmdSN: 0x8da2b738, ignoring. [Wed Aug 2 16:21:12 2023] Received CmdSN: 0x8da2b741 is greater than MaxCmdSN: 0x8da2b738, ignoring.
Didnt' see the
INFO: task kworker/3:1:147 blocked for more than 120 seconds.
error. But it ends up the same ... the lock on the target LUNs is playing flip flop between my two iscsi target nodes:
This will happen when client tried to send requests to 2 paths alternately.
Could enable the highest debug level, let's see what commands are sent ?
@lxbsz i was able to catch some debug logs from tcmu-runner for the following kernel log (so that you can compare the timestamps):
[Fri Aug 4 11:45:59 2023] tcmu daemon: command reply support 1.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e42 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e43 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e44 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e45 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e46 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e47 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e48 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e49 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4a is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4b is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4c is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4d is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4e is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4f is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e50 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e51 is greater than MaxCmdSN: 0x8daa5e41, ignoring.
[Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e52 is greater than MaxCmdSN: 0x8daa5e47, ignoring.
[Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e7c is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e7d is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e7e is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e7f is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e80 is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e81 is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e82 is greater than MaxCmdSN: 0x8daa5e7b, ignoring.
[Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed.
[Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed.
[Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw
[Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed.
Debug logs are attached als txt file:
@lxbsz Just gather another log / debug output:
Aug 5 17:01:55 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000115 on CID: 0.
Aug 5 17:01:55 ceph-iscsi02 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
Aug 5 17:01:55 ceph-iscsi02 kernel: Unable to locate ITT: 0x30000110 on CID: 0
Aug 5 17:01:55 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000110 on CID: 0.
Aug 5 17:01:55 ceph-iscsi02 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
Aug 5 17:02:31 ceph-iscsi02 kernel: Unable to locate ITT: 0x3000001a on CID: 0
Aug 5 17:02:31 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x3000001a on CID: 0.
Aug 5 17:02:31 ceph-iscsi02 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
Aug 5 17:02:32 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ec6 is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug 5 17:02:42 ceph-iscsi02 kernel: Unable to locate ITT: 0x300000e1 on CID: 0
Aug 5 17:02:42 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x300000e1 on CID: 0.
Aug 5 17:02:43 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ec7 is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug 5 17:02:53 ceph-iscsi02 kernel: Unable to locate ITT: 0x30000123 on CID: 0
Aug 5 17:02:53 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000123 on CID: 0.
Aug 5 17:02:53 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ec8 is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug 5 17:03:03 ceph-iscsi02 kernel: Unable to locate ITT: 0x30000120 on CID: 0
Aug 5 17:03:03 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000120 on CID: 0.
Aug 5 17:03:03 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ec9 is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug 5 17:03:13 ceph-iscsi02 kernel: Unable to locate ITT: 0x30000112 on CID: 0
Aug 5 17:03:13 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x30000112 on CID: 0.
Aug 5 17:03:13 ceph-iscsi02 kernel: Received CmdSN: 0x8db45eca is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug 5 17:03:23 ceph-iscsi02 kernel: Unable to locate ITT: 0x300000c6 on CID: 0
Aug 5 17:03:23 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x300000c6 on CID: 0.
Aug 5 17:03:23 ceph-iscsi02 kernel: Received CmdSN: 0x8db45ecb is greater than MaxCmdSN: 0x8db45e04, ignoring.
Aug 5 17:03:33 ceph-iscsi02 kernel: Unable to locate ITT: 0x3000007b on CID: 0
Aug 5 17:03:33 ceph-iscsi02 kernel: Unable to locate RefTaskTag: 0x3000007b on CID: 0.
Aug 5 17:03:33 ceph-iscsi02 kernel: TMR Opcode TARGET_WARM_RESET authorization failed for Initiator Node: iqn.1994-05.com.redhat:ovirt04
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f98 is greater than MaxCmdSN: 0x8db45f97, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f99 is greater than MaxCmdSN: 0x8db45f97, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f9a is greater than MaxCmdSN: 0x8db45f98, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f9b is greater than MaxCmdSN: 0x8db45f98, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f9e is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45f9f is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa0 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa1 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa2 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa3 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa4 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa5 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa6 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa7 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa8 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fa9 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45faa is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fab is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fac is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fad is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fae is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45faf is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fb0 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Aug 5 17:03:35 ceph-iscsi02 kernel: Received CmdSN: 0x8db45fb1 is greater than MaxCmdSN: 0x8db45f9d, ignoring.
Debug log attached. As you can see it runs without any problems for about 18 - 20 minutes an then the trouble begins:
@lxbsz i was able to catch some debug logs from tcmu-runner for the following kernel log (so that you can compare the timestamps):
[Fri Aug 4 11:45:59 2023] tcmu daemon: command reply support 1. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e42 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e43 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e44 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e45 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e46 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e47 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e48 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e49 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4a is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4b is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4c is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4d is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4e is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e4f is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e50 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e51 is greater than MaxCmdSN: 0x8daa5e41, ignoring. [Fri Aug 4 11:47:58 2023] Received CmdSN: 0x8daa5e52 is greater than MaxCmdSN: 0x8daa5e47, ignoring. [Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e7c is greater than MaxCmdSN: 0x8daa5e7b, ignoring. [Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e7d is greater than MaxCmdSN: 0x8daa5e7b, ignoring. [Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e7e is greater than MaxCmdSN: 0x8daa5e7b, ignoring. [Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e7f is greater than MaxCmdSN: 0x8daa5e7b, ignoring. [Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e80 is greater than MaxCmdSN: 0x8daa5e7b, ignoring. [Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e81 is greater than MaxCmdSN: 0x8daa5e7b, ignoring. [Fri Aug 4 11:47:59 2023] Received CmdSN: 0x8daa5e82 is greater than MaxCmdSN: 0x8daa5e7b, ignoring. [Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed. [Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed. [Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed. [Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed. [Fri Aug 4 11:48:05 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:05 2023] iSCSI Login negotiation failed. [Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed. [Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed. [Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed. [Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed. [Fri Aug 4 11:48:07 2023] Unable to locate Target Portal Group on iqn.2019-03.com.redhat.iscsi-gw:ceph-igw [Fri Aug 4 11:48:07 2023] iSCSI Login negotiation failed.
Debug logs are attached als txt file:
From this logs, it seems in the client side you were issuing the IOs to two gateways alternatively. The configure seems incorrect. BTW, were you trying to do AA for the two paths ?
@lxbsz correct, that was a ghosted
LUN on one of the clients. The active path was ghosted and the client always sent IO to the backup path ... then the tcmu-runner nodes played ping pong / failovered the LUN every second.
I fixed that and the second log output should be better with all hosts showing correct path:
36001405ad900b299e6741439e4414ab4 dm-0 LIO-ORG ,TCMU device
size=5.0T features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='queue-length 0' prio=50 status=active
| `- 15:0:0:0 sdb 8:16 active ready running
`-+- policy='queue-length 0' prio=10 status=enabled
`- 16:0:0:0 sdh 8:112 active ready running
It seems like after a complete reboot of the whole infrastructure (including oVirt cluster / iscsi clients) the error does not appear anymore ...
Strange.
Hi,
i'm running tcmu-runner 1.6.0 with ceph 16.x and since a few hours im getting strange error messages:
Any idea what to do?
Thanks, Regards Kilian