Open YMhuan opened 3 years ago
After I restarted tcmu-runner & rbd-target-api on 66.66.66.2, it was fixed.
What are your tcmu-runner.service settings? Does systemd restart it after something happens? Do you have any logs about why tcmu-runner crashed?
25989 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25990 2021-07-26 14:53:38.087 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25991 2021-07-26 14:53:38.093 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
The rbd handler sets rados_osd_op_timeout to 30 by default, so requests to the Ceph cluster will time out if an OSD doesn't respond to tcmu-runner within 30 seconds. Have you checked the Ceph logs? Are there any errors?
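For reference, rados_osd_op_timeout is a standard librados client option; the rbd handler applies it programmatically when it opens the image, so the snippet below is only a sketch of what the effective default corresponds to in ceph.conf, not something you normally need to add (a conf-file value may be overridden by the handler):

```ini
# Illustrative only: the tcmu-runner rbd handler sets this itself.
# Shown as it would appear in /etc/ceph/ceph.conf on a gateway node.
[client]
rados_osd_op_timeout = 30
```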
The settings of the tcmu-runner service are as follows:
[Unit]
Description=LIO Userspace-passthrough daemon
After=network.target
[Service]
LimitNOFILE=1000000
Type=dbus
BusName=org.kernel.TCMUService1
KillMode=process
ExecStart=/usr/bin/tcmu-runner
[Install]
WantedBy=multi-user.target
I didn't find any log about why tcmu-runner restarted in /var/log/tcmu-runner.log. What I pasted in this issue are all the logs from my test.
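For what it's worth, the unit above contains no Restart= directive, so systemd on its own should not restart tcmu-runner after a crash; that behavior would require a hypothetical addition like the following (not present in the unit shown):

```ini
# Hypothetical: this would make systemd restart tcmu-runner after a crash.
[Service]
Restart=on-failure
RestartSec=5
```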
In the Ceph cluster I got one inactive PG. On the other nodes (where tcmu-runner did not auto-restart), I found logs like this:
2021-07-26 14:47:36.555 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 131072, data offset: 7621352292352, elapsed_usec: 44276
2021-07-26 14:47:36.571 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 7621342973952, elapsed_usec: 14293
2021-07-26 14:47:36.613 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-2: Timing out cmd.
2021-07-26 14:47:36.613 4050954 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-6-2: Handler connection lost (lock state 0)
2021-07-26 14:47:36.615 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-2: io mod: read mode, data len: 131072, data offset: 7753758081024, elapsed_usec: 30001306
2021-07-26 14:47:36.662 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-2: Timing out cmd.
2021-07-26 14:47:36.662 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-2: io mod: read mode, data len: 262144, data offset: 7753758212096, elapsed_usec: 30000921
2021-07-26 14:47:37.341 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-4: Timing out cmd.
2021-07-26 14:47:37.341 4050954 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-6-4: Handler connection lost (lock state 0)
2021-07-26 14:47:37.343 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 7621768609792, elapsed_usec: 30000880
2021-07-26 14:47:47.416 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-1: Timing out cmd.
2021-07-26 14:47:47.416 4050954 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-6-1: Handler connection lost (lock state 0)
2021-07-26 14:47:47.418 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-1: io mod: read mode, data len: 4096, data offset: 7747494891520, elapsed_usec: 30000675
2021-07-26 14:47:47.419 4050954 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2021-04.com.xxxxxxxxx.iscsi-gw:xxxxxx/tpgt_3.
2021-07-26 14:47:47.523 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-1: HA not supported.
2021-07-26 14:47:47.601 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-4: HA not supported.
2021-07-26 14:47:47.649 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-2: HA not supported.
2021-07-26 14:47:47.649 4050954 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2021-04.com.xxxxxxxx.iscsi-gw:xxxx/tpgt_3.
2021-07-26 14:48:21.371 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-1: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-26 14:48:21.384 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-4: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-26 14:48:21.437 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-2: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-26 14:48:21.442 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-3: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
I expected the gateway to reconnect automatically after being disconnected by the timeout, but I can't discover or log in to the gateway on the node where tcmu-runner auto-restarted. Is there something I missed?
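As an aside, the unsupported CDB in the warnings above is harmless: byte 0 is the opcode and the low 5 bits of byte 1 are the service action, which decode to MAINTENANCE IN / REPORT SUPPORTED OPERATION CODES (SPC-4), an initiator capability probe tcmu-runner simply doesn't implement. A quick sketch of decoding the first two bytes:

```shell
# Decode the first two CDB bytes from the "a3 c 1 12 ..." warning.
cdb="a3 c 1 12 0 0 0 0 2 0 0 0"
set -- $cdb
# Byte 0 is the opcode; the low 5 bits of byte 1 are the service action.
printf 'opcode=0x%02x service_action=0x%02x\n' "0x$1" "$((0x$2 & 0x1f))"
# -> opcode=0xa3 service_action=0x0c
# 0xa3 = MAINTENANCE IN, service action 0x0c = REPORT SUPPORTED
# OPERATION CODES: a capability probe, safe to ignore.
```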
Are you using containers or something similar? I know that in some container environments the container service monitors the tcmu-runner service's status, and if it dies the container will try to pull it back up.
By default tcmu-runner won't be auto-restarted by its systemd service, so I am curious how this could happen for you.
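One way to check whether systemd (rather than a container supervisor) restarted the daemon is to inspect the unit and the journal; a sketch, assuming a systemd host (the fallback echoes just keep it from erroring elsewhere):

```shell
# Does the unit have a Restart= policy, and has systemd restarted it?
# (NRestarts needs systemd >= 235.)
systemctl show tcmu-runner -p Restart -p NRestarts 2>/dev/null \
    || echo "systemctl not available here"
# Any crash/exit records around the restart window:
journalctl -u tcmu-runner --since "2021-07-26 14:40" --until "2021-07-26 15:00" 2>/dev/null \
    || echo "journalctl not available here"
```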
Were you building tcmu-runner from source?
The reason this node reports "HA not supported" is usually that tcmu-runner was built against a lower Ceph version, one that doesn't support the exclusive lock.
While on the other node, was tcmu-runner built with a higher Ceph version?
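One way to compare the nodes is to check which Ceph client libraries each node's tcmu-runner binary is actually linked against; a sketch, assuming an RPM-based install like your CentOS 7 (package names librbd1/librados2 are the usual RPM names):

```shell
# Which librbd/librados is tcmu-runner linked against on this node?
ldd "$(command -v tcmu-runner)" 2>/dev/null | grep -E 'librbd|librados' \
    || echo "tcmu-runner not found on this host"
# Compare the installed Ceph client library versions across gateways:
rpm -q librbd1 librados2 2>/dev/null || echo "rpm not available here"
```

Running this on both the node that logs "HA not supported" and one that doesn't should show whether they diverge.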
Thank you for replying. The Ceph cluster version is 13.2.8 on all storage nodes, and tcmu-runner was built from the 1.4.0 source code. "HA not supported" occurred on the "other node", which I didn't paste yesterday, like this:
2021-07-27 15:04:47.234 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-1: io mod: read mode, data len: 262144, data offset: 7866031341568, elapsed_usec: 964577
2021-07-27 15:04:47.251 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 131072, data offset: 7623567015936, elapsed_usec: 590750
2021-07-27 15:04:47.251 4050954 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2021-04.com.xxxxxxxxx.iscsi-gw:xxxxxxx/tpgt_3.
2021-07-27 15:04:47.396 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-2: HA not supported.
2021-07-27 15:04:47.396 4050954 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2021-04.com.xxxxxxxxx.iscsi-gw:xxxxxxx/tpgt_3.
2021-07-27 15:05:46.317 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-1: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.337 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-4: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.348 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-3: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.361 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-2: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.454 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 0, elapsed_usec: 93241
2021-07-27 15:05:46.489 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 4096, elapsed_usec: 34772
What I can find on the gateway node (storage node) is:
[root@gateway6 ~]# netstat -anp |grep 3260
tcp 0 0 66.66.66.1:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.3:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.2:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.170:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.169:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.168:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.167:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.166:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47513 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46572 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47373 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47443 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47225 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46863 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47005 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47586 FIN_WAIT2 -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47293 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46722 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46933 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46430 TIME_WAIT -
tcp 0 0 66.66.66.1:33260 66.66.66.168:7159 ESTABLISHED 3000906/ceph-osd
tcp 0 0 66.66.66.1:3260 172.168.188.19:47152 TIME_WAIT -
tcp 0 0 88.88.88.1:7131 88.88.88.168:33260 ESTABLISHED 2999805/ceph-osd
tcp 0 0 88.88.88.1:7157 88.88.88.169:33260 ESTABLISHED 3007121/ceph-osd
tcp 0 0 66.66.66.1:3260 172.168.188.19:46651 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47081 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46504 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46351 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46791 TIME_WAIT -
On the client, the iscsid service status:
[root@vm-node11 ~]# systemctl status iscsid
● iscsid.service - Open-iSCSI
Loaded: loaded (/usr/lib/systemd/system/iscsid.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2021-07-24 11:00:33 CST; 4 days ago
Docs: man:iscsid(8)
man:iscsiadm(8)
Main PID: 4459 (iscsid)
CGroup: /system.slice/iscsid.service
├─4458 /usr/sbin/iscsid
└─4459 /usr/sbin/iscsid
Jul 28 14:47:33 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:36 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:39 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:42 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:45 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:48 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:51 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:54 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:57 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:48:00 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
I will set "log_level = 5" in /etc/tcmu/tcmu.conf and reproduce this error.
Hello!
tcmu version: tcmu-runner-1.4.0
Linux version & kernel version: CentOS 7.7.1908, kernel 3.10.0-1127
Ceph version: 13.2.8 Mimic; 272 OSDs across 8 storage nodes; 32 rbd volumes (LUNs) on 8 clients; 8 gateways (1 client logs in to 1 gateway); 1 target
cluster IO: 650 MB/s, 1k IOPS write
What I did: simulate an inactive PG while watching the corresponding changes in cluster IO and the iSCSI LUNs.
What I expected to see: LUN times out -> disconnected -> re-login. But what did not go as expected:
(1) Some tcmu-runner services seem to have restarted after the timeout (4 of 8 tcmu-runner services were restarted).
(2) Cannot discover the target using the gateway IP:port (see the example below).
e.g.: vm-node11 (the client) was connected to 66.66.66.2:3260, but after tcmu-runner auto-restarted, the VM can't discover targets with the command: iscsiadm -m discovery -t sendtargets -p 66.66.66.2:3260
Command output on the client:
25970 2021-07-26 14:47:10.815 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-1: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25971 2021-07-26 14:47:10.826 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-2: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25972 2021-07-26 14:47:10.840 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-3: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25973 2021-07-26 14:47:10.840 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-1: Starting lock acquisition operation.
25974 2021-07-26 14:47:10.848 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-4: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25975 2021-07-26 14:47:10.849 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-2: Starting lock acquisition operation.
25976 2021-07-26 14:47:10.857 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-3: Starting lock acquisition operation.
25977 2021-07-26 14:47:10.865 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-4: Starting lock acquisition operation.
25978 2021-07-26 14:47:10.921 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-1: Acquired exclusive lock.
25979 2021-07-26 14:47:10.927 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-2: Acquired exclusive lock.
25980 2021-07-26 14:47:10.997 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-4: Acquired exclusive lock.
25981 2021-07-26 14:47:11.013 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-3: Acquired exclusive lock.
25982 2021-07-26 14:53:21.238 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25983 2021-07-26 14:53:21.238 1825526 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-11-3: Handler connection lost (lock state 1)
25984 2021-07-26 14:53:21.239 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25985 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25986 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25987 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25988 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25989 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25990 2021-07-26 14:53:38.087 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25991 2021-07-26 14:53:38.093 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25992 2021-07-26 14:53:38.097 1825526 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2021-04.com.fixxxxxxe.iscsi-gw:xxxtor/tpgt_7.
25993 2021-07-26 14:54:41.430 1832626 [INFO] dyn_config_start:422: Inotify is watching "/etc/tcmu/tcmu.conf", wd: 1, mask: IN_ALL_EVENTS
The log ends here for 3 hours.
END
Thank you for reading~