Open YMhuan opened 3 years ago
After I restarted tcmu-runner & rbd-target-api on 66.66.66.2, it was fixed.
What are your tcmu-runner.service settings? Does systemd restart it after something happens? Do you have any logs about why tcmu-runner crashed?
25989 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25990 2021-07-26 14:53:38.087 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25991 2021-07-26 14:53:38.093 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
The rbd handler sets rados_osd_op_timeout to 30 by default, so requests to the Ceph cluster will time out if an OSD doesn't respond to tcmu-runner within 30 seconds. Have you checked the Ceph logs? Are there any errors?
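For reference, rados_osd_op_timeout is a standard librados client option; the rbd handler applies it programmatically when it opens the image, so the snippet below is only a sketch of what the effective default corresponds to in ceph.conf, not something you normally need to add (a conf-file value may be overridden by the handler):

```ini
# Illustrative only: the tcmu-runner rbd handler sets this itself.
# Shown as it would appear in /etc/ceph/ceph.conf on a gateway node.
[client]
rados_osd_op_timeout = 30
```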
The settings of the tcmu-runner service are as follows:
[Unit]
Description=LIO Userspace-passthrough daemon
After=network.target
[Service]
LimitNOFILE=1000000
Type=dbus
BusName=org.kernel.TCMUService1
KillMode=process
ExecStart=/usr/bin/tcmu-runner
[Install]
WantedBy=multi-user.target
I didn't find any log about why tcmu-runner restarted in /var/log/tcmu-runner.log. What I pasted in this issue are all the logs from my test.
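For what it's worth, the unit above contains no Restart= directive, so systemd on its own should not restart tcmu-runner after a crash; that behavior would require a hypothetical addition like the following (not present in the unit shown):

```ini
# Hypothetical: this would make systemd restart tcmu-runner after a crash.
[Service]
Restart=on-failure
RestartSec=5
```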
In the Ceph cluster I got one inactive PG. On the other nodes (where tcmu-runner did not auto-restart), I found logs like this:
2021-07-26 14:47:36.555 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 131072, data offset: 7621352292352, elapsed_usec: 44276
2021-07-26 14:47:36.571 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 7621342973952, elapsed_usec: 14293
2021-07-26 14:47:36.613 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-2: Timing out cmd.
2021-07-26 14:47:36.613 4050954 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-6-2: Handler connection lost (lock state 0)
2021-07-26 14:47:36.615 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-2: io mod: read mode, data len: 131072, data offset: 7753758081024, elapsed_usec: 30001306
2021-07-26 14:47:36.662 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-2: Timing out cmd.
2021-07-26 14:47:36.662 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-2: io mod: read mode, data len: 262144, data offset: 7753758212096, elapsed_usec: 30000921
2021-07-26 14:47:37.341 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-4: Timing out cmd.
2021-07-26 14:47:37.341 4050954 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-6-4: Handler connection lost (lock state 0)
2021-07-26 14:47:37.343 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 7621768609792, elapsed_usec: 30000880
2021-07-26 14:47:47.416 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-1: Timing out cmd.
2021-07-26 14:47:47.416 4050954 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-6-1: Handler connection lost (lock state 0)
2021-07-26 14:47:47.418 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-1: io mod: read mode, data len: 4096, data offset: 7747494891520, elapsed_usec: 30000675
2021-07-26 14:47:47.419 4050954 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2021-04.com.xxxxxxxxx.iscsi-gw:xxxxxx/tpgt_3.
2021-07-26 14:47:47.523 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-1: HA not supported.
2021-07-26 14:47:47.601 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-4: HA not supported.
2021-07-26 14:47:47.649 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-2: HA not supported.
2021-07-26 14:47:47.649 4050954 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2021-04.com.xxxxxxxx.iscsi-gw:xxxx/tpgt_3.
2021-07-26 14:48:21.371 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-1: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-26 14:48:21.384 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-4: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-26 14:48:21.437 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-2: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-26 14:48:21.442 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-3: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
I expected the gateway to reconnect automatically after being disconnected by the timeout, but I can't discover or log in to the gateway on the node where tcmu-runner auto-restarted. Is there something I missed?
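As an aside, the unsupported CDB in the warnings above is harmless: byte 0 is the opcode and the low 5 bits of byte 1 are the service action, which decode to MAINTENANCE IN / REPORT SUPPORTED OPERATION CODES (SPC-4), an initiator capability probe tcmu-runner simply doesn't implement. A quick sketch of decoding the first two bytes:

```shell
# Decode the first two CDB bytes from the "a3 c 1 12 ..." warning.
cdb="a3 c 1 12 0 0 0 0 2 0 0 0"
set -- $cdb
# Byte 0 is the opcode; the low 5 bits of byte 1 are the service action.
printf 'opcode=0x%02x service_action=0x%02x\n' "0x$1" "$((0x$2 & 0x1f))"
# -> opcode=0xa3 service_action=0x0c
# 0xa3 = MAINTENANCE IN, service action 0x0c = REPORT SUPPORTED
# OPERATION CODES: a capability probe, safe to ignore.
```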
Are you using containers or something similar? I know that in some container environments the container service monitors the tcmu-runner service's status, and if it dies the container will try to pull it back up.
By default tcmu-runner won't be auto-restarted by its systemd service, so I am curious how this could happen for you.
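One way to check whether systemd (rather than a container supervisor) restarted the daemon is to inspect the unit and the journal; a sketch, assuming a systemd host (the fallback echoes just keep it from erroring elsewhere):

```shell
# Does the unit have a Restart= policy, and has systemd restarted it?
# (NRestarts needs systemd >= 235.)
systemctl show tcmu-runner -p Restart -p NRestarts 2>/dev/null \
    || echo "systemctl not available here"
# Any crash/exit records around the restart window:
journalctl -u tcmu-runner --since "2021-07-26 14:40" --until "2021-07-26 15:00" 2>/dev/null \
    || echo "journalctl not available here"
```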
Were you building tcmu-runner from source?
The reason this node reports "HA not supported" is usually that tcmu-runner was built against a lower Ceph version, one that doesn't support the exclusive lock.
While on the other node, was tcmu-runner built with a higher Ceph version?
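One way to compare the nodes is to check which Ceph client libraries each node's tcmu-runner binary is actually linked against; a sketch, assuming an RPM-based install like your CentOS 7 (package names librbd1/librados2 are the usual RPM names):

```shell
# Which librbd/librados is tcmu-runner linked against on this node?
ldd "$(command -v tcmu-runner)" 2>/dev/null | grep -E 'librbd|librados' \
    || echo "tcmu-runner not found on this host"
# Compare the installed Ceph client library versions across gateways:
rpm -q librbd1 librados2 2>/dev/null || echo "rpm not available here"
```

Running this on both the node that logs "HA not supported" and one that doesn't should show whether they diverge.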
Thank you for replying. The Ceph cluster version is 13.2.8 on all storage nodes, and tcmu-runner was built from the 1.4.0 source code. "HA not supported" occurred on the "other node", which I didn't paste yesterday, like this:
2021-07-27 15:04:47.234 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-1: io mod: read mode, data len: 262144, data offset: 7866031341568, elapsed_usec: 964577
2021-07-27 15:04:47.251 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 131072, data offset: 7623567015936, elapsed_usec: 590750
2021-07-27 15:04:47.251 4050954 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2021-04.com.xxxxxxxxx.iscsi-gw:xxxxxxx/tpgt_3.
2021-07-27 15:04:47.396 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-2: HA not supported.
2021-07-27 15:04:47.396 4050954 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2021-04.com.xxxxxxxxx.iscsi-gw:xxxxxxx/tpgt_3.
2021-07-27 15:05:46.317 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-1: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.337 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-4: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.348 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-3: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.361 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-2: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.454 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 0, elapsed_usec: 93241
2021-07-27 15:05:46.489 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 4096, elapsed_usec: 34772
What I can find on the gateway node (storage node) is:
[root@gateway6 ~]# netstat -anp |grep 3260
tcp 0 0 66.66.66.1:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.3:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.2:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.170:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.169:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.168:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.167:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.166:3260 0.0.0.0:* LISTEN -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47513 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46572 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47373 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47443 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47225 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46863 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47005 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47586 FIN_WAIT2 -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47293 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46722 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46933 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46430 TIME_WAIT -
tcp 0 0 66.66.66.1:33260 66.66.66.168:7159 ESTABLISHED 3000906/ceph-osd
tcp 0 0 66.66.66.1:3260 172.168.188.19:47152 TIME_WAIT -
tcp 0 0 88.88.88.1:7131 88.88.88.168:33260 ESTABLISHED 2999805/ceph-osd
tcp 0 0 88.88.88.1:7157 88.88.88.169:33260 ESTABLISHED 3007121/ceph-osd
tcp 0 0 66.66.66.1:3260 172.168.188.19:46651 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:47081 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46504 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46351 TIME_WAIT -
tcp 0 0 66.66.66.1:3260 172.168.188.19:46791 TIME_WAIT -
On the client, the iscsid service status:
[root@vm-node11 ~]# systemctl status iscsid
● iscsid.service - Open-iSCSI
Loaded: loaded (/usr/lib/systemd/system/iscsid.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2021-07-24 11:00:33 CST; 4 days ago
Docs: man:iscsid(8)
man:iscsiadm(8)
Main PID: 4459 (iscsid)
CGroup: /system.slice/iscsid.service
├─4458 /usr/sbin/iscsid
└─4459 /usr/sbin/iscsid
Jul 28 14:47:33 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:36 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:39 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:42 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:45 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:48 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:51 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:54 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:57 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:48:00 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
I will set "log_level = 5" in /etc/tcmu/tcmu.conf and reproduce this error.
Hello!
tcmu version: tcmu-runner-1.4.0
Linux version & kernel version: CentOS 7.7.1908, kernel 3.10.0-1127
Ceph version: 13.2.8 Mimic; 272 OSDs across 8 storage nodes; 32 rbd volumes (LUNs) on 8 clients; 8 gateways (1 client logs in to 1 gateway); 1 target
cluster IO: 650 MB/s, 1k IOPS write
What I did: simulate an inactive PG while watching the corresponding changes in cluster IO and the iSCSI LUNs.
What I expected to see: LUN times out -> disconnected -> re-login. But what did not go as expected:
(1) Some tcmu-runner services seem to have restarted after the timeout (4 of 8 tcmu-runner services were restarted).
(2) Cannot discover the target using the gateway IP:port (see the example below).
e.g.: vm-node11 (the client) was connected to 66.66.66.2:3260, but after tcmu-runner auto-restarted, the VM can't discover targets with the command: iscsiadm -m discovery -t sendtargets -p 66.66.66.2:3260
Command output on the client:
25970 2021-07-26 14:47:10.815 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-1: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25971 2021-07-26 14:47:10.826 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-2: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25972 2021-07-26 14:47:10.840 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-3: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25973 2021-07-26 14:47:10.840 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-1: Starting lock acquisition operation.
25974 2021-07-26 14:47:10.848 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-4: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25975 2021-07-26 14:47:10.849 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-2: Starting lock acquisition operation.
25976 2021-07-26 14:47:10.857 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-3: Starting lock acquisition operation.
25977 2021-07-26 14:47:10.865 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-4: Starting lock acquisition operation.
25978 2021-07-26 14:47:10.921 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-1: Acquired exclusive lock.
25979 2021-07-26 14:47:10.927 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-2: Acquired exclusive lock.
25980 2021-07-26 14:47:10.997 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-4: Acquired exclusive lock.
25981 2021-07-26 14:47:11.013 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-3: Acquired exclusive lock.
25982 2021-07-26 14:53:21.238 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25983 2021-07-26 14:53:21.238 1825526 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-11-3: Handler connection lost (lock state 1)
25984 2021-07-26 14:53:21.239 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25985 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25986 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25987 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25988 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25989 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25990 2021-07-26 14:53:38.087 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25991 2021-07-26 14:53:38.093 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25992 2021-07-26 14:53:38.097 1825526 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2021-04.com.fixxxxxxe.iscsi-gw:xxxtor/tpgt_7.
25993 2021-07-26 14:54:41.430 1832626 [INFO] dyn_config_start:422: Inotify is watching "/etc/tcmu/tcmu.conf", wd: 1, mask: IN_ALL_EVENTS
The log ends here for 3 hours.
END
Thank you for reading~