Open karibertils opened 3 years ago
Hello, thanks for the report. This looks like a kernel bug in the target driver; I will look at it soon.
@karibertils Can you please post a link to the entire dmesg output?
@maurizio-lombardi Here is the log for the whole day when the issue last occurred. kernel.2020.12.28.log
There were some PCs trying to attach non-existent targets at the time. I don't know whether that could be related to the issue.
I have been rebooting every other day since then to avoid the issue. I can skip reboots to gather new logs if that helps. Should I also enable some debug mode?
Any news on this bug?
I have also tried Ubuntu Server 20.10, using kernel 5.8.0-44-generic and targetcli-fb 2.1.53, and the same thing happens there also.
@karibertils Hi, I am trying to reproduce it.
"Using kernel 5.8.0-44-generic and targetcli-fb 2.1.53 and same thing happens there also."
I am sure the bug is still present in the latest upstream kernel. Question: does it happen when a number of initiators try to connect to a non-existent target while, in the meantime, you delete another target via targetcli? Is that correct?
@maurizio-lombardi No, I believe it happens regardless of whether initiators are connecting to non-existent targets.
When the issue starts, every targetcli command hangs. Running targetcli ls, or even just targetcli to open the CLI interface, hangs.
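A hang like this could be detected from a monitoring script by running a command with a deadline. The following is only a minimal sketch: the helper name command_hangs and the 30-second default are assumptions, not something from this thread.

```python
import subprocess


def command_hangs(cmd, timeout_s=30.0):
    """Return True if cmd is still running after timeout_s seconds."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        # A task stuck in uninterruptible sleep (state D) ignores SIGKILL,
        # so send the signal but do not block waiting for the process to exit.
        proc.kill()
        return True
```

On the affected server, command_hangs(["targetcli", "ls"]) would be expected to return True once the problem has started, assuming targetcli is on PATH.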
I am network booting 80 PCs. Each of them has 1 target with 2 LUNs. LUN 0 is almost never changed, but LUN 1 is removed and re-added every time the PC boots.
example:
targetcli '/iscsi/iqn.2020-01.is.gz:192-168-101-152/tpg1/luns/ delete lun=1'
targetcli '/backstores/block/ delete games-192-168-101-152'
zfs destroy -R speedy/games/192-168-101-152
zfs destroy -R speedy/games/master@192-168-101-152
zfs snapshot speedy/games/master@192-168-101-152
zfs clone speedy/games/master@192-168-101-152 speedy/games/192-168-101-152
targetcli '/backstores/block/ create games-192-168-101-152 /dev/zvol/speedy/games/192-168-101-152'
targetcli '/iscsi/iqn.2020-01.is.gz:192-168-101-152/tpg1/luns create lun=1 /backstores/block/games-192-168-101-152'
Previously we removed and re-added LUN 1 only once every 24 hours, and the issue happened with similar frequency then. It can happen after running for 1-9 days, though it usually takes 3 or more days.
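For anyone trying to reproduce this, the per-PC sequence above can be generated for an arbitrary client. This is a sketch only: the function name lun_recycle_commands is hypothetical, and it assumes that target and dataset names are derived from the client IP by replacing dots with dashes, as the example suggests.

```python
def lun_recycle_commands(ip, pool="speedy/games", iqn_prefix="iqn.2020-01.is.gz"):
    """Build the delete/re-create command sequence for one client's LUN 1."""
    name = ip.replace(".", "-")      # assumed naming scheme: 192-168-101-152
    iqn = f"{iqn_prefix}:{name}"
    store = f"games-{name}"          # block backstore name
    snap = f"{pool}/master@{name}"   # per-client snapshot of the master image
    clone = f"{pool}/{name}"         # per-client clone backing LUN 1
    return [
        ["targetcli", f"/iscsi/{iqn}/tpg1/luns/ delete lun=1"],
        ["targetcli", f"/backstores/block/ delete {store}"],
        ["zfs", "destroy", "-R", clone],
        ["zfs", "destroy", "-R", snap],
        ["zfs", "snapshot", snap],
        ["zfs", "clone", snap, clone],
        ["targetcli", f"/backstores/block/ create {store} /dev/zvol/{clone}"],
        ["targetcli", f"/iscsi/{iqn}/tpg1/luns create lun=1 /backstores/block/{store}"],
    ]
```

Calling lun_recycle_commands("192.168.101.152") yields the eight commands shown in the example, each as an argument list that can be logged or executed one at a time.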
Hmm, I asked because of the following backtrace:
Dec 28 17:17:16 rocky kernel: [361650.549542] INFO: task targetcli:3414923 blocked for more than 120 seconds.
Dec 28 17:17:16 rocky kernel: [361650.558420] Tainted: P O 5.4.78-2-pve #1
Dec 28 17:17:16 rocky kernel: [361650.566177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 28 17:17:16 rocky kernel: [361650.575912] targetcli D 0 3414923 3406117 0x00004000
Dec 28 17:17:16 rocky kernel: [361650.583247] Call Trace:
Dec 28 17:17:16 rocky kernel: [361650.587097] __schedule+0x2e6/0x6f0
Dec 28 17:17:16 rocky kernel: [361650.591884] ? signal_wake_up_state+0x19/0x30
Dec 28 17:17:16 rocky kernel: [361650.597637] schedule+0x33/0xa0
Dec 28 17:17:16 rocky kernel: [361650.602027] schedule_timeout+0x205/0x330
Dec 28 17:17:16 rocky kernel: [361650.607478] wait_for_completion+0xb7/0x140
Dec 28 17:17:16 rocky kernel: [361650.613097] ? wake_up_q+0x80/0x80
Dec 28 17:17:16 rocky kernel: [361650.617818] iscsit_reset_np_thread+0xb4/0xe0 [iscsi_target_mod]
Dec 28 17:17:16 rocky kernel: [361650.625519] iscsit_tpg_del_network_portal+0xed/0x190 [iscsi_target_mod]
Dec 28 17:17:16 rocky kernel: [361650.634118] lio_target_call_delnpfromtpg+0x30/0x90 [iscsi_target_mod]
This makes me think that there is a race condition somewhere that breaks the refcounting on the tpg_np structure, so the iscsit_reset_np_thread() kernel function gets stuck in wait_for_completion().
Note that lio_target_call_delnpfromtpg() is called when you execute a command like "targetcli iscsi/ delete iqn...."
If you have new dmesg logs, please post them here; they might help.
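To pull these hung-task reports out of a long kernel log, something like the following could be used. It is a rough sketch: the helper name find_hung_tasks is hypothetical, and the pattern assumes the message format seen in the backtrace quoted in this thread.

```python
import re

# Matches e.g. "[361650.549542] INFO: task targetcli:3414923 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return one dict per hung-task report found in the given log lines."""
    reports = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            reports.append({
                "uptime": float(match.group("uptime")),  # seconds since boot
                "comm": match.group("comm"),             # task name
                "pid": int(match.group("pid")),
                "seconds": int(match.group("secs")),     # hung-task threshold
            })
    return reports
```

Passing an open log file to find_hung_tasks gives the task name, PID and timestamp of each report, which makes it easier to see whether it is always targetcli that gets stuck.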
OK, that sounds plausible. But we did previously try a few times to make sure there were no connections to non-existent targets, and the issue persisted. I guess there have always been at least a few attempts, though.
We only delete targets a few times a day. But the boot script deletes and re-adds the LUNs on every boot, so there's more stress on that path.
Here is a recent kernel log. targetcli-hanging.log
I have 80 targets, each with 2 block devices using zfs clones from snapshot.
The issue usually starts after 3-7 days. When it starts, various commands hang indefinitely.
Right now I'm running
/iscsi> delete iqn.2020-01.is.gz:192-168-101-163
and it's frozen indefinitely. Logs show
Running Proxmox 6.3 with kernel 5.4.78-2-pve. Using targetcli-fb version 2.1.48-2