[BUG] targetcli commands hanging

karibertils commented 3 years ago

I have 80 targets, each with 2 block devices using zfs clones from snapshot.

The issue starts usually after 3-7 days. When it starts, various commands hang indefinently.

Right now I'm running /iscsi> delete iqn.2020-01.is.gz:192-168-101-163 and it's frozen indefinently.

logs show

[361892.210559] INFO: task targetcli:3414923 blocked for more than 362 seconds.
[361892.219399]       Tainted: P           O      5.4.78-2-pve #1                          
[361892.226796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[361892.236489] targetcli       D    0 3414923 3406117 0x00004000                          
[361892.243839] Call Trace:                    
[361892.247510]  __schedule+0x2e6/0x6f0                                                    
[361892.252312]  ? signal_wake_up_state+0x19/0x30
[361892.258079]  schedule+0x33/0xa0                                                        
[361892.262472]  schedule_timeout+0x205/0x330  
[361892.267868]  wait_for_completion+0xb7/0x140                                            
[361892.273442]  ? wake_up_q+0x80/0x80         
[361892.278144]  iscsit_reset_np_thread+0xb4/0xe0 [iscsi_target_mod]                       
[361892.285765]  iscsit_tpg_del_network_portal+0xed/0x190 [iscsi_target_mod]
[361892.294186]  lio_target_call_delnpfromtpg+0x30/0x90 [iscsi_target_mod]                 
[361892.302411]  target_fabric_np_base_release+0x31/0x40 [target_core_mod]
[361892.310633]  config_item_release+0x62/0xd0                                             
[361892.316042]  config_item_put+0x29/0x30     
[361892.321044]  configfs_rmdir+0x1bd/0x300                                                
[361892.326155]  vfs_rmdir+0x81/0x180          
[361892.330658]  do_rmdir+0x1b6/0x210                                                      
[361892.335137]  __x64_sys_rmdir+0x17/0x20     
[361892.340166]  do_syscall_64+0x57/0x190                                                  
[361892.345599]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[361892.352031] RIP: 0033:0x7f77999cb027                                                   
[361892.356797] Code: Bad RIP value.           
[361892.361159] RSP: 002b:00007ffd828179c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054     
[361892.370411] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 00007f77999cb027
[361892.379189] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f7796820e50          
[361892.387956] RBP: 0000000001952f30 R08: 0000000000000000 R09: 0000000000761090
[361892.396716] R10: 000000000076108f R11: 0000000000000246 R12: 00007f77992ed408          
[361892.405426] R13: 00007f7797995d60 R14: 00000000005ff420 R15: 0000000001952f30

Running Proxmox 6.3 with kernel 5.4.78-2-pve. Using targetcli-fb version 2.1.48-2

maurizio-lombardi commented 3 years ago

Hello, thanks for the report. This looks like a kernel bug in the target driver, I will look at it soon

maurizio-lombardi commented 3 years ago

@karibertils Can you please post a link to the entire dmesg output?

karibertils commented 3 years ago

@maurizio-lombardi Here is the log for the whole day when issue occurred last time. kernel.2020.12.28.log

There were some PC's trying to attach non existing targets at the time. Don't know if that could be related to the issue ?

I have been rebooting every other day since to avoid the issue. I can skip reboots to gather new logs if that helps. Maybe enable some debug mode also ?

karibertils commented 3 years ago

No news on this bug ?

I have tried also Ubuntu Server 20.10. Using kernel 5.8.0-44-generic and targetcli-fb 2.1.53 and same thing happens there also.

maurizio-lombardi commented 3 years ago

@karibertils Hi, I am trying to reproduce it.

"Using kernel 5.8.0-44-generic and targetcli-fb 2.1.53 and same thing happens there also."

I am sure the bug is still present in the latest upstream kernel. Question: it happens when a number of initiators try to connect to a non-existing target and in the meanwhile you delete another target via targetcli? Correct?

karibertils commented 3 years ago

@maurizio-lombardi No I believe it happens regardless of initiators connecting to non-existing targets or not.

When the issue starts every targetcli command hangs. Doing targetcli ls or even just targetcli to open the cli interface hangs.

I am network booting 80 PC's. All of them have 1 target which has 2 LUN's. LUN 0 is almost never changed. But LUN 1 is removed and re-added every time the PC's boot.

example:

targetcli '/iscsi/iqn.2020-01.is.gz:192-168-101-152/tpg1/luns/ delete lun=1'
targetcli '/backstores/block/ delete games-192-168-101-152'
zfs destroy -R speedy/games/192-168-101-152
zfs destroy -R speedy/games/master@192-168-101-152
zfs snapshot speedy/games/master@192-168-101-152
zfs clone speedy/games/master@192-168-101-152 speedy/games/192-168-101-152
targetcli '/backstores/block/ create games-192-168-101-152 /dev/zvol/speedy/games/192-168-101-152'
targetcli '/iscsi/iqn.2020-01.is.gz:192-168-101-152/tpg1/luns create lun=1 /backstores/block/games-192-168-101-152'

We did previously remove and re-add LUN 1 once every 24 hours. And the issue happened with similar frequency then. It can happen after running for 1-9 days. Usually it takes 3+ days though.

maurizio-lombardi commented 3 years ago

Hmm, I asked because of the following backtrace

Dec 28 17:17:16 rocky kernel: [361650.549542] INFO: task targetcli:3414923 blocked for more than 120 seconds. Dec 28 17:17:16 rocky kernel: [361650.558420] Tainted: P O 5.4.78-2-pve #1 Dec 28 17:17:16 rocky kernel: [361650.566177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Dec 28 17:17:16 rocky kernel: [361650.575912] targetcli D 0 3414923 3406117 0x00004000 Dec 28 17:17:16 rocky kernel: [361650.583247] Call Trace: Dec 28 17:17:16 rocky kernel: [361650.587097] __schedule+0x2e6/0x6f0 Dec 28 17:17:16 rocky kernel: [361650.591884] ? signal_wake_up_state+0x19/0x30 Dec 28 17:17:16 rocky kernel: [361650.597637] schedule+0x33/0xa0 Dec 28 17:17:16 rocky kernel: [361650.602027] schedule_timeout+0x205/0x330 Dec 28 17:17:16 rocky kernel: [361650.607478] wait_for_completion+0xb7/0x140 Dec 28 17:17:16 rocky kernel: [361650.613097] ? wake_up_q+0x80/0x80 Dec 28 17:17:16 rocky kernel: [361650.617818] iscsit_reset_np_thread+0xb4/0xe0 [iscsi_target_mod] Dec 28 17:17:16 rocky kernel: [361650.625519] iscsit_tpg_del_network_portal+0xed/0x190 [iscsi_target_mod] Dec 28 17:17:16 rocky kernel: [361650.634118] lio_target_call_delnpfromtpg+0x30/0x90 [iscsi_target_mod]

that makes me think that there is a race condition somewhere, causing a problem with the refcounting against the tpg_np structure so the iscsit_reset_np_thread() kernel function gets stuck in wait_for_completion()

Note that lio_target_call_delnpfromtpg() is called when you execute a command like "targetcli iscsi/ delete iqn...."

If you have new dmesg logs please post them here, they might help

karibertils commented 3 years ago

Ok that sounds plausible. But we did previously try few times to make sure there were no connections to non-existant targets. And the issue did persist. But I guess there have always been at least a few attempts.

We only delete targets few times a day. But the boot script deletes&readds the LUN's every boot so there's more stress on that path.

Here is a recent kernel log. targetcli-hanging.log

open-iscsi / targetcli-fb

[BUG] targetcli commands hanging #178