ipilcher / n5550

Thecus N5550 hardware support
GNU General Public License v2.0

soft lockup / hard lockup with n5550_ahci_leds #7

Closed mheubach closed 8 years ago

mheubach commented 9 years ago

Hi Ian,

I've got several Thecus N5550s deployed as backup appliances out in the wild. They run Ubuntu 14.04 Server with ZFS (ZoL), 8 GB of RAM and a bigger SATA DOM. I only use the kernel modules from your package; for the display and the LEDs we use a little Perl daemon.

All systems start getting soft lockup or even hard lockup problems from time to time.

The funny thing is that one system doesn't have these problems. After a short investigation I noticed that this system doesn't have the n5550_ahci_leds module loaded correctly.

All systems run under extreme I/O load except the one that doesn't have the module loaded, so I'm not sure whether there is any connection between the problems and the module at all.

On the other hand I took one system and removed the module - and the problems stopped at once. I have to keep an eye on this - perhaps it's only a coincidence but maybe other people experience the same problem.
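For anyone who wants to compare machines with and without the module, here is a minimal sketch of how the check could be scripted. It assumes the standard /proc/modules layout, where the module name is the first whitespace-separated field on each line; the function names are made up for illustration and are not part of the n5550 package.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first field of each line is the module name.
            names.add(fields[0])
    return names


def is_loaded(name, proc_modules_text):
    """Check whether a module is listed; the kernel reports names with underscores."""
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


# Typical use on a live system (hypothetical helper usage):
#     with open("/proc/modules") as f:
#         print(is_loaded("n5550_ahci_leds", f.read()))
```

This only tells you that the module is listed, not that it initialized its LEDs correctly, so treat it as a first filter rather than proof.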

Tomorrow I leave for my summer holidays and won't be checking my mail for the next two weeks.

See you Manfred

Aug 11 18:32:04 gdb kernel: [21302.282598] BUG: soft lockup - CPU#0 stuck for 23s! [sh:17149]
Aug 11 18:32:08 gdb kernel: [21306.310990] BUG: soft lockup - CPU#3 stuck for 23s! [z_wr_int/6:10890]
Aug 11 18:34:28 gdb kernel: [21446.308619] BUG: soft lockup - CPU#2 stuck for 22s! [z_wr_int/0:10884]
Aug 11 18:34:31 gdb kernel: [21449.773908] BUG: soft lockup - CPU#1 stuck for 26s! [zfs:17396]
Aug 11 18:37:53 gdb kernel: [21618.313323] BUG: soft lockup - CPU#0 stuck for 22s! [z_wr_int/1:10885]
Aug 11 18:37:53 gdb kernel: [21618.341322] BUG: soft lockup - CPU#3 stuck for 22s! [zfs:17683]
Aug 11 18:37:53 gdb kernel: [21646.316039] BUG: soft lockup - CPU#0 stuck for 22s! [z_wr_int/1:10885]
Aug 11 18:37:53 gdb kernel: [21646.344040] BUG: soft lockup - CPU#3 stuck for 22s! [zfs:17683]
Aug 11 19:49:31 gdb kernel: [25950.203309] BUG: soft lockup - CPU#1 stuck for 40s! [atop:1848]
Aug 11 19:50:32 gdb kernel: [26010.768321] BUG: soft lockup - CPU#3 stuck for 22s! [check_mk_agent:23267]
Aug 11 19:50:36 gdb kernel: [26014.752707] BUG: soft lockup - CPU#2 stuck for 23s! [z_rd_int/2:618]

Aug 11 18:37:53 gdb kernel: [21634.641112] Watchdog detected hard LOCKUP on cpu 2
Aug 11 18:38:52 gdb kernel: [21693.691667] Watchdog detected hard LOCKUP on cpu 2
Aug 11 18:38:52 gdb kernel: [21709.116575] Watchdog detected hard LOCKUP on cpu 3
Aug 11 18:57:18 gdb kernel: [22816.233769] Watchdog detected hard LOCKUP on cpu 1
Aug 11 19:09:35 gdb kernel: [23537.900333] Watchdog detected hard LOCKUP on cpu 2
Aug 11 19:09:35 gdb kernel: [23548.181553] Watchdog detected hard LOCKUP on cpu 3
Aug 11 19:14:40 gdb kernel: [23852.151805] Watchdog detected hard LOCKUP on cpu 2
Aug 11 19:16:57 gdb kernel: [23977.578800] Watchdog detected hard LOCKUP on cpu 0
Aug 11 19:16:57 gdb kernel: [23991.782611] Watchdog detected hard LOCKUP on cpu 3
Aug 11 19:18:21 gdb kernel: [24057.616686] Watchdog detected hard LOCKUP on cpu 3
Aug 11 19:18:21 gdb kernel: [24068.593960] Watchdog detected hard LOCKUP on cpu 2
Aug 11 19:19:34 gdb kernel: [24131.814256] Watchdog detected hard LOCKUP on cpu 2
Aug 11 19:19:34 gdb kernel: [24142.906667] Watchdog detected hard LOCKUP on cpu 3
Aug 11 19:47:56 gdb kernel: [25839.707564] Watchdog detected hard LOCKUP on cpu 2
Aug 11 19:47:56 gdb kernel: [25849.789703] Watchdog detected hard LOCKUP on cpu 3
Aug 11 19:48:49 gdb kernel: [25870.261765] Watchdog detected hard LOCKUP on cpu 0
Aug 11 19:48:49 gdb kernel: [25889.715303] Watchdog detected hard LOCKUP on cpu 2
Aug 11 19:48:49 gdb kernel: [25899.801234] Watchdog detected hard LOCKUP on cpu 3
Aug 11 19:57:47 gdb kernel: [26445.699093] Watchdog detected hard LOCKUP on cpu 1
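To compare systems, it may help to tally these messages per CPU instead of reading them by eye. The following is only a rough sketch: the two regular expressions are modelled on the message formats in the excerpts above, and the function name is an assumption for illustration. Other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the two message formats shown in the log excerpts.
SOFT = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]]+)\]")
HARD = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")


def tally_lockups(lines):
    """Count soft and hard lockup reports per CPU number.

    Uses finditer so that several log entries joined on one line
    are still counted individually.
    """
    soft = Counter()
    hard = Counter()
    for line in lines:
        for m in SOFT.finditer(line):
            soft[int(m.group(1))] += 1
        for m in HARD.finditer(line):
            hard[int(m.group(1))] += 1
    return soft, hard


# Typical use (the log path is an assumption; adjust for your system):
#     with open("/var/log/kern.log") as f:
#         soft, hard = tally_lockups(f)
#     print(sorted(soft.items()), sorted(hard.items()))
```

A roughly even spread across CPUs, as in the excerpts here, would hint at a system-wide stall rather than one misbehaving core, though that is an interpretation and not something the counts prove.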

ipilcher commented 9 years ago

On 08/12/2015 05:51 AM, mheubach wrote:

On the other hand I took one system and removed the module - and the problems stopped at once. I have to keep an eye on this - perhaps it's only a coincidence but maybe other people experience the same problem.

Funky. I haven't heard from anyone having this sort of problem, but I don't know of anyone else using the modules in your configuration either. Unfortunately, debugging something like this is a bit beyond my skill set.

The one thing that does occur to me as a possible cause is CPU temperature, particularly since you mentioned that the systems exhibiting the problem are under more load. The cooling in the N5550 is fairly marginal. I actually replaced the built-in system fan with a 120mm fan, mounted externally with an 80mm-120mm adapter. It might be something to check if it turns out that removing the module doesn't solve the problem.

Please do let me know what you figure out.

Ian Pilcher arequipeno@gmail.com

-------- "I grew up before Mark Zuckerberg invented friendship" --------

mheubach commented 8 years ago

Hi Ian,

I haven't had this problem for months now, but I stumbled upon it again while preparing a Thecus NAS with only 2 GB of RAM. To me it looks like, under I/O pressure and with insufficient RAM, ZFS claims slabs faster than it releases them and thereby causes a deadlock. I think this is a ZFS issue and has nothing to do with your code at all.

Manfred
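For anyone who wants to watch for the same pattern, a minimal sketch of reading the kernel's slab counters follows. It assumes the usual /proc/meminfo layout of "Key: value kB" lines and the standard Slab, SReclaimable and SUnreclaim fields; the function names are made up for illustration. It reports only the kernel's general slab accounting, which may not cover every cache that ZFS allocates, so it is a hint and not a diagnosis.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # Skip lines that are not "Key: value" pairs.
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def slab_summary(text):
    """Pick out the memory and slab counters relevant to slab growth."""
    info = parse_meminfo(text)
    keys = ("MemTotal", "MemFree", "Slab", "SReclaimable", "SUnreclaim")
    # Missing keys are reported as None rather than guessed.
    return {k: info.get(k) for k in keys}


# Typical use on a live system:
#     with open("/proc/meminfo") as f:
#         print(slab_summary(f.read()))
```

Sampling this periodically during a heavy backup run would show whether slab usage keeps climbing while free memory shrinks.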