minipli / linux-unofficial_grsec

Unofficial forward ports of the last publicly available grsecurity patch

4.9.68->4.9.70 develops hardware-related issue landing in blk_mq_run_hw_queue #22

Closed sempervictus closed 6 years ago

sempervictus commented 6 years ago

Putting this here for reference, as I'm not able to track this to the source at present. Somewhere in the last update cycle we found that a specific server in our datacenter can't access its SSDs without tripping a pretty long stack trace which seems to mention scsi (we couldn't capture it in full given the nature of the problem: the host won't get as far as decrypting the OS volume, and udev seems to kill it). The top frames we did catch were `__list_del_entry`, `__blk_mq_run_hw_queue`, `blk_mq_run_hw_queue`, `blk_mq_insert_requests.constprop`, and `blk_mq_flush_plug_list`. It looks like something specific to the drivers in that host went south between .68 and .70.

Having noticed that the EFI subtree (potentially touchy, as this is old gear) was also touched in that range, we reproduced the crash in BIOS mode as well, which rules the EFI changes out.

Storage controllers are:

```
00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)
03:00.0 Serial Attached SCSI controller: Intel Corporation C606 chipset Dual 4-Port SATA/SAS Storage Control Unit (rev 06)
```
using the `ahci` and `isci` modules respectively. Somewhere between one of those controllers and the blk-mq call path, things go south.

Rolling back to 4.9.68 resolved the problem. A similarly old host using `uhci_hcd` for its Intel controller is unaffected (the LSI SAS controller in there is also fine, currently resilvering ~1T of ZFS data without issue).

Thanks as always for maintaining this tree, hope what little data we were able to pull along the way is helpful in debugging this. Will add more as we get it.
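
To pin a regression like this to a single commit, a `git bisect` between the last-good and first-bad kernels is the usual approach; a minimal sketch, assuming stable-style tags are present in this tree (the tag names here are illustrative):

```
# Minimal bisect sketch; tag names are illustrative and may differ in this tree.
git bisect start
git bisect bad v4.9.70      # first known-bad kernel
git bisect good v4.9.68     # last known-good kernel
# Build and boot the candidate git checks out, then record the result:
git bisect good             # or: git bisect bad
# Repeat until git names the first bad commit, then clean up:
git bisect reset
```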

theLOICofFRANCE commented 6 years ago

Hi,

There were upstream changes to the blk code in version 4.9.69.

Try reverting the blk commits introduced by those changes, as in the sketch below.
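
A minimal sketch of how one might do that, assuming stable-style tags (tag names are illustrative; adjust for this tree):

```
# List block-layer commits that landed between 4.9.68 and 4.9.69.
git log --oneline v4.9.68..v4.9.69 -- block/

# Revert a suspect commit without committing, so several can be stacked and tested together:
git revert --no-commit <commit-sha>
```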

sempervictus commented 6 years ago

What's odd is that it's only one system; others with identical workloads but different hardware are OK. Unfortunately it's a production box backing an OpenStack controller and Cinder, so testing operations are limited. I'll try again over the quiet week coming up and see if I can't trap the entire stack trace somehow.

minipli commented 6 years ago

There were a few changes to the blk-mq code in that version range. However, there was another fix in .71. So could you please test whether .71 or .72 fixes the issue for you? If not, a full backtrace is required for further debugging. A screenshot, at least.
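
One way to capture a full trace on a host that dies before local logging is usable is netconsole, which streams kernel messages to another machine over UDP; a minimal sketch, with all addresses and interface names illustrative:

```
# On the crashing host: stream kernel messages over UDP (addresses illustrative).
# The same string can instead be passed on the kernel command line as netconsole=...
modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55

# On the receiving host: capture everything sent to the chosen port.
nc -u -l -p 6666 | tee crash.log   # flag syntax varies between netcat variants
```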

sempervictus commented 6 years ago

@minipli: looks good now. I was able to push .72 to the box this evening; it just came up from a reboot, using the very SLCs for its zpool intent logs that caused the panic in the first place, with no issue. The system runs a pair of KVM controllers for OpenStack clouds and a Cinder volume node in LXC, atop a zpool residing on both storage controllers (different vdev types on each). In other words, this will get a good "workout" that should shake out any remaining gremlins. I'll close this out for now and reopen it if the problem comes back and I can pull a proper stack trace out of the pstore or somewhere. Thanks as always.
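
For reference, panic records saved by pstore (via the ERST, EFI, or ramoops backends) show up as plain files; a minimal sketch of pulling them out, assuming a kernel with pstore support:

```
# pstore records from prior panics appear as files once the filesystem is mounted.
mount -t pstore pstore /sys/fs/pstore   # usually already mounted by the distro
ls -l /sys/fs/pstore
cat /sys/fs/pstore/dmesg-*              # record names vary by backend (erst, efi, ramoops)
```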