openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS io error when disks are in idle/standby/spindown mode #4713

Closed johnkeates closed 5 years ago

johnkeates commented 8 years ago

Whenever one or more disks in one of my pools are sleeping after being idle, ZFS (via ZED) spams me with I/O errors (by email, because that's how I set it up).
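(For context, ZED's email notifications are configured in /etc/zfs/zed.d/zed.rc. A minimal sketch for the 0.6.x series, where the variable was still named ZED_EMAIL (later releases renamed it ZED_EMAIL_ADDR); the address itself is hypothetical:)

```
# /etc/zfs/zed.d/zed.rc -- send ZED event notifications by email.
ZED_EMAIL="john@example.org"
```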

It's always this kind of error, with only the vpath, vguid, and eid changing:

ZFS has detected an io error:

  eid: 15
class: io
 host: clava
 time: 2016-05-29 23:21:18+0200
vtype: disk
vpath: /dev/disk/by-id/ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5736249-part1
vguid: 0x0094F35F53B1888B
cksum: 0
 read: 0
write: 0
 pool: greenpool

dmesg shows:

[ 3647.748383] sd 1:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 3647.748386] sd 1:0:0:0: [sdd] tag#0 CDB: Read(10) 28 00 64 9d ac c8 00 00 08 00
[ 3647.748388] blk_update_request: I/O error, dev sdd, sector 1688054984
[ 3647.748401] sd 1:0:1:0: [sde] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 3647.748402] sd 1:0:1:0: [sde] tag#1 CDB: Read(10) 28 00 b4 26 aa 70 00 00 08 00
[ 3647.748403] blk_update_request: I/O error, dev sde, sector 3022432880
[ 3647.748408] sd 1:0:3:0: [sdg] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 3647.748409] sd 1:0:3:0: [sdg] tag#2 CDB: Read(10) 28 00 b4 26 ca 78 00 00 08 00
[ 3647.748410] blk_update_request: I/O error, dev sdg, sector 3022441080
[ 3655.074695] sd 1:0:2:0: [sdf] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 3655.074699] sd 1:0:2:0: [sdf] tag#8 CDB: Read(10) 28 00 64 9d b8 c0 00 00 08 00
[ 3655.074700] blk_update_request: I/O error, dev sdf, sector 1688058048
[ 3655.074712] sd 1:0:2:0: [sdf] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 3655.074713] sd 1:0:2:0: [sdf] tag#10 Sense Key : Not Ready [current] 
[ 3655.074715] sd 1:0:2:0: [sdf] tag#10 Add. Sense: Logical unit not ready, initializing command required
[ 3655.074716] sd 1:0:2:0: [sdf] tag#10 CDB: Read(10) 28 00 64 9d 80 e8 00 00 08 00
[ 3655.074717] blk_update_request: I/O error, dev sdf, sector 1688043752
[ 3655.074721] sd 1:0:2:0: [sdf] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 3655.074722] sd 1:0:2:0: [sdf] tag#13 Sense Key : Not Ready [current] 
[ 3655.074723] sd 1:0:2:0: [sdf] tag#13 Add. Sense: Logical unit not ready, initializing command required
[ 3655.074724] sd 1:0:2:0: [sdf] tag#13 CDB: Read(10) 28 00 64 9d 90 60 00 00 08 00
[ 3655.074725] blk_update_request: I/O error, dev sdf, sector 1688047712

Scrubbing finds nothing to fix, as there is no data corruption and the pools stay healthy; there are just a few errors. As far as I can see, either ZFS isn't waiting long enough for the disks to spin up (they do spin up on access), or it issues a command before checking that the disk is ready for it.

The pool status:

john@clava:~$ sudo zpool status greenpool
  pool: greenpool
 state: ONLINE
  scan: scrub repaired 0 in 19h28m with 0 errors on Sun May 29 17:09:56 2016
config:

    NAME                                          STATE     READ WRITE CKSUM
    greenpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5757832  ONLINE       0     0     0
        ata-WDC_WD20EARX-00PASB0_WD-WCAZA8848843  ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD20EARX-00PASB0_WD-WCAZA8841762  ONLINE       0     0     0
        ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5736249  ONLINE       0     0     0

errors: No known data errors

I can disable spindown/standby, but not all pools are always in use; some are only archives. The I/O errors only appeared after I enabled the standby timeouts, so to me it sounds like ZFS or ZoL doesn't deal with spindown or standby very well.
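For reference, on Debian the standby timeouts mentioned here are typically set per disk in /etc/hdparm.conf; a sketch using one of the pool's disks (the timeout value is an assumption):

```
# /etc/hdparm.conf -- per-disk standby (spindown) timeout.
# spindown_time uses the hdparm -S encoding: values 1-240 mean n * 5 seconds,
# so 120 = 10 minutes of idle before spindown; 0 disables the timeout.
/dev/disk/by-id/ata-WDC_WD20EARS-00MVWB0_WD-WCAZA5736249 {
    spindown_time = 120
}
```

Separately, `hdparm -C /dev/disk/by-id/...` reports a drive's current power state (active/idle, standby, or sleeping) without waking it, which makes it easy to confirm a drive was asleep at the moment the errors fired.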

Additional data:

Using ZFS 0.6.5.6-2 with Linux 4.5.0-2-amd64.

hoppel118 commented 6 years ago

Hello @brianmduncan ,

thanks for reporting your success. I also want to try it.

Can you describe exactly what you did to update your LSI 3008 to the latest IT firmware (v15)?

Thanks and greetings Hoppel

crimp42 commented 6 years ago

Actually, I have been meaning to reply to this thread: in the end it did NOT fix the issue.

After my initial success I thought I had it fixed and moved on, but while troubleshooting a couple of weeks ago I noticed one of those errors in my logs and started looking closer. I ran some tests last weekend; the errors are definitely less frequent, but still present.

Sorry for the false hope. I was hoping to do more testing, but I am still guessing that whatever changed from kernel 2.x to 3.x (and is still present in 4.x) is the cause, and until that is worked around we will have to deal with this. I have also moved to 10 Gig network cards that require a 3.x/4.x kernel, so I no longer have the ability to move back to a 2.x kernel.


hoppel118 commented 6 years ago

Oof... thanks for responding that fast, sadly with not such good news.

I am still considering upgrading my LSI 3008. Can you please describe briefly what you did to upgrade the controller? I found some howtos, but I am not sure they are complete.

Greetings Hoppel

crimp42 commented 6 years ago

Actually, I don't recall exactly now; it was back in November. I know I downloaded 9300_8i_Package_P15_IR_IT_FW_BIOS_for_MSDOS_Windows.zip (which I assume I got from the Broadcom site at https://www.broadcom.com/products/storage/host-bus-adapters/sas-9300-8i ). I extracted the firmware/BIOS to a bootable USB drive I had around from when I used to flash my LSI 9211-8i's. I do recall having to look around for a UEFI flasher to copy over to the USB drive I was booting from; then, from what I recall, I just flashed the IT firmware with sas3flash. I think I used the file from SAS3FLASH_P15.zip.

Sorry I can't be more specific; I do this so rarely that I don't recall the exact steps. I just took a look at the USB drive I used back then, and it has 9300.bin, 9300.rom, and sas3flash.efi on it. I am guessing I renamed the BIOS to 9300.rom and the firmware to 9300.bin from the zip, so that when I ran sas3flash from the UEFI shell I would not have to type the lengthy default names. I also recall flashing each one with a different switch.

cobrafast commented 6 years ago

I haven't had any problems with my Dell SAS2008 (9211-8i firmware) card since I patched the kernel the way I described earlier in this thread.

However, I want to add that Broadcom offers a script collection that gathers a lot of debug data that can be inspected with regular tools (e.g. text editors): https://www.broadcom.com/support/knowledgebase/1211161499563/lsiget-data-capture-script

I didn't find anything useful with it (probably also because a lot of what it collects is either very generic or very low-level), but maybe you guys have better luck with it.

The Broadcom rep who pointed me to it also recommended flashing the card to MegaRAID mode first, which I didn't do.

hoppel118 commented 6 years ago

Hi guys,

I updated my LSI 3008 to the latest IT firmware and checked whether the issue persists. Sadly, it's still there. I documented my update procedure (in German) in the following thread:

https://forums.freenas.org/index.php?threads/how-do-flashanleitung-lsi-sas3008-auf-mainboard-supermicro-x11ssh-ctf-%C3%BCber-integrierte-uefi-shell.42558/

Don't mind the forum; I don't use FreeNAS, I use openmediavault 4 on the following base:

root@omv4:~# cat /etc/debian_version
9.3
root@omv4:~# uname -a
Linux omv4 4.14.0-0.bpo.2-amd64 #1 SMP Debian 4.14.7-1~bpo9+1 (2017-12-22) x86_64 GNU/Linux
root@omv4:~# modinfo zfs
filename:       /lib/modules/4.14.0-0.bpo.2-amd64/updates/dkms/zfs.ko
version:        0.7.5-1~bpo9+1
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     9C78552EF2E79ADAE6389FB
depends:        spl,znvpair,zcommon,zunicode,zavl,icp
name:           zfs
vermagic:       4.14.0-0.bpo.2-amd64 SMP mod_unload modversions
root@omv4:~# dmesg | grep mpt3sas
[    1.191166] mpt3sas version 15.100.00.00 loaded
[    1.192677] mpt3sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (65928596 kB)
[    1.243182] mpt3sas_cm0: MSI-X vectors supported: 96, no of cores: 8, max_msix_vectors: -1
[    1.243691] mpt3sas0-msix0: PCI-MSI-X enabled: IRQ 31
[    1.243692] mpt3sas0-msix1: PCI-MSI-X enabled: IRQ 32
[    1.243694] mpt3sas0-msix2: PCI-MSI-X enabled: IRQ 33
[    1.243695] mpt3sas0-msix3: PCI-MSI-X enabled: IRQ 34
[    1.243696] mpt3sas0-msix4: PCI-MSI-X enabled: IRQ 35
[    1.243697] mpt3sas0-msix5: PCI-MSI-X enabled: IRQ 36
[    1.243698] mpt3sas0-msix6: PCI-MSI-X enabled: IRQ 37
[    1.243699] mpt3sas0-msix7: PCI-MSI-X enabled: IRQ 38
[    1.243702] mpt3sas_cm0: iomem(0x00000000df240000), mapped(0xffffbf3e86da0000), size(65536)
[    1.243703] mpt3sas_cm0: ioport(0x000000000000e000), size(256)
[    1.308105] mpt3sas_cm0: sending message unit reset !!
[    1.309657] mpt3sas_cm0: message unit reset: SUCCESS
[    1.373123] mpt3sas_cm0: Allocated physical memory: size(8482 kB)
[    1.373125] mpt3sas_cm0: Current Controller Queue Depth(2936),Max Controller Queue Depth(3072)
[    1.373126] mpt3sas_cm0: Scatter Gather Elements per IO(128)
[    1.424173] mpt3sas_cm0: LSISAS3008: FWVersion(15.00.03.00), ChipRevision(0x02), BiosVersion(08.35.00.00)
[    1.424175] mpt3sas_cm0: Protocol=(
[    1.425054] mpt3sas_cm0: sending port enable !!
[    1.427858] mpt3sas_cm0: host_add: handle(0x0001), sas_addr(0x5003048002c46300), phys(8)
[    1.444105] mpt3sas_cm0: port enable: SUCCESS

As you can see, everything is really fresh and up-to-date. ;)

Greetings Hoppel

h1z1 commented 6 years ago

[ 1.373125] mpt3sas_cm0: Current Controller Queue Depth(2936),Max Controller Queue Depth(3072)

Curious. Is write caching enabled, and which block scheduler are you using?

hoppel118 commented 6 years ago

Write cache is disabled:

quiet
/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCCXXXXXXXXX { write_cache = off }
/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCCXXXXXXXXX { write_cache = off }
/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCCXXXXXXXXX { write_cache = off }
/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCCXXXXXXXXX { write_cache = off }
/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCCXXXXXXXXX { write_cache = off }
/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCCXXXXXXXXX { write_cache = off }
/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCCXXXXXXXXX { write_cache = off }
/dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCCXXXXXXXXX { write_cache = off }

Where do I find information about the block scheduler?

Greetings Hoppel

h1z1 commented 6 years ago

Where do I find information about the block scheduler?

Should be under /sys/block/sd?/queue/scheduler
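For reference, the entry shown in brackets when you cat those files is the currently active scheduler. If a different scheduler is wanted persistently, a udev rule is the usual route; a sketch assuming a classic (non-blk-mq) 4.x kernel, with a hypothetical file name:

```
# /etc/udev/rules.d/60-io-scheduler.rules (hypothetical file name)
# Select the deadline scheduler for all SATA/SAS disks as they appear.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline"
```

Whether the scheduler choice has any effect on the spin-up errors discussed here is an open question in this thread.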

hoppel118 commented 6 years ago

Hello @h1z1

root@omv4:~# ls -l /sys/block/sd?/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sda/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdb/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdc/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdd/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sde/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdf/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdg/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdh/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdi/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdj/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdk/queue/scheduler
-rw-r--r-- 1 root root 4096 Feb 7 12:47 /sys/block/sdl/queue/scheduler

root@omv4:~# cat /sys/block/sda/queue/scheduler
[noop] deadline cfq
root@omv4:~# cat /sys/block/sdb/queue/scheduler
[noop] deadline cfq
root@omv4:~# cat /sys/block/sdc/queue/scheduler
[noop] deadline cfq
root@omv4:~# cat /sys/block/sdd/queue/scheduler
[noop] deadline cfq
root@omv4:~# cat /sys/block/sde/queue/scheduler
[noop] deadline cfq
root@omv4:~# cat /sys/block/sdf/queue/scheduler
[noop] deadline cfq
root@omv4:~# cat /sys/block/sdg/queue/scheduler
[noop] deadline cfq
root@omv4:~# cat /sys/block/sdh/queue/scheduler
[noop] deadline cfq
root@omv4:~# cat /sys/block/sdi/queue/scheduler
noop [deadline] cfq
root@omv4:~# cat /sys/block/sdj/queue/scheduler
noop [deadline] cfq
root@omv4:~# cat /sys/block/sdk/queue/scheduler
noop [deadline] cfq
root@omv4:~# cat /sys/block/sdl/queue/scheduler
noop [deadline] cfq

Sorry, I have no idea how the block scheduler could help with this issue.

What should I do/configure?

Greetings Hoppel

flo82 commented 6 years ago

I switched from a Broadcom SAS2008 to a Marvell 88SE9485 chipset and the problem is gone. It seems to be a kernel issue.

The problem described above only occurs when a controller that uses the "mpt2sas" kernel module is installed; HBA controllers that use the "mvsas" kernel module work fine.

Tested with Ubuntu 16.04 and kernel 4.4.0-112.

hoppel118 commented 6 years ago

Yeah, but changing to another chipset is no solution for me! :)

Regards Hoppel

hoppel118 commented 6 years ago

@behlendorf Do you think this can get sorted out by the 0.8 milestone?

Greetings Hoppel

red-scorp commented 6 years ago

Same problem on a Z87 Extreme11/ac: 22 x SATA3 ports (16 x SAS3 12.0 Gb/s from the LSI SAS 3008 controller + 3X24R expander, plus 6 x SATA3 6.0 Gb/s).

OS: Ubuntu 18.04 dev

$ cat /etc/issue
Ubuntu Bionic Beaver (development branch) \n \l
$ uname -a
Linux AGVault 4.15.0-12-generic #13-Ubuntu SMP Thu Mar 8 06:24:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ dpkg -l zfs*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
un  zfs            <none>       <none>       (no description available)
un  zfs-dkms       <none>       <none>       (no description available)
un  zfs-dracut     <none>       <none>       (no description available)
un  zfs-fuse       <none>       <none>       (no description available)
un  zfs-initramfs  <none>       <none>       (no description available)
un  zfs-modules    <none>       <none>       (no description available)
ii  zfs-zed        0.7.5-1ubunt amd64        OpenZFS Event Daemon
un  zfsutils       <none>       <none>       (no description available)
ii  zfsutils-linux 0.7.5-1ubunt amd64        command-line tools to manage Open

ZFS hangs on the spin-up of SATA HDDs, so I assume it's a problem between the LSI controller driver and ZFS. mpt3sas 17.100.00.00

I'll try BIOS updates; let's see if that fixes the problem.

UPDATE: I've updated the motherboard BIOS and flashed the SAS controller to IT mode with the newest firmware available for the 9300 card. This did not help with the disk spin-up problem. Funnily enough, it's not only ZFS that freezes, but hddtemp and smartctl too, so this issue might be caused not by ZFS but by misbehavior of mpt3sas itself.

Please let me know if you have found any solution or workaround for the disks freezing on spin-up. Thanks in advance!

da-tex commented 6 years ago

We're affected, too.

This has been happening since we upgraded from jessie to stretch (i.e. to a 4.x kernel) a few months ago, and it has persisted across several kernel upgrades.

# uname -r
4.9.0-6-amd64

and

# lspci | grep SCSI
03:00.0 Serial Attached SCSI controller: Adaptec PMC-Sierra PM8001 SAS HBA [Series 6H] (rev 05)
05:00.0 Serial Attached SCSI controller: Adaptec PMC-Sierra PM8001 SAS HBA [Series 6H] (rev 05)

If you need specific information, we'd be happy to provide it; this mail always scares the IT staff, because it could be something serious.

satmandu commented 6 years ago

For what it's worth, I'm not getting any more of these read errors since I modified the kernel boot parameters as follows in /etc/default/grub, ran update-grub, and rebooted:

GRUB_CMDLINE_LINUX_DEFAULT="mpt3sas.msix_disable=1"

(Running drives off an LSI SAS2008 controller with ZFS 0.7.8 on a 4.15.17 kernel on an Ubuntu 18.04 system.)

Previously I was getting tons of these errors on a mirror array:

[ 1724.989700] sd 0:0:4:0: [sde] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 1724.989706] sd 0:0:4:0: [sde] tag#1 CDB: Read(16) 88 00 00 00 00 02 6b b8 de 40 00 00 00 08 00 00
[ 1724.989710] print_req_error: I/O error, dev sde, sector 10397212224
[ 1895.361198] sd 0:0:5:0: [sdf] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 1895.361205] sd 0:0:5:0: [sdf] tag#4 CDB: Read(16) 88 00 00 00 00 02 6b b8 de 40 00 00 00 08 00 00
[ 1895.361209] print_req_error: I/O error, dev sdf, sector 10397212224

(Now waiting for scrubs to finish to figure out what I need to restore from backup. :/ )

da-tex commented 6 years ago

What are the consequences of setting GRUB_CMDLINE_LINUX_DEFAULT="mpt3sas.msix_disable=1"?

And is there something similar for the controller we use?

This is what /var/log/messages says when one of these errors occurs:

Apr 17 10:45:43 server kernel: [2065045.277260] pm80xx pm8001_mpi_task_abort_resp 3742:task abort failed status 0x6 ,tag = 0x1, scp= 0x0
Apr 17 10:45:43 server kernel: [2065045.277442] pm80xx pm8001_mpi_task_abort_resp 3742:task abort failed status 0x6 ,tag = 0x1, scp= 0x0
Apr 17 10:45:43 server kernel: [2065045.277453] pm80xx pm8001_abort_task 1231:rc= 5
Apr 17 10:45:43 server kernel: [2065045.285750] ata2: hard resetting link
Apr 17 10:45:43 server kernel: [2065045.445864] ata2.00: supports DRM functions and may not be fully accessible
Apr 17 10:45:43 server kernel: [2065045.446969] ata2.00: supports DRM functions and may not be fully accessible
Apr 17 10:45:43 server kernel: [2065045.446976] ata2.00: configured for UDMA/133
Apr 17 10:45:43 server kernel: [2065045.446993] sd 1:0:1:0: [sdg] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 17 10:45:43 server kernel: [2065045.446998] sd 1:0:1:0: [sdg] tag#1 Sense Key : Illegal Request [current] 
Apr 17 10:45:43 server kernel: [2065045.447001] sd 1:0:1:0: [sdg] tag#1 Add. Sense: Unaligned write command
Apr 17 10:45:43 server kernel: [2065045.447006] sd 1:0:1:0: [sdg] tag#1 CDB: Write(16) 8a 00 00 00 00 00 aa 20 4c c0 00 00 00 10 00 00
Apr 17 10:45:43 server kernel: [2065045.448629] ata2: EH complete

Can someone check whether this is the same issue or something else (maybe hardware related)?

satmandu commented 6 years ago

If it IS a driver issue, with the mpt2/3sas driver causing the I/O errors, it would be nice not to blame ZFS for the underlying problem.

(@da-tex, your error looks different from mine, but it definitely looks like a driver, HBA, or disk issue. I saw that @splitice was having an issue similar to mine and thought I'd mention my fix.)

FYI here is how to check your module options:

modinfo mpt3sas
filename:       /lib/modules/4.15.17-041517-generic/kernel/drivers/scsi/mpt3sas/mpt3sas.ko
alias:          mpt2sas
version:        17.100.00.00
license:        GPL
description:    LSI MPT Fusion SAS 3.0 Device Driver
author:         Avago Technologies <MPT-FusionLinux.pdl@avagotech.com>
srcversion:     A47B3D9EA19783E3ABF8C0C
alias:          pci:v00001000d000000D1sv*sd*bc*sc*i*
alias:          pci:v00001000d000000ACsv*sd*bc*sc*i*
alias:          pci:v00001000d000000ABsv*sd*bc*sc*i*
alias:          pci:v00001000d000000AAsv*sd*bc*sc*i*
alias:          pci:v00001000d000000AFsv*sd*bc*sc*i*
alias:          pci:v00001000d000000AEsv*sd*bc*sc*i*
alias:          pci:v00001000d000000ADsv*sd*bc*sc*i*
alias:          pci:v00001000d000000C3sv*sd*bc*sc*i*
alias:          pci:v00001000d000000C2sv*sd*bc*sc*i*
alias:          pci:v00001000d000000C1sv*sd*bc*sc*i*
alias:          pci:v00001000d000000C0sv*sd*bc*sc*i*
alias:          pci:v00001000d000000C8sv*sd*bc*sc*i*
alias:          pci:v00001000d000000C7sv*sd*bc*sc*i*
alias:          pci:v00001000d000000C6sv*sd*bc*sc*i*
alias:          pci:v00001000d000000C5sv*sd*bc*sc*i*
alias:          pci:v00001000d000000C4sv*sd*bc*sc*i*
alias:          pci:v00001000d000000C9sv*sd*bc*sc*i*
alias:          pci:v00001000d00000095sv*sd*bc*sc*i*
alias:          pci:v00001000d00000094sv*sd*bc*sc*i*
alias:          pci:v00001000d00000091sv*sd*bc*sc*i*
alias:          pci:v00001000d00000090sv*sd*bc*sc*i*
alias:          pci:v00001000d00000097sv*sd*bc*sc*i*
alias:          pci:v00001000d00000096sv*sd*bc*sc*i*
alias:          pci:v00001000d0000007Esv*sd*bc*sc*i*
alias:          pci:v00001000d0000006Esv*sd*bc*sc*i*
alias:          pci:v00001000d00000087sv*sd*bc*sc*i*
alias:          pci:v00001000d00000086sv*sd*bc*sc*i*
alias:          pci:v00001000d00000085sv*sd*bc*sc*i*
alias:          pci:v00001000d00000084sv*sd*bc*sc*i*
alias:          pci:v00001000d00000083sv*sd*bc*sc*i*
alias:          pci:v00001000d00000082sv*sd*bc*sc*i*
alias:          pci:v00001000d00000081sv*sd*bc*sc*i*
alias:          pci:v00001000d00000080sv*sd*bc*sc*i*
alias:          pci:v00001000d00000065sv*sd*bc*sc*i*
alias:          pci:v00001000d00000064sv*sd*bc*sc*i*
alias:          pci:v00001000d00000077sv*sd*bc*sc*i*
alias:          pci:v00001000d00000076sv*sd*bc*sc*i*
alias:          pci:v00001000d00000074sv*sd*bc*sc*i*
alias:          pci:v00001000d00000072sv*sd*bc*sc*i*
alias:          pci:v00001000d00000070sv*sd*bc*sc*i*
depends:        scsi_transport_sas,raid_class
retpoline:      Y
intree:         Y
name:           mpt3sas
vermagic:       4.15.17-041517-generic SMP mod_unload 
signat:         PKCS#7
signer:         
sig_key:        
sig_hashalgo:   md4
parm:           logging_level: bits for enabling additional logging info (default=0)
parm:           max_sectors:max sectors, range 64 to 32767  default=32767 (ushort)
parm:           missing_delay: device missing delay , io missing delay (array of int)
parm:           max_lun: max lun, default=16895  (ullong)
parm:           hbas_to_enumerate: 0 - enumerates both SAS 2.0 & SAS 3.0 generation HBAs
          1 - enumerates only SAS 2.0 generation HBAs
          2 - enumerates only SAS 3.0 generation HBAs (default=0) (ushort)
parm:           diag_buffer_enable: post diag buffers (TRACE=1/SNAPSHOT=2/EXTENDED=4/default=0) (int)
parm:           disable_discovery: disable discovery  (int)
parm:           prot_mask: host protection capabilities mask, def=7  (int)
parm:           max_queue_depth: max controller queue depth  (int)
parm:           max_sgl_entries: max sg entries  (int)
parm:           msix_disable: disable msix routed interrupts (default=0) (int)
parm:           smp_affinity_enable:SMP affinity feature enable/disbale Default: enable(1) (int)
parm:           max_msix_vectors: max msix vectors (int)
parm:           mpt3sas_fwfault_debug: enable detection of firmware fault and halt firmware - (default=0)

And what I see now from my boot:

dmesg | grep mpt2sas
[    1.243781] mpt2sas_cm0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (16309992 kB)
[    1.298625] mpt2sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[    1.299057] mpt2sas0: IO-APIC enabled: IRQ 16
[    1.299467] mpt2sas_cm0: iomem(0x00000000904c0000), mapped(0x00000000b9ab35ab), size(16384)
[    1.299895] mpt2sas_cm0: ioport(0x000000000000e000), size(256)
[    1.356072] mpt2sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[    1.394442] mpt2sas_cm0: Allocated physical memory: size(7579 kB)
[    1.394893] mpt2sas_cm0: Current Controller Queue Depth(3364),Max Controller Queue Depth(3432)
[    1.395348] mpt2sas_cm0: Scatter Gather Elements per IO(128)
[    1.443456] mpt2sas_cm0: LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.02.04.00)
[    1.443918] mpt2sas_cm0: Protocol=(
[    1.449685] mpt2sas_cm0: sending port enable !!
[    3.046658] mpt2sas_cm0: host_add: handle(0x0001), sas_addr(0x5000000080000000), phys(8)
[    8.932108] mpt2sas_cm0: port enable: SUCCESS
crimp42 commented 6 years ago

While I do not use ZFS, I have been having this issue since kernel 3.x (with 2.x kernels, zero issues of this nature).

After testing for several days now, I believe that adding

GRUB_CMDLINE_LINUX_DEFAULT="mpt3sas.msix_disable=1"

rebuilding the grub config, and rebooting has made this issue occur even less often than before, on about 12 drives that all go to sleep (and are all accessed at some point during the day, causing numerous wake-ups).

The problem is still not 100% gone, but going back through my logs from the last month, I would guess this issue now occurs at 1/10 the frequency it did before; on some drives it has not occurred at all in the last 4 days. I have a feeling it is tied to how long a drive takes to spin up and become ready after waking.

I have 2 servers running a 4.x kernel; so far I have only tried this on the one with the SAS3008 card. I will try it soon on my other server with the SAS2008; I assume I would then use mpt2sas.msix_disable.

Thanks to the person who found that; I had never come across that boot-time parameter in all my searching.
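For what it's worth, the same option can also be set via modprobe configuration instead of the kernel command line; a sketch (hypothetical file name):

```
# /etc/modprobe.d/mpt3sas.conf (hypothetical file name)
# Equivalent to mpt3sas.msix_disable=1 on the kernel command line.
# On kernels where mpt2sas is still a separate driver, the line would be
# "options mpt2sas msix_disable=1" instead.
options mpt3sas msix_disable=1
```

If the driver is loaded from the initramfs, the initramfs needs to be regenerated (e.g. update-initramfs -u on Debian/Ubuntu) for the option to take effect at boot.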


satmandu commented 6 years ago

@brianmduncan For what it's worth, I think this is partially due to the controller, or devices on the controller, being allowed to go into some sort of sleep mode and not waking up fast enough when the filesystem asks for data from a drive. As a hack, I also have smartd check the devices every so often, and I'm no longer getting the disk errors I was seeing over the past several months.
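A sketch of that smartd polling hack, assuming smartmontools on Debian/Ubuntu (the interval value is an assumption):

```
# /etc/smartd.conf -- monitor all drives; the regular polling doubles as
# the "keep checking the devices" hack described above.
DEVICESCAN -a -m root

# The polling interval is a daemon option, not a smartd.conf directive;
# on Debian/Ubuntu it can be set in /etc/default/smartmontools:
#   smartd_opts="--interval=600"
```

Note that smartd's -n standby directive would instead skip checks while a drive sleeps; omit it if the point is to touch the drives regularly.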

I don't think ZFS is to blame for these issues, but I do think ZFS isn't waiting long enough for a device to come out of sleep before declaring the data unavailable.

In any case, my issues appear to be resolved, and scrubs (thank you, ZFS) have confirmed that I have no errors after restoring from my backups.

red-scorp commented 6 years ago

@brianmduncan my experience is similar to yours:

crimp42 commented 6 years ago

Details I have noticed.

Only occurs in Kernel 3.x - 4.x (that I have tested and confirmed) With 2.x kernel, does NOT occur.

I have the same issues with LSI2008 as LSI3008.

I have heard from others that if you use either of the two based cards, if you use ANY SAS drives, you won't have this issue. Just SATA drives.

It seems to be a kernel level issue introduced somewhere in the 3.x kernel, and can impact anything relying on the back end storage. In this case ZFS.

Even though I don't use ZFS, this was the most active thread I could find when I started researching this issue around a year ago, so I started posting in it. The issue has been going on for years; you can find threads describing this behavior, and most of the time people point to hardware problems.

Just to be clear, these are the errors I get in my logs when a SATA drive wakes up while connected to either an LSI 2008 or LSI 3008; if I move the same drives to other adapters, I cannot reproduce this.

(And to be extra clear: now that I am using mpt3sas.msix_disable=1, the errors are much less frequent, but still there.)

Apr 7 11:03:33 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 6861885744
Apr 7 11:03:33 misc01 kernel: sd 0:0:7:0: [sdm] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 7 11:03:33 misc01 kernel: sd 0:0:7:0: [sdm] tag#6 Sense Key : Not Ready [current]
Apr 7 11:03:33 misc01 kernel: sd 0:0:7:0: [sdm] tag#6 Add. Sense: Logical unit not ready, initializing command required
Apr 7 11:03:33 misc01 kernel: sd 0:0:7:0: [sdm] tag#6 CDB: Read(16) 88 00 00 00 00 01 99 00 11 38 00 00 00 08 00 00

This is one of my servers over a one-week period BEFORE adding mpt3sas.msix_disable=1 (just grepping on blk_update_request I/O errors):

Apr 1 22:34:26 misc01 kernel: blk_update_request: I/O error, dev sdk, sector 244191520
Apr 4 15:27:36 misc01 kernel: blk_update_request: I/O error, dev sdi, sector 7325658456
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775304
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775312
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775320
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775328
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775336
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775344
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775352
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775360
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775368
Apr 6 01:10:20 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775376
Apr 6 01:49:45 misc01 kernel: blk_update_request: 8 callbacks suppressed
Apr 6 01:49:45 misc01 kernel: blk_update_request: I/O error, dev sdh, sector 734192576
Apr 6 12:08:15 misc01 kernel: blk_update_request: I/O error, dev sdi, sector 2700308768
Apr 6 16:34:48 misc01 kernel: blk_update_request: I/O error, dev sdg, sector 5688
Apr 7 11:03:33 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 6861885704
Apr 7 11:03:33 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 6861885712
Apr 7 11:03:33 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 6861885720
Apr 7 11:03:33 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 6861885728
Apr 7 11:03:33 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 6861885736
Apr 7 11:03:33 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 6861885744
Apr 7 11:03:33 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 6861885752
Apr 7 11:03:33 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 6861885760
Apr 8 22:32:05 misc01 kernel: blk_update_request: I/O error, dev sdk, sector 244191520

This is after putting it in place on April 16th (no errors on the 22nd or 23rd):

Apr 15 18:00:06 misc01 kernel: blk_update_request: I/O error, dev sdm, sector 14306775344
Apr 16 14:44:43 misc01 kernel: blk_update_request: I/O error, dev sdi, sector 1479191832
Apr 16 21:38:20 misc01 kernel: blk_update_request: I/O error, dev sdi, sector 2136
Apr 17 18:02:30 misc01 kernel: blk_update_request: I/O error, dev sdg, sector 2472
Apr 17 21:40:57 misc01 kernel: blk_update_request: I/O error, dev sdh, sector 734192576
Apr 20 01:48:09 misc01 kernel: blk_update_request: I/O error, dev sdh, sector 13936
Apr 21 07:44:36 misc01 kernel: blk_update_request: I/O error, dev sdl, sector 488379224
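Before/after comparisons like the two excerpts above are easier if the errors are tallied rather than eyeballed. A small sketch (it assumes only the standard syslog format shown above, nothing else from this thread):

```shell
# Count blk_update_request I/O errors per device and per day from a
# syslog excerpt on stdin; prints "count device month day", most
# frequent first.
count_io_errors() {
    awk '/blk_update_request: I\/O error/ {
        # Find the "dev sdX," token pair and strip the trailing comma.
        for (i = 1; i <= NF; i++)
            if ($i == "dev") { dev = $(i + 1); sub(/,$/, "", dev) }
        counts[dev " " $1 " " $2]++
    }
    END { for (k in counts) print counts[k], k }' | sort -rn
}
```

Typical use would be something like `grep blk_update_request /var/log/messages | count_io_errors`, run once on the week before a tweak such as mpt3sas.msix_disable=1 and once on the week after.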

On Mon, Apr 23, 2018 at 10:50 AM, Andriy Golovnya notifications@github.com wrote:

@brianmduncan https://github.com/brianmduncan my experience is similar to yours:

  • mpt3sas.msix_disable=1 did not help much
  • the bug appears only with the LSI 3008 controller; other controllers (HighPoint 2720, 4 x Marvell 9215) work fine with zfs
  • my LSI 3008 + 3X24R expander also loses SATA drives, which reappear later but are not available to zfs until a reboot
  • as @satmandu https://github.com/satmandu mentioned, it might be that zfs does not wait patiently enough for the drives, but I do not see why it would wait with other controllers -> SAS controller/driver issue
  • mpt3sas also has a missing_delay[2] parameter which might be worth trying here, but I could not find any information about the unit or acceptable values for it. Does anyone have any info on that?
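On the missing_delay question: when the module is loaded, the values the driver is currently using can at least be read back from sysfs. A sketch, with the caveat that the parameter's meaning (modinfo describes it as a device missing delay and an I/O missing delay) and its units are not documented in this thread, so treat any interpretation as an assumption:

```shell
# Inspect mpt3sas module parameters without rebooting. The directory is
# the standard /sys/module/<name>/parameters location; the files exist
# only while the module is loaded. An optional argument overrides the
# directory (useful for testing on a machine without the HBA).
show_mpt3sas_params() {
    dir="${1:-/sys/module/mpt3sas/parameters}"
    if [ -d "$dir" ]; then
        for p in missing_delay msix_disable; do
            [ -f "$dir/$p" ] && echo "$p=$(cat "$dir/$p")"
        done
    else
        echo "mpt3sas not loaded"
    fi
}
show_mpt3sas_params
```

Setting the parameter would then go through the usual mechanisms (an mpt3sas.missing_delay=... entry on the kernel command line, or an options line in /etc/modprobe.d/), but reading the current value and `modinfo -p mpt3sas` first seems prudent given the missing documentation.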


d-helios commented 6 years ago

I have heard from others that if you use either of the two based cards, if you use ANY SAS drives, you won't have this issue. Just SATA drives.

I have the same issue with SAS drives.

my configuration:
JBOD: 216BE2C-R741JBOD
HBA: LSI SAS 9300-8e (Symbios Logic SAS3008)
Drives: HUC101818CS4204, PX05SMB040

kernel parameters:

BOOT_IMAGE=/vmlinuz-4.15.0-23-generic root=UUID=4f30713c-5618-4c31-a051-97a9e5acee09 ro console=tty1 console=ttyS0,115200 dm_mod.use_blk_mq=y scsi_mod.use_blk_mq=y transparent_hugepage=never processor.max_cstate=1 udev.children-max=32 mpt3sas.msix_disable=1

dmesg:

[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 68:64.
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 67:48.
[Mon Jul  9 01:26:36 2018] sd 11:0:149:0: Attached scsi generic sg20 type 0
[Mon Jul  9 01:26:36 2018] sd 11:0:149:0: [sdw] 439541046 4096-byte logical blocks: (1.80 TB/1.64 TiB)
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 68:80.
[Mon Jul  9 01:26:36 2018] sd 11:0:149:0: [sdw] Write Protect is off
[Mon Jul  9 01:26:36 2018] sd 11:0:149:0: [sdw] Mode Sense: f7 00 10 08
[Mon Jul  9 01:26:36 2018] sd 11:0:149:0: [sdw] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 68:96.
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 68:112.
[Mon Jul  9 01:26:36 2018] scsi 11:0:150:0: Direct-Access     HGST     HUC101818CS4204  CDB0 PQ: 0 ANSI: 6
[Mon Jul  9 01:26:36 2018] scsi 11:0:150:0: SSP: handle(0x003b), sas_addr(0x5000cca02c6ad6ca), phy(36), device_name(0x5000cca02c6ad6cb)
[Mon Jul  9 01:26:36 2018] scsi 11:0:150:0: enclosure logical id (0x500304801f27a23f), slot(20)
[Mon Jul  9 01:26:36 2018] scsi 11:0:150:0: enclosure level(0x0000), connector name(     )
[Mon Jul  9 01:26:36 2018] scsi 11:0:150:0: Power-on or device reset occurred
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 68:128.
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 68:144.
[Mon Jul  9 01:26:36 2018] sd 11:0:150:0: Attached scsi generic sg21 type 0
[Mon Jul  9 01:26:36 2018] sd 11:0:150:0: [sdx] 439541046 4096-byte logical blocks: (1.80 TB/1.64 TiB)
[Mon Jul  9 01:26:36 2018] sd 11:0:147:0: [sdu] Attached SCSI disk
[Mon Jul  9 01:26:36 2018] sd 11:0:150:0: [sdx] Write Protect is off
[Mon Jul  9 01:26:36 2018] sd 11:0:150:0: [sdx] Mode Sense: f7 00 10 08
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 68:160.
[Mon Jul  9 01:26:36 2018] sd 11:0:150:0: [sdx] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 69:48.
[Mon Jul  9 01:26:36 2018] scsi 11:0:151:0: Direct-Access     TOSHIBA  PX05SMB040       0102 PQ: 0 ANSI: 6
[Mon Jul  9 01:26:36 2018] scsi 11:0:151:0: SSP: handle(0x003c), sas_addr(0x58ce38ee2012e4eb), phy(37), device_name(0x58ce38ee2012e4e8)
[Mon Jul  9 01:26:36 2018] scsi 11:0:151:0: enclosure logical id (0x500304801f27a23f), slot(21)
[Mon Jul  9 01:26:36 2018] scsi 11:0:151:0: enclosure level(0x0000), connector name(     )
[Mon Jul  9 01:26:36 2018] scsi 11:0:151:0: Power-on or device reset occurred
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 69:64.
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 69:80.
[Mon Jul  9 01:26:36 2018] sd 11:0:151:0: Attached scsi generic sg22 type 0
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 69:96.
[Mon Jul  9 01:26:36 2018] sd 11:0:151:0: [sdy] 97677846 4096-byte logical blocks: (400 GB/373 GiB)
[Mon Jul  9 01:26:36 2018] sd 11:0:151:0: [sdy] Write Protect is off
[Mon Jul  9 01:26:36 2018] sd 11:0:151:0: [sdy] Mode Sense: df 00 00 08
[Mon Jul  9 01:26:36 2018] device-mapper: multipath: Reinstating path 69:128.
[Mon Jul  9 01:26:36 2018] sd 11:0:151:0: [sdy] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[Mon Jul  9 01:26:36 2018] scsi 11:0:161:0: Power-on or device reset occurred
[Mon Jul  9 01:26:36 2018] sd 11:0:158:0: [sdae] Attached SCSI disk
[Mon Jul  9 01:26:36 2018] sd 11:0:157:0: [sdad] Attached SCSI disk
[Mon Jul  9 01:26:36 2018] sd 11:0:161:0: Attached scsi generic sg32 type 0
[Mon Jul  9 01:26:36 2018] sd 11:0:161:0: [sdah] 439541046 4096-byte logical blocks: (1.80 TB/1.64 TiB)
[Mon Jul  9 01:26:36 2018] sd 11:0:161:0: [sdah] Write Protect is off
[Mon Jul  9 01:26:36 2018] sd 11:0:161:0: [sdah] Mode Sense: f7 00 10 08
[Mon Jul  9 01:26:36 2018] sd 11:0:161:0: [sdah] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Mon Jul  9 01:26:36 2018] scsi 11:0:162:0: Direct-Access     HGST     HUC101818CS4204  CDB0 PQ: 0 ANSI: 6
[Mon Jul  9 01:26:36 2018] scsi 11:0:162:0: SSP: handle(0x0054), sas_addr(0x5000cca02c691d0a), phy(8), device_name(0x5000cca02c691d0b)
[Mon Jul  9 01:26:36 2018] scsi 11:0:162:0: enclosure logical id (0x500304801f279a3f), slot(8)
[Mon Jul  9 01:26:36 2018] scsi 11:0:162:0: enclosure level(0x0001), connector name(     )
[Mon Jul  9 01:26:36 2018] scsi 11:0:162:0: Power-on or device reset occurred
[Mon Jul  9 01:26:36 2018] sd 11:0:162:0: Attached scsi generic sg33 type 0
[Mon Jul  9 01:26:36 2018] sd 11:0:162:0: [sdai] 439541046 4096-byte logical blocks: (1.80 TB/1.64 TiB)
[Mon Jul  9 01:26:36 2018] sd 11:0:162:0: [sdai] Write Protect is off
[Mon Jul  9 01:26:36 2018] sd 11:0:162:0: [sdai] Mode Sense: f7 00 10 08
[Mon Jul  9 01:26:36 2018] sd 11:0:162:0: [sdai] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Mon Jul  9 01:26:36 2018] sd 11:0:159:0: [sdaf] Attached SCSI disk
[Mon Jul  9 01:26:36 2018] scsi 11:0:163:0: Direct-Access     HGST     HUC101818CS4204  CDB0 PQ: 0 ANSI: 6
[Mon Jul  9 01:26:36 2018] scsi 11:0:163:0: SSP: handle(0x0055), sas_addr(0x5000cca02c69381a), phy(9), device_name(0x5000cca02c69381b)
[Mon Jul  9 01:26:36 2018] scsi 11:0:163:0: enclosure logical id (0x500304801f279a3f), slot(9)
[Mon Jul  9 01:26:36 2018] scsi 11:0:163:0: enclosure level(0x0001), connector name(     )
[Mon Jul  9 01:26:36 2018] sd 11:0:160:0: [sdag] Attached SCSI disk
[Mon Jul  9 01:26:36 2018] sd 11:0:161:0: [sdah] Attached SCSI disk
[Mon Jul  9 01:26:36 2018] sd 11:0:163:0: Attached scsi generic sg34 type 0
[Mon Jul  9 01:26:36 2018] sd 11:0:163:0: [sdaj] 439541046 4096-byte logical blocks: (1.80 TB/1.64 TiB)
[Mon Jul  9 01:26:36 2018] sd 11:0:163:0: [sdaj] Write Protect is off
[Mon Jul  9 01:26:36 2018] sd 11:0:163:0: [sdaj] Mode Sense: f7 00 10 08
[Mon Jul  9 01:26:36 2018] sd 11:0:163:0: [sdaj] Write cache: enabled, read cache: enabled, supports DPO and FUA
[Mon Jul  9 01:26:36 2018] scsi 11:0:164:0: Direct-Access     HGST     HUC101818CS4204  CDB0 PQ: 0 ANSI: 6
[Mon Jul  9 01:26:36 2018] scsi 11:0:164:0: SSP: handle(0x0056), sas_addr(0x5000cca02c68c676), phy(10), device_name(0x5000cca02c68c677)
[Mon Jul  9 01:26:36 2018] scsi 11:0:164:0: enclosure logical id (0x500304801f279a3f), slot(10)
[Mon Jul  9 01:26:36 2018] scsi 11:0:164:0: enclosure level(0x0001), connector name(     )
[Mon Jul  9 10:52:56 2018] mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
[Mon Jul  9 10:52:56 2018] sd 11:0:199:0: [sdu] tag#5162 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Mon Jul  9 10:52:56 2018] sd 11:0:199:0: [sdu] tag#5162 CDB: Test Unit Ready 00 00 00 00 00 00
[Mon Jul  9 10:53:26 2018] sd 11:0:223:0: attempting task abort! scmd(0000000050fff28d)
[Mon Jul  9 10:53:26 2018] sd 11:0:223:0: [sdar] tag#5161 CDB: Test Unit Ready 00 00 00 00 00 00
[Mon Jul  9 10:53:26 2018] scsi target11:0:223: handle(0x006b), sas_address(0x5000cca02c693c3e), phy(36)
[Mon Jul  9 10:53:26 2018] scsi target11:0:223: enclosure logical id(0x500304801f279a3f), slot(20)
[Mon Jul  9 10:53:26 2018] scsi target11:0:223: enclosure level(0x0001), connector name(     )
[Mon Jul  9 10:53:26 2018] sd 11:0:223:0: task abort: SUCCESS scmd(0000000050fff28d)
[Mon Jul  9 11:00:03 2018] mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
[Mon Jul  9 11:00:03 2018] sd 11:0:186:0: [sdh] tag#5209 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Mon Jul  9 11:00:03 2018] sd 11:0:186:0: [sdh] tag#5209 CDB: Test Unit Ready 00 00 00 00 00 00
[Mon Jul  9 11:15:38 2018] mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
[Mon Jul  9 11:15:38 2018] sd 11:0:220:0: [sdao] tag#5382 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Mon Jul  9 11:15:38 2018] sd 11:0:220:0: [sdao] tag#5382 CDB: Test Unit Ready 00 00 00 00 00 00
[Mon Jul  9 11:39:26 2018] mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
[Mon Jul  9 11:39:26 2018] sd 11:0:226:0: [sdau] tag#5459 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Mon Jul  9 11:39:26 2018] sd 11:0:226:0: [sdau] tag#5459 CDB: Test Unit Ready 00 00 00 00 00 00
[Mon Jul  9 12:12:22 2018] mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
[Mon Jul  9 12:12:22 2018] sd 11:0:207:0: [sdab] tag#5642 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Mon Jul  9 12:12:22 2018] sd 11:0:207:0: [sdab] tag#5642 CDB: Test Unit Ready 00 00 00 00 00 00
[Mon Jul  9 12:14:02 2018] mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
[Mon Jul  9 12:14:02 2018] sd 11:0:199:0: [sdu] tag#5743 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Mon Jul  9 12:14:02 2018] sd 11:0:199:0: [sdu] tag#5743 CDB: Test Unit Ready 00 00 00 00 00 00

Note: I have the same configuration running on Solaris and it works fine. The only thing I changed there was adding a power-condition:false statement to sd.conf:


root# cat /etc/driver/drv/sd.conf
# File is managed by Ansible

name="sd" class="scsi" target=0 lun=0;
name="sd" class="scsi" target=1 lun=0;
name="sd" class="scsi" target=2 lun=0;
name="sd" class="scsi" target=3 lun=0;
name="sd" class="scsi" target=4 lun=0;
name="sd" class="scsi" target=5 lun=0;
name="sd" class="scsi" target=6 lun=0;
name="sd" class="scsi" target=7 lun=0;
name="sd" class="scsi" target=8 lun=0;
name="sd" class="scsi" target=9 lun=0;
name="sd" class="scsi" target=10 lun=0;
name="sd" class="scsi" target=11 lun=0;
name="sd" class="scsi" target=12 lun=0;
name="sd" class="scsi" target=13 lun=0;
name="sd" class="scsi" target=14 lun=0;
name="sd" class="scsi" target=15 lun=0;

name="sd" class="scsi-self-identifying";

# Associate the driver with devid resolution.
#
ddi-devid-registrant=1;

sd-config-list=
"HGST    HUH", "retries-timeout:1,retries-busy:1,retries-reset:1,retries-victim:2,physical-block-size:4096",
"HGST    HUS72", "retries-timeout:1,retries-busy:1,retries-reset:1,retries-victim:2,physical-block-size:4096",
"HGST    HUC10", "retries-timeout:1,retries-busy:1,retries-reset:1,retries-victim:2,physical-block-size:4096",
"HGST    HUC15", "retries-timeout:1,retries-busy:1,retries-reset:1,retries-victim:2,physical-block-size:4096",
"HGST    HUSMH", "retries-timeout:1,retries-busy:1,retries-reset:1,retries-victim:2,throttle-max:32,disksort:false,cache-nonvolatile:true,power-condition:false,physical-block-size:4096",
"HGST    HUSMM", "retries-timeout:1,retries-busy:1,retries-reset:1,retries-victim:2,throttle-max:32,disksort:false,cache-nonvolatile:true,power-condition:false,physical-block-size:4096",
"TOSHIBA PX", "retries-timeout:1,retries-busy:1,retries-reset:1,retries-victim:2,throttle-max:32,disksort:false,cache-nonvolatile:true,power-condition:false,physical-block-size:4096";

# Start of lines added by ha-cluster/system/core
sd_retry_on_reservation_conflict=0;
# End of lines added by ha-cluster/system/core
d-helios commented 6 years ago

Also, I use multipath for the SAS devices, and after a "Power-on or device reset occurred" event, device-mapper reports that one of the paths (or sometimes both) has failed.
After that, resilvering starts and zpool status -v reports:

errors: Permanent errors have been detected in the following files:

        /pool1/test/fs1/dKrka1d1/vdb.1_1.dir/vdb_f0027.file
        /pool1/test/fs1/dKrka1d1/vdb.1_1.dir/vdb_f0028.file

dmesg:

[ 6621.896780] sd 12:0:15:0: Power-on or device reset occurred
[ 6621.951166] device-mapper: multipath: Failing path 65:64.
[ 6621.981172] sd 12:0:16:0: Power-on or device reset occurred
[ 6622.011914] sd 12:0:19:0: Power-on or device reset occurred
[ 6622.065110] sd 12:0:17:0: Power-on or device reset occurred
[ 6622.226971] sd 12:0:18:0: Power-on or device reset occurred
[ 6622.404912] sd 12:0:21:0: Power-on or device reset occurred
[ 6622.655105] sd 12:0:26:0: Power-on or device reset occurred
[ 6645.320775] sd 12:0:0:0: attempting task abort! scmd(0000000078e06025)
[ 6645.320783] sd 12:0:0:0: [sdct] tag#6516 CDB: Read(10) 28 00 1a 32 dd 15 00 00 01 00
[ 6645.320788] scsi target12:0:0: handle(0x000a), sas_address(0x5000cca02c6aa6ed), phy(0)
[ 6645.320792] scsi target12:0:0: enclosure logical id(0x500304801f27a23f), slot(0)
[ 6645.320795] scsi target12:0:0: enclosure level(0x0000), connector name(     )
[ 6645.324324] sd 12:0:0:0: task abort: SUCCESS scmd(0000000078e06025)
[ 6645.337023] sd 12:0:0:0: Power-on or device reset occurred
[ 6645.544779] sd 12:0:29:0: attempting task abort! scmd(0000000039caa0d4)
[ 6645.544788] sd 12:0:29:0: [sddv] tag#4721 CDB: Read(10) 28 00 05 7a e9 2c 00 00 01 00
[ 6645.544793] scsi target12:0:29: handle(0x0029), sas_address(0x5000cca02c695e05), phy(4)
[ 6645.544797] scsi target12:0:29: enclosure logical id(0x500304801f279a3f), slot(4)
[ 6645.544800] scsi target12:0:29: enclosure level(0x0001), connector name(     )
[ 6645.548905] sd 12:0:29:0: task abort: SUCCESS scmd(0000000039caa0d4)
[ 6645.548918] sd 12:0:29:0: [sddv] tag#4721 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 6645.548922] sd 12:0:29:0: [sddv] tag#4721 CDB: Read(10) 28 00 05 7a e9 2c 00 00 01 00
[ 6645.548925] print_req_error: I/O error, dev sddv, sector 735529312
[ 6645.556913] device-mapper: multipath: Failing path 71:208.
[ 6645.557008] device-mapper: multipath: Failing path 65:208.
[ 6645.660769] print_req_error: I/O error, dev dm-35, sector 735529312
[ 6645.660837] print_req_error: I/O error, dev dm-35, sector 528
[ 6645.660861] print_req_error: I/O error, dev dm-35, sector 3516326928
[ 6645.660870] print_req_error: I/O error, dev dm-35, sector 3516327440
[ 6646.216764] sd 12:0:27:0: attempting task abort! scmd(0000000086b1e564)
[ 6646.216770] sd 12:0:27:0: [sddt] tag#3214 CDB: Write(10) 2a 00 05 7a d1 ff 00 00 01 00
[ 6646.216775] scsi target12:0:27: handle(0x0027), sas_address(0x5000cca02c695a99), phy(2)
[ 6646.216779] scsi target12:0:27: enclosure logical id(0x500304801f279a3f), slot(2)
[ 6646.216782] scsi target12:0:27: enclosure level(0x0001), connector name(     )
[ 6646.220839] sd 12:0:27:0: task abort: SUCCESS scmd(0000000086b1e564)
[ 6646.220850] sd 12:0:27:0: [sddt] tag#3214 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 6646.220855] sd 12:0:27:0: [sddt] tag#3214 CDB: Write(10) 2a 00 05 7a d1 ff 00 00 01 00
[ 6646.220858] print_req_error: I/O error, dev sddt, sector 735481848
[ 6646.228623] sd 12:0:32:0: attempting task abort! scmd(000000006f25d237)
[ 6646.228628] sd 12:0:32:0: [sddy] tag#4092 CDB: Write(10) 2a 00 05 7a d1 dd 00 00 01 00
[ 6646.228633] scsi target12:0:32: handle(0x002c), sas_address(0x5000cca02c693ff5), phy(7)
[ 6646.228636] scsi target12:0:32: enclosure logical id(0x500304801f279a3f), slot(7)
[ 6646.228639] scsi target12:0:32: enclosure level(0x0001), connector name(     )
[ 6646.228659] device-mapper: multipath: Failing path 71:176.
[ 6646.232565] sd 12:0:32:0: task abort: SUCCESS scmd(000000006f25d237)
[ 6646.232572] sd 12:0:32:0: [sddy] tag#4092 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 6646.232575] sd 12:0:32:0: [sddy] tag#4092 CDB: Write(10) 2a 00 05 7a d1 dd 00 00 01 00
[ 6646.232576] print_req_error: I/O error, dev sddy, sector 735481576
[ 6646.239932] sd 12:0:27:0: attempting task abort! scmd(000000007e03c3b8)
[ 6646.239935] sd 12:0:27:0: [sddt] tag#1235 CDB: Write(10) 2a 00 05 7a d1 f8 00 00 01 00
[ 6646.239945] scsi target12:0:27: handle(0x0027), sas_address(0x5000cca02c695a99), phy(2)
[ 6646.239951] device-mapper: multipath: Failing path 128:0.
[ 6646.239959] scsi target12:0:27: enclosure logical id(0x500304801f279a3f), slot(2)
[ 6646.239973] scsi target12:0:27: enclosure level(0x0001), connector name(     )
[ 6646.239982] print_req_error: I/O error, dev dm-38, sector 735481576
[ 6646.243490] sd 12:0:27:0: task abort: SUCCESS scmd(000000007e03c3b8)
[ 6646.243497] sd 12:0:27:0: [sddt] tag#1235 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 6646.243499] sd 12:0:27:0: [sddt] tag#1235 CDB: Write(10) 2a 00 05 7a d1 f8 00 00 01 00
[ 6646.243500] print_req_error: I/O error, dev sddt, sector 735481792
[ 6646.243510] sd 12:0:32:0: attempting task abort! scmd(00000000f3a95c3d)
[ 6646.243513] sd 12:0:32:0: [sddy] tag#1236 CDB: Write(10) 2a 00 05 7a d1 d9 00 00 01 00
[ 6646.243515] scsi target12:0:32: handle(0x002c), sas_address(0x5000cca02c693ff5), phy(7)
[ 6646.243516] scsi target12:0:32: enclosure logical id(0x500304801f279a3f), slot(7)
[ 6646.243518] scsi target12:0:32: enclosure level(0x0001), connector name(     )
[ 6646.246892] sd 12:0:32:0: task abort: SUCCESS scmd(00000000f3a95c3d)
[ 6646.246900] sd 12:0:32:0: [sddy] tag#1236 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 6646.246904] sd 12:0:32:0: [sddy] tag#1236 CDB: Write(10) 2a 00 05 7a d1 d9 00 00 01 00
[ 6646.246906] print_req_error: I/O error, dev sddy, sector 735481544
[ 6646.247802] sd 12:0:31:0: Power-on or device reset occurred
[ 6646.259016] Buffer I/O error on dev dm-38, logical block 439541044, async page read
[ 6646.261364] Buffer I/O error on dev dm-38, logical block 439541044, async page read
[ 6646.261397] Buffer I/O error on dev dm-38, logical block 1, async page read
[ 6646.406037] sd 12:0:40:0: Power-on or device reset occurred
[ 6646.406077] sd 12:0:42:0: Power-on or device reset occurred
[ 6646.406260] sd 12:0:43:0: Power-on or device reset occurred
[ 6646.406291] sd 12:0:44:0: Power-on or device reset occurred
[ 6646.406330] sd 12:0:36:0: Power-on or device reset occurred
[ 6646.406332] sd 12:0:35:0: Power-on or device reset occurred
[ 6646.406351] sd 12:0:37:0: Power-on or device reset occurred
[ 6646.406362] sd 12:0:39:0: Power-on or device reset occurred
[ 6646.406366] sd 12:0:38:0: Power-on or device reset occurred
[ 6646.480633] device-mapper: multipath: Failing path 66:80.
[ 6646.480826] device-mapper: multipath: Failing path 66:16.
[ 6646.480967] device-mapper: multipath: Reinstating path 128:0.
[ 6646.572419] sd 12:0:29:0: Power-on or device reset occurred
[ 6647.536395] device-mapper: multipath: Failing path 66:160.
[ 6647.536562] device-mapper: multipath: Failing path 66:176.
[ 6647.536729] device-mapper: multipath: Failing path 66:192.
[ 6647.536876] device-mapper: multipath: Failing path 66:208.
[ 6647.537024] device-mapper: multipath: Failing path 66:224.
[ 6647.537159] device-mapper: multipath: Failing path 66:240.
[ 6647.537295] device-mapper: multipath: Failing path 67:0.
[ 6647.555543] sd 12:0:45:0: Power-on or device reset occurred
[ 6647.570652] sd 12:0:46:0: Power-on or device reset occurred
[ 6647.571783] sd 12:0:47:0: Power-on or device reset occurred
[ 6647.572823] sd 12:0:48:0: Power-on or device reset occurred
[ 6767.291099] INFO: task systemd-udevd:44154 blocked for more than 120 seconds.
[ 6767.300099]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6767.308202] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6767.317407] systemd-udevd   D    0 44154    975 0x00000100
[ 6767.317412] Call Trace:
[ 6767.317429]  __schedule+0x297/0x8b0
[ 6767.317435]  ? get_work_pool+0x40/0x40
[ 6767.317438]  schedule+0x2c/0x80
[ 6767.317446]  schedule_timeout+0x1cf/0x350
[ 6767.317463]  ? sched_clock+0x9/0x10
[ 6767.317470]  ? sched_clock+0x9/0x10
[ 6767.317481]  ? sched_clock_cpu+0x11/0xb0
[ 6767.317483]  ? get_work_pool+0x40/0x40
[ 6767.317485]  wait_for_completion+0xba/0x140
[ 6767.317487]  ? wake_up_q+0x80/0x80
[ 6767.317491]  flush_work+0x126/0x1e0
[ 6767.317493]  ? worker_detach_from_pool+0xa0/0xa0
[ 6767.317496]  __cancel_work_timer+0x131/0x1b0
[ 6767.317503]  ? exact_lock+0x11/0x20
[ 6767.317506]  cancel_delayed_work_sync+0x13/0x20
[ 6767.317508]  disk_block_events+0x78/0x80
[ 6767.317514]  __blkdev_get+0x69/0x4c0
[ 6767.317519]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6767.317521]  blkdev_get+0x129/0x320
[ 6767.317524]  blkdev_open+0x95/0xf0
[ 6767.317530]  do_dentry_open+0x1c2/0x310
[ 6767.317539]  ? __inode_permission+0x5b/0x160
[ 6767.317558]  ? bd_acquire+0xd0/0xd0
[ 6767.317566]  vfs_open+0x4f/0x80
[ 6767.317573]  path_openat+0x66e/0x1770
[ 6767.317582]  ? _copy_to_user+0x26/0x30
[ 6767.317590]  ? move_addr_to_user+0xc0/0xe0
[ 6767.317597]  do_filp_open+0x9b/0x110
[ 6767.317605]  ? __check_object_size+0xaf/0x1b0
[ 6767.317615]  ? __alloc_fd+0x46/0x170
[ 6767.317617]  do_sys_open+0x1bb/0x2c0
[ 6767.317619]  ? do_sys_open+0x1bb/0x2c0
[ 6767.317621]  SyS_openat+0x14/0x20
[ 6767.317625]  do_syscall_64+0x73/0x130
[ 6767.317627]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6767.317630] RIP: 0033:0x7f58aabf3c8e
[ 6767.317631] RSP: 002b:00007ffcb2fdfa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6767.317633] RAX: ffffffffffffffda RBX: 000055a3414dd820 RCX: 00007f58aabf3c8e
[ 6767.317634] RDX: 00000000000a0800 RSI: 000055a3414fac60 RDI: 00000000ffffff9c
[ 6767.317634] RBP: 000000000000000e R08: 000055a33fda1ba6 R09: 000055a3414e59c0
[ 6767.317635] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6767.317636] R13: 000055a3414e22b0 R14: 000055a3414e22b0 R15: 00007ffcb2fdfb5c
[ 6767.317639] INFO: task systemd-udevd:44169 blocked for more than 120 seconds.
[ 6767.325975]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6767.333924] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6767.343047] systemd-udevd   D    0 44169    975 0x00000100
[ 6767.343049] Call Trace:
[ 6767.343053]  __schedule+0x297/0x8b0
[ 6767.343055]  ? get_work_pool+0x40/0x40
[ 6767.343057]  schedule+0x2c/0x80
[ 6767.343058]  schedule_timeout+0x1cf/0x350
[ 6767.343064]  ? ttwu_do_activate+0x7a/0x90
[ 6767.343065]  ? get_work_pool+0x40/0x40
[ 6767.343066]  wait_for_completion+0xba/0x140
[ 6767.343068]  ? wake_up_q+0x80/0x80
[ 6767.343069]  flush_work+0x126/0x1e0
[ 6767.343071]  ? worker_detach_from_pool+0xa0/0xa0
[ 6767.343072]  __cancel_work_timer+0x131/0x1b0
[ 6767.343075]  ? exact_lock+0x11/0x20
[ 6767.343077]  cancel_delayed_work_sync+0x13/0x20
[ 6767.343079]  disk_block_events+0x78/0x80
[ 6767.343081]  __blkdev_get+0x69/0x4c0
[ 6767.343082]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6767.343084]  blkdev_get+0x129/0x320
[ 6767.343086]  blkdev_open+0x95/0xf0
[ 6767.343088]  do_dentry_open+0x1c2/0x310
[ 6767.343092]  ? __inode_permission+0x5b/0x160
[ 6767.343093]  ? bd_acquire+0xd0/0xd0
[ 6767.343095]  vfs_open+0x4f/0x80
[ 6767.343096]  path_openat+0x66e/0x1770
[ 6767.343099]  ? _copy_to_user+0x26/0x30
[ 6767.343100]  ? move_addr_to_user+0xc0/0xe0
[ 6767.343103]  do_filp_open+0x9b/0x110
[ 6767.343105]  ? __check_object_size+0xaf/0x1b0
[ 6767.343107]  ? __alloc_fd+0x46/0x170
[ 6767.343109]  do_sys_open+0x1bb/0x2c0
[ 6767.343111]  ? do_sys_open+0x1bb/0x2c0
[ 6767.343113]  SyS_openat+0x14/0x20
[ 6767.343115]  do_syscall_64+0x73/0x130
[ 6767.343117]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6767.343118] RIP: 0033:0x7f58aabf3c8e
[ 6767.343118] RSP: 002b:00007ffcb2fdfa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6767.343120] RAX: ffffffffffffffda RBX: 000055a3414e1b50 RCX: 00007f58aabf3c8e
[ 6767.343121] RDX: 00000000000a0800 RSI: 000055a3414e4f50 RDI: 00000000ffffff9c
[ 6767.343122] RBP: 000000000000000e R08: 000055a33fda1ba6 R09: 000055a341674da0
[ 6767.343123] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6767.343124] R13: 000055a3414f74e0 R14: 000055a3414f74e0 R15: 00007ffcb2fdfb5c
[ 6767.343126] INFO: task systemd-udevd:44275 blocked for more than 120 seconds.
[ 6767.351476]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6767.359439] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6767.368582] systemd-udevd   D    0 44275    975 0x00000100
[ 6767.368584] Call Trace:
[ 6767.368588]  __schedule+0x297/0x8b0
[ 6767.368589]  ? get_work_pool+0x40/0x40
[ 6767.368591]  schedule+0x2c/0x80
[ 6767.368593]  schedule_timeout+0x1cf/0x350
[ 6767.368595]  ? ttwu_do_activate+0x7a/0x90
[ 6767.368597]  ? get_work_pool+0x40/0x40
[ 6767.368598]  wait_for_completion+0xba/0x140
[ 6767.368599]  ? wake_up_q+0x80/0x80
[ 6767.368601]  flush_work+0x126/0x1e0
[ 6767.368602]  ? worker_detach_from_pool+0xa0/0xa0
[ 6767.368604]  __cancel_work_timer+0x131/0x1b0
[ 6767.368606]  ? exact_lock+0x11/0x20
[ 6767.368608]  cancel_delayed_work_sync+0x13/0x20
[ 6767.368610]  disk_block_events+0x78/0x80
[ 6767.368612]  __blkdev_get+0x69/0x4c0
[ 6767.368618]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6767.368625]  blkdev_get+0x129/0x320
[ 6767.368634]  blkdev_open+0x95/0xf0
[ 6767.368641]  do_dentry_open+0x1c2/0x310
[ 6767.368646]  ? __inode_permission+0x5b/0x160
[ 6767.368651]  ? bd_acquire+0xd0/0xd0
[ 6767.368657]  vfs_open+0x4f/0x80
[ 6767.368664]  path_openat+0x66e/0x1770
[ 6767.368671]  ? _copy_to_user+0x26/0x30
[ 6767.368678]  ? move_addr_to_user+0xc0/0xe0
[ 6767.368682]  do_filp_open+0x9b/0x110
[ 6767.368684]  ? __check_object_size+0xaf/0x1b0
[ 6767.368687]  ? __alloc_fd+0x46/0x170
[ 6767.368688]  do_sys_open+0x1bb/0x2c0
[ 6767.368690]  ? do_sys_open+0x1bb/0x2c0
[ 6767.368692]  SyS_openat+0x14/0x20
[ 6767.368694]  do_syscall_64+0x73/0x130
[ 6767.368696]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6767.368697] RIP: 0033:0x7f58aabf3c8e
[ 6767.368697] RSP: 002b:00007ffcb2fdfa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6767.368699] RAX: ffffffffffffffda RBX: 000055a3414e1790 RCX: 00007f58aabf3c8e
[ 6767.368700] RDX: 00000000000a0800 RSI: 000055a3414f5270 RDI: 00000000ffffff9c
[ 6767.368700] RBP: 000000000000000e R08: 000055a33fda1ba6 R09: 000055a3414e2aa0
[ 6767.368701] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6767.368702] R13: 000055a3414f74e0 R14: 000055a3414f74e0 R15: 00007ffcb2fdfb5c
[ 6767.368704] INFO: task systemd-udevd:44277 blocked for more than 120 seconds.
[ 6767.377075]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6767.385090] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6767.394267] systemd-udevd   D    0 44277    975 0x00000100
[ 6767.394268] Call Trace:
[ 6767.394272]  __schedule+0x297/0x8b0
[ 6767.394274]  ? get_work_pool+0x40/0x40
[ 6767.394276]  schedule+0x2c/0x80
[ 6767.394277]  schedule_timeout+0x1cf/0x350
[ 6767.394279]  ? sched_clock+0x9/0x10
[ 6767.394283]  ? select_idle_sibling+0x29/0x410
[ 6767.394285]  ? sched_clock+0x9/0x10
[ 6767.394286]  ? sched_clock_cpu+0x11/0xb0
[ 6767.394287]  ? get_work_pool+0x40/0x40
[ 6767.394288]  wait_for_completion+0xba/0x140
[ 6767.394290]  ? wake_up_q+0x80/0x80
[ 6767.394291]  flush_work+0x126/0x1e0
[ 6767.394293]  ? worker_detach_from_pool+0xa0/0xa0
[ 6767.394294]  __cancel_work_timer+0x131/0x1b0
[ 6767.394297]  ? exact_lock+0x11/0x20
[ 6767.394303]  cancel_delayed_work_sync+0x13/0x20
[ 6767.394310]  disk_block_events+0x78/0x80
[ 6767.394320]  __blkdev_get+0x69/0x4c0
[ 6767.394325]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6767.394331]  blkdev_get+0x129/0x320
[ 6767.394338]  blkdev_open+0x95/0xf0
[ 6767.394344]  do_dentry_open+0x1c2/0x310
[ 6767.394349]  ? __inode_permission+0x5b/0x160
[ 6767.394354]  ? bd_acquire+0xd0/0xd0
[ 6767.394361]  vfs_open+0x4f/0x80
[ 6767.394366]  path_openat+0x66e/0x1770
[ 6767.394368]  ? _copy_to_user+0x26/0x30
[ 6767.394370]  ? move_addr_to_user+0xc0/0xe0
[ 6767.394372]  do_filp_open+0x9b/0x110
[ 6767.394374]  ? __check_object_size+0xaf/0x1b0
[ 6767.394376]  ? __alloc_fd+0x46/0x170
[ 6767.394378]  do_sys_open+0x1bb/0x2c0
[ 6767.394379]  ? do_sys_open+0x1bb/0x2c0
[ 6767.394382]  SyS_openat+0x14/0x20
[ 6767.394384]  do_syscall_64+0x73/0x130
[ 6767.394385]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6767.394386] RIP: 0033:0x7f58aabf3c8e
[ 6767.394387] RSP: 002b:00007ffcb2fdfa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6767.394388] RAX: ffffffffffffffda RBX: 000055a3414e11b0 RCX: 00007f58aabf3c8e
[ 6767.394389] RDX: 00000000000a0800 RSI: 000055a3415661b0 RDI: 00000000ffffff9c
[ 6767.394390] RBP: 000000000000000e R08: 000055a33fda1ba6 R09: 000055a3414dc010
[ 6767.394390] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6767.394391] R13: 000055a34153eed0 R14: 000055a34153eed0 R15: 00007ffcb2fdfb5c
[ 6767.394393] INFO: task systemd-udevd:44294 blocked for more than 120 seconds.
[ 6767.402806]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6767.410843] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6767.420056] systemd-udevd   D    0 44294    975 0x00000100
[ 6767.420058] Call Trace:
[ 6767.420062]  __schedule+0x297/0x8b0
[ 6767.420064]  ? __update_load_avg_se.isra.38+0x1c0/0x1d0
[ 6767.420065]  ? get_work_pool+0x40/0x40
[ 6767.420067]  schedule+0x2c/0x80
[ 6767.420068]  schedule_timeout+0x1cf/0x350
[ 6767.420071]  ? sched_clock+0x9/0x10
[ 6767.420072]  ? sched_clock+0x9/0x10
[ 6767.420074]  ? sched_clock_cpu+0x11/0xb0
[ 6767.420075]  ? get_work_pool+0x40/0x40
[ 6767.420076]  wait_for_completion+0xba/0x140
[ 6767.420077]  ? wake_up_q+0x80/0x80
[ 6767.420079]  flush_work+0x126/0x1e0
[ 6767.420080]  ? worker_detach_from_pool+0xa0/0xa0
[ 6767.420082]  __cancel_work_timer+0x131/0x1b0
[ 6767.420084]  ? exact_lock+0x11/0x20
[ 6767.420085]  cancel_delayed_work_sync+0x13/0x20
[ 6767.420087]  disk_block_events+0x78/0x80
[ 6767.420094]  __blkdev_get+0x69/0x4c0
[ 6767.420099]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6767.420108]  blkdev_get+0x129/0x320
[ 6767.420115]  blkdev_open+0x95/0xf0
[ 6767.420121]  do_dentry_open+0x1c2/0x310
[ 6767.420126]  ? __inode_permission+0x5b/0x160
[ 6767.420132]  ? bd_acquire+0xd0/0xd0
[ 6767.420139]  vfs_open+0x4f/0x80
[ 6767.420145]  path_openat+0x66e/0x1770
[ 6767.420152]  ? _copy_to_user+0x26/0x30
[ 6767.420158]  ? move_addr_to_user+0xc0/0xe0
[ 6767.420164]  do_filp_open+0x9b/0x110
[ 6767.420166]  ? __check_object_size+0xaf/0x1b0
[ 6767.420168]  ? __alloc_fd+0x46/0x170
[ 6767.420169]  do_sys_open+0x1bb/0x2c0
[ 6767.420171]  ? do_sys_open+0x1bb/0x2c0
[ 6767.420173]  SyS_openat+0x14/0x20
[ 6767.420175]  do_syscall_64+0x73/0x130
[ 6767.420176]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6767.420177] RIP: 0033:0x7f58aabf3c8e
[ 6767.420178] RSP: 002b:00007ffcb2fdfa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6767.420179] RAX: ffffffffffffffda RBX: 000055a3414f4280 RCX: 00007f58aabf3c8e
[ 6767.420180] RDX: 00000000000a0800 RSI: 000055a3415514e0 RDI: 00000000ffffff9c
[ 6767.420181] RBP: 000000000000000e R08: 000055a33fda1ba6 R09: 000055a3414f84b0
[ 6767.420181] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6767.420182] R13: 000055a3414f3690 R14: 000055a3414f3690 R15: 00007ffcb2fdfb5c
[ 6767.420185] INFO: task systemd-udevd:44467 blocked for more than 120 seconds.
[ 6767.428631]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6767.436698] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6767.445966] systemd-udevd   D    0 44467    975 0x00000100
[ 6767.445968] Call Trace:
[ 6767.445971]  __schedule+0x297/0x8b0
[ 6767.445973]  ? get_work_pool+0x40/0x40
[ 6767.445975]  schedule+0x2c/0x80
[ 6767.445976]  schedule_timeout+0x1cf/0x350
[ 6767.445978]  ? ttwu_do_activate+0x7a/0x90
[ 6767.445980]  ? get_work_pool+0x40/0x40
[ 6767.445981]  wait_for_completion+0xba/0x140
[ 6767.445982]  ? wake_up_q+0x80/0x80
[ 6767.445983]  flush_work+0x126/0x1e0
[ 6767.445985]  ? worker_detach_from_pool+0xa0/0xa0
[ 6767.445986]  __cancel_work_timer+0x131/0x1b0
[ 6767.445989]  ? exact_lock+0x11/0x20
[ 6767.445990]  cancel_delayed_work_sync+0x13/0x20
[ 6767.445992]  disk_block_events+0x78/0x80
[ 6767.445994]  __blkdev_get+0x69/0x4c0
[ 6767.445996]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6767.445997]  blkdev_get+0x129/0x320
[ 6767.445999]  blkdev_open+0x95/0xf0
[ 6767.446001]  do_dentry_open+0x1c2/0x310
[ 6767.446002]  ? __inode_permission+0x5b/0x160
[ 6767.446004]  ? bd_acquire+0xd0/0xd0
[ 6767.446005]  vfs_open+0x4f/0x80
[ 6767.446007]  path_openat+0x66e/0x1770
[ 6767.446009]  ? _copy_to_user+0x26/0x30
[ 6767.446011]  ? move_addr_to_user+0xc0/0xe0
[ 6767.446013]  do_filp_open+0x9b/0x110
[ 6767.446018]  ? __check_object_size+0xaf/0x1b0
[ 6767.446020]  ? __alloc_fd+0x46/0x170
[ 6767.446022]  do_sys_open+0x1bb/0x2c0
[ 6767.446023]  ? do_sys_open+0x1bb/0x2c0
[ 6767.446026]  SyS_openat+0x14/0x20
[ 6767.446027]  do_syscall_64+0x73/0x130
[ 6767.446029]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6767.446030] RIP: 0033:0x7f58aabf3c8e
[ 6767.446031] RSP: 002b:00007ffcb2fdfa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6767.446032] RAX: ffffffffffffffda RBX: 000055a3414f4280 RCX: 00007f58aabf3c8e
[ 6767.446033] RDX: 00000000000a0800 RSI: 000055a3414f2f90 RDI: 00000000ffffff9c
[ 6767.446033] RBP: 000000000000000e R08: 000055a33fda1ba6 R09: 000055a3414f5e70
[ 6767.446034] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6767.446035] R13: 000055a3414f3690 R14: 000055a3414f3690 R15: 00007ffcb2fdfb5c
[ 6767.446041] INFO: task systemd-udevd:44740 blocked for more than 120 seconds.
[ 6767.454524]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6767.462630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6767.471915] systemd-udevd   D    0 44740    975 0x00000100
[ 6767.471917] Call Trace:
[ 6767.471921]  __schedule+0x297/0x8b0
[ 6767.471922]  ? get_work_pool+0x40/0x40
[ 6767.471924]  schedule+0x2c/0x80
[ 6767.471925]  schedule_timeout+0x1cf/0x350
[ 6767.471928]  ? ttwu_do_activate+0x7a/0x90
[ 6767.471929]  ? get_work_pool+0x40/0x40
[ 6767.471931]  wait_for_completion+0xba/0x140
[ 6767.471932]  ? wake_up_q+0x80/0x80
[ 6767.471933]  flush_work+0x126/0x1e0
[ 6767.471935]  ? worker_detach_from_pool+0xa0/0xa0
[ 6767.471936]  __cancel_work_timer+0x131/0x1b0
[ 6767.471938]  ? exact_lock+0x11/0x20
[ 6767.471940]  cancel_delayed_work_sync+0x13/0x20
[ 6767.471942]  disk_block_events+0x78/0x80
[ 6767.471944]  __blkdev_get+0x69/0x4c0
[ 6767.471945]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6767.471947]  blkdev_get+0x129/0x320
[ 6767.471949]  blkdev_open+0x95/0xf0
[ 6767.471951]  do_dentry_open+0x1c2/0x310
[ 6767.471952]  ? __inode_permission+0x5b/0x160
[ 6767.471954]  ? bd_acquire+0xd0/0xd0
[ 6767.471956]  vfs_open+0x4f/0x80
[ 6767.471957]  path_openat+0x66e/0x1770
[ 6767.471960]  ? _copy_to_user+0x26/0x30
[ 6767.471961]  ? move_addr_to_user+0xc0/0xe0
[ 6767.471963]  do_filp_open+0x9b/0x110
[ 6767.471969]  ? __check_object_size+0xaf/0x1b0
[ 6767.471971]  ? __alloc_fd+0x46/0x170
[ 6767.471973]  do_sys_open+0x1bb/0x2c0
[ 6767.471975]  ? do_sys_open+0x1bb/0x2c0
[ 6767.471977]  SyS_openat+0x14/0x20
[ 6767.471979]  do_syscall_64+0x73/0x130
[ 6767.471980]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6767.471981] RIP: 0033:0x7f58aabf3c8e
[ 6767.471982] RSP: 002b:00007ffcb2fdfa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6767.471983] RAX: ffffffffffffffda RBX: 000055a3414e11b0 RCX: 00007f58aabf3c8e
[ 6767.471984] RDX: 00000000000a0800 RSI: 000055a3414e18a0 RDI: 00000000ffffff9c
[ 6767.471985] RBP: 000000000000000e R08: 000055a33fda1ba6 R09: 000055a341551500
[ 6767.471986] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6767.471987] R13: 000055a3414e21d0 R14: 000055a3414e21d0 R15: 00007ffcb2fdfb5c
[ 6888.121045] INFO: task multipathd:5022 blocked for more than 120 seconds.
[ 6888.130085]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6888.138358] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6888.147772] multipathd      D    0  5022      1 0x00000000
[ 6888.147777] Call Trace:
[ 6888.147793]  __schedule+0x297/0x8b0
[ 6888.147795]  schedule+0x2c/0x80
[ 6888.147798]  schedule_preempt_disabled+0xe/0x10
[ 6888.147800]  __mutex_lock.isra.2+0x18c/0x4d0
[ 6888.147811]  ? __slab_free+0x14d/0x2c0
[ 6888.147823]  ? get_disk+0x40/0x70
[ 6888.147834]  __mutex_lock_slowpath+0x13/0x20
[ 6888.147839]  ? __mutex_lock_slowpath+0x13/0x20
[ 6888.147843]  mutex_lock+0x2f/0x40
[ 6888.147846]  disk_block_events+0x31/0x80
[ 6888.147851]  __blkdev_get+0x69/0x4c0
[ 6888.147855]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6888.147858]  blkdev_get+0x129/0x320
[ 6888.147860]  blkdev_open+0x95/0xf0
[ 6888.147865]  do_dentry_open+0x1c2/0x310
[ 6888.147866]  ? __inode_permission+0x5b/0x160
[ 6888.147868]  ? bd_acquire+0xd0/0xd0
[ 6888.147870]  vfs_open+0x4f/0x80
[ 6888.147872]  path_openat+0x66e/0x1770
[ 6888.147878]  ? vsnprintf+0xf0/0x4e0
[ 6888.147881]  do_filp_open+0x9b/0x110
[ 6888.147884]  ? __check_object_size+0xaf/0x1b0
[ 6888.147889]  ? __alloc_fd+0x46/0x170
[ 6888.147891]  do_sys_open+0x1bb/0x2c0
[ 6888.147893]  ? do_sys_open+0x1bb/0x2c0
[ 6888.147895]  ? _cond_resched+0x19/0x40
[ 6888.147897]  SyS_openat+0x14/0x20
[ 6888.147903]  do_syscall_64+0x73/0x130
[ 6888.147922]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6888.147928] RIP: 0033:0x7f266ac3bdae
[ 6888.147935] RSP: 002b:00007f266ba6d7a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[ 6888.147947] RAX: ffffffffffffffda RBX: 00007f24d4018fb0 RCX: 00007f266ac3bdae
[ 6888.147952] RDX: 0000000000000000 RSI: 00007f266407c510 RDI: 00000000ffffff9c
[ 6888.147955] RBP: 000055898234b5e0 R08: 0000000000000000 R09: 0000000000000005
[ 6888.147961] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000003f
[ 6888.147967] R13: 0000000000000003 R14: 0000000000000000 R15: 000055898234b5e0
[ 6888.148134] INFO: task systemd-udevd:44154 blocked for more than 120 seconds.
[ 6888.156726]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6888.164921] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6888.174291] systemd-udevd   D    0 44154    975 0x00000104
[ 6888.174293] Call Trace:
[ 6888.174297]  __schedule+0x297/0x8b0
[ 6888.174301]  ? get_work_pool+0x40/0x40
[ 6888.174302]  schedule+0x2c/0x80
[ 6888.174304]  schedule_timeout+0x1cf/0x350
[ 6888.174311]  ? sched_clock+0x9/0x10
[ 6888.174312]  ? sched_clock+0x9/0x10
[ 6888.174317]  ? sched_clock_cpu+0x11/0xb0
[ 6888.174319]  ? get_work_pool+0x40/0x40
[ 6888.174320]  wait_for_completion+0xba/0x140
[ 6888.174322]  ? wake_up_q+0x80/0x80
[ 6888.174325]  flush_work+0x126/0x1e0
[ 6888.174327]  ? worker_detach_from_pool+0xa0/0xa0
[ 6888.174329]  __cancel_work_timer+0x131/0x1b0
[ 6888.174331]  ? exact_lock+0x11/0x20
[ 6888.174337]  cancel_delayed_work_sync+0x13/0x20
[ 6888.174343]  disk_block_events+0x78/0x80
[ 6888.174350]  __blkdev_get+0x69/0x4c0
[ 6888.174360]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6888.174366]  blkdev_get+0x129/0x320
[ 6888.174373]  blkdev_open+0x95/0xf0
[ 6888.174380]  do_dentry_open+0x1c2/0x310
[ 6888.174385]  ? __inode_permission+0x5b/0x160
[ 6888.174391]  ? bd_acquire+0xd0/0xd0
[ 6888.174397]  vfs_open+0x4f/0x80
[ 6888.174398]  path_openat+0x66e/0x1770
[ 6888.174404]  ? _copy_to_user+0x26/0x30
[ 6888.174409]  ? move_addr_to_user+0xc0/0xe0
[ 6888.174411]  do_filp_open+0x9b/0x110
[ 6888.174413]  ? __check_object_size+0xaf/0x1b0
[ 6888.174415]  ? __alloc_fd+0x46/0x170
[ 6888.174417]  do_sys_open+0x1bb/0x2c0
[ 6888.174419]  ? do_sys_open+0x1bb/0x2c0
[ 6888.174421]  SyS_openat+0x14/0x20
[ 6888.174423]  do_syscall_64+0x73/0x130
[ 6888.174425]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6888.174426] RIP: 0033:0x7f58aabf3c8e
[ 6888.174427] RSP: 002b:00007ffcb2fdfa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6888.174429] RAX: ffffffffffffffda RBX: 000055a3414dd820 RCX: 00007f58aabf3c8e
[ 6888.174430] RDX: 00000000000a0800 RSI: 000055a3414fac60 RDI: 00000000ffffff9c
[ 6888.174431] RBP: 000000000000000e R08: 000055a33fda1ba6 R09: 000055a3414e59c0
[ 6888.174431] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6888.174433] R13: 000055a3414e22b0 R14: 000055a3414e22b0 R15: 00007ffcb2fdfb5c
[ 6888.174435] INFO: task systemd-udevd:44169 blocked for more than 120 seconds.
[ 6888.183040]       Tainted: P           OE    4.15.0-23-generic #25-Ubuntu
[ 6888.191271] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6888.200677] systemd-udevd   D    0 44169    975 0x00000104
[ 6888.200679] Call Trace:
[ 6888.200683]  __schedule+0x297/0x8b0
[ 6888.200685]  ? get_work_pool+0x40/0x40
[ 6888.200687]  schedule+0x2c/0x80
[ 6888.200688]  schedule_timeout+0x1cf/0x350
[ 6888.200695]  ? ttwu_do_activate+0x7a/0x90
[ 6888.200696]  ? get_work_pool+0x40/0x40
[ 6888.200697]  wait_for_completion+0xba/0x140
[ 6888.200699]  ? wake_up_q+0x80/0x80
[ 6888.200700]  flush_work+0x126/0x1e0
[ 6888.200702]  ? worker_detach_from_pool+0xa0/0xa0
[ 6888.200704]  __cancel_work_timer+0x131/0x1b0
[ 6888.200706]  ? exact_lock+0x11/0x20
[ 6888.200708]  cancel_delayed_work_sync+0x13/0x20
[ 6888.200709]  disk_block_events+0x78/0x80
[ 6888.200712]  __blkdev_get+0x69/0x4c0
[ 6888.200713]  ? __follow_mount_rcu.isra.26+0x6e/0xf0
[ 6888.200715]  blkdev_get+0x129/0x320
[ 6888.200717]  blkdev_open+0x95/0xf0
[ 6888.200719]  do_dentry_open+0x1c2/0x310
[ 6888.200720]  ? __inode_permission+0x5b/0x160
[ 6888.200722]  ? bd_acquire+0xd0/0xd0
[ 6888.200723]  vfs_open+0x4f/0x80
[ 6888.200725]  path_openat+0x66e/0x1770
[ 6888.200727]  ? _copy_to_user+0x26/0x30
[ 6888.200729]  ? move_addr_to_user+0xc0/0xe0
[ 6888.200731]  do_filp_open+0x9b/0x110
[ 6888.200733]  ? __check_object_size+0xaf/0x1b0
[ 6888.200735]  ? __alloc_fd+0x46/0x170
[ 6888.200737]  do_sys_open+0x1bb/0x2c0
[ 6888.200739]  ? do_sys_open+0x1bb/0x2c0
[ 6888.200741]  SyS_openat+0x14/0x20
[ 6888.200743]  do_syscall_64+0x73/0x130
[ 6888.200745]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 6888.200745] RIP: 0033:0x7f58aabf3c8e
[ 6888.200746] RSP: 002b:00007ffcb2fdfa40 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[ 6888.200747] RAX: ffffffffffffffda RBX: 000055a3414e1b50 RCX: 00007f58aabf3c8e
[ 6888.200748] RDX: 00000000000a0800 RSI: 000055a3414e4f50 RDI: 00000000ffffff9c
[ 6888.200749] RBP: 000000000000000e R08: 000055a33fda1ba6 R09: 000055a341674da0
[ 6888.200750] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 6888.200750] R13: 000055a3414f74e0 R14: 000055a3414f74e0 R15: 00007ffcb2fdfb5c
red-scorp commented 6 years ago

For reference: I had many problems with the SAS3008 and ZFS. What helped a little bit were the following kernel command line options: mpt3sas.msix_disable=1 mpt3sas.missing_delay=60,60 (missing_delay is given in seconds). But it didn't help completely, and it seems the mpt3sas v17.xxx driver is simply broken.
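For anyone wanting to try those options persistently, here is a minimal sketch for a Debian/Ubuntu-style GRUB setup. The helper name add_kernel_params is hypothetical; the demo edits a scratch copy rather than the real /etc/default/grub, which you would follow with update-grub as root.

```shell
#!/bin/sh
# Sketch: persist mpt3sas options on the kernel command line (GNU sed assumed).
# add_kernel_params is a hypothetical helper, not part of any distro tooling.
add_kernel_params() {
    grub_file="$1"; shift
    params="$*"
    # Append the params inside the existing GRUB_CMDLINE_LINUX_DEFAULT="..." value.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $params\"/" "$grub_file"
}

# Demo against a scratch copy instead of the real /etc/default/grub:
tmp=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$tmp"
add_kernel_params "$tmp" "mpt3sas.msix_disable=1 mpt3sas.missing_delay=60,60"
cat "$tmp"
# On a real system, edit /etc/default/grub, then: sudo update-grub && sudo reboot
```

The new options only take effect after the next boot, since they are read by the kernel at module load time.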

d-helios commented 6 years ago

it seams driver mpt3sas v 17.xxx is simply broken.

I compiled the latest version available from Broadcom:

root # modinfo  mpt3sas|grep version
version:        26.00.00.00
srcversion:     06000312D4E7AF494803C08

The same result:

[  197.531899] print_req_error: I/O error, dev sdg, sector 544
[  197.538177] device-mapper: multipath: Failing path 8:96.
[  197.637029] sd 5:0:16:0: [sdr] tag#8484 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  197.637035] sd 5:0:16:0: [sdr] tag#8484 Sense Key : Aborted Command [current]
[  197.637039] sd 5:0:16:0: [sdr] tag#8484 Add. Sense: Nak received
[  197.637043] sd 5:0:16:0: [sdr] tag#8484 CDB: Read(10) 28 00 00 00 00 44 00 00 1c 00
[  197.637046] print_req_error: I/O error, dev sdr, sector 544
[  197.643315] device-mapper: multipath: Failing path 65:16.
[  198.980163] sd 5:0:29:0: [sdad] tag#5976 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
....
....
[  215.586374] device-mapper: multipath: Failing path 8:128.
[  215.586420] device-mapper: multipath: Failing path 8:80.
[  215.586437] sd 5:0:1:0: [sdc] Unaligned partial completion (resid=1024, sector_sz=4096)
[  215.586442] device-mapper: multipath: Failing path 8:32.
[  216.536319] device-mapper: multipath: Failing path 8:48.
[  216.713123] device-mapper: multipath: Reinstating path 65:176.
[  216.714866] device-mapper: multipath: Reinstating path 66:0.
[  216.715650] device-mapper: multipath: Reinstating path 66:32.
[  216.716285] device-mapper: multipath: Reinstating path 8:64.
[  217.717088] device-mapper: multipath: Reinstating path 65:144.
[  217.717552] device-mapper: multipath: Reinstating path 65:224.
[  217.717872] device-mapper: multipath: Reinstating path 66:16.
[  218.722517] device-mapper: multipath: Reinstating path 8:112.
[  219.724357] device-mapper: multipath: Reinstating path 65:160.
[  219.727924] device-mapper: multipath: Reinstating path 8:128.
[  219.728559] device-mapper: multipath: Reinstating path 8:144.
[  220.729821] device-mapper: multipath: Reinstating path 8:32.
[  220.730491] device-mapper: multipath: Reinstating path 8:48.
[  220.730781] device-mapper: multipath: Reinstating path 8:80.
[  220.991361] scsi_io_completion: 7 callbacks suppressed
[  220.991370] sd 5:0:0:0: [sdb] tag#8796 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  220.991377] sd 5:0:0:0: [sdb] tag#8796 Sense Key : Aborted Command [current]
[  220.991382] sd 5:0:0:0: [sdb] tag#8796 Add. Sense: Nak received
[  220.991387] sd 5:0:0:0: [sdb] tag#8796 CDB: Read(10) 28 00 09 74 db 15 00 00 02 00
[  220.991389] print_req_error: 7 callbacks suppressed
[  220.991392] print_req_error: I/O error, dev sdb, sector 1269225640
[  220.998352] device-mapper: multipath: Failing path 8:16.
[  225.750668] device-mapper: multipath: Reinstating path 8:16.
[  407.006320] mpt3sas_cm0: log_info(0x3112010a): originator(PL), code(0x12), sub_code(0x010a)
[  407.006329] sd 5:0:41:0: [sdap] tag#7252 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[  407.006336] sd 5:0:41:0: [sdap] tag#7252 CDB: Test Unit Ready 00 00 00 00 00 00

multipath.conf

blacklist {
}

defaults {
        find_multipaths yes
        user_friendly_names no
        skip_kpartx yes
}

devices {
        device {
            vendor                  "TOSHIBA"
            product                 "*"
            features                "0"
            no_path_retry           "1"
            path_grouping_policy    "multibus"
            path_selector           "round-robin 0"
            path_checker            "directio"
            hardware_handler        "0"
            prio                    "const"
            failback                "manual"
            rr_weight               "uniform"
            rr_min_io               "128"
        }

        device {
            vendor                  "HGST"
            product                 "*"
            features                "0"
            no_path_retry           "1"
            path_grouping_policy    "multibus"
            path_selector           "round-robin 0"
            path_checker            "directio"
            hardware_handler        "0"
            prio                    "const"
            failback                "manual"
            rr_weight               "uniform"
            rr_min_io               "128"
        }
}
kobuki commented 6 years ago

I also have a SAS2008 card in a VM using PCI passthrough, and kernels before 3.x didn't exhibit this problem. The issue has escalated to the point where ZFS regularly fails drives on me. Even if it's not a ZFS fault, there should be a way, such as a setting, to raise the error thresholds for failing drives or to ignore certain kinds of errors. Needless to say, the "failed" drives and the pool work perfectly again after a "zpool clear poolname". I might be forced to get a card with a chip that works with ZFS and sleeping drives if this doesn't get resolved one way or another, but those tend to be a lot more expensive.

EDIT: I tried "mpt3sas.msix_disable=1 mpt3sas.missing_delay=60,60" in the kernel command line, it doesn't help.

cobrafast commented 6 years ago

If you're going to change controllers: I haven't had issues with sleeping and waking drives since using a Dell PERC H310 (which is also a SAS2008-based card) flashed to LSI 9211-8i, together with the kernel modification I described in https://github.com/zfsonlinux/zfs/issues/4713#issuecomment-336567646. Two of these controllers in parallel also seem to work fine like this. Other than that, I built the kernel (4.16) with Arch's default config.

What's weird, though, is that other people who have tried this mod reported back still having problems, so I'm really not sure what's different about my setup compared to other people's.
There are a couple of ways I can think of that this could happen:

If you feel adventurous and want to try compiling and debugging your own kernel, you can spam dump_stack(); pretty much anywhere throughout the source code, especially around SCSI-related stuff; printk(...); should also be available pretty much everywhere. These stack traces and prints will show up in dmesg. A good starting point is drivers/scsi/scsi_logging.c.

If there's anything I can try on my system to try and pinpoint the difference let me know.

chinesestunna commented 5 years ago

Finally, after troubleshooting and Googling for days, I found this thread, and I'm glad it's not a hardware issue. I started encountering these random disk read I/O errors on drive wake recently after updating my base setup and could not for the life of me figure out the issue.

Old config (changed parts in bold):

Supermicro X8DTE motherboard + dual L5630 Xeons
Dell H200 flashed to generic LSI 2008 IT mode
Intel 24-port SAS expander
array0: 10x 3TB Toshiba drives
array1: 8x 2TB Hitachi drives
ESXi 5.5 running a Debian 7 based NAS VM, SAS controller assigned using direct PCIe passthrough

This has been running for almost 4 years now without any issues, stable as a rock. The drives should sleep after 1 hour and array would wake up when needed, nice and snappy too.

New config (changed parts in bold):

Supermicro X9DRD-LN7F motherboard + E5-2670 Xeon
onboard LSI 2308 SAS controller, flashed to P20 IT-mode firmware
Intel 24-port SAS expander
array0: 10x 3TB Toshiba drives
array1: 8x 2TB Hitachi drives
ESXi 6.7 running a Debian 9 based NAS VM, SAS controller assigned using direct PCIe passthrough

The LSI 2008 and 2308 seem almost identical other than the upgrade from PCIe 2.0 to 3.0. As you can see, the base storage system was not really changed, yet I spent weeks troubleshooting "hardware issues":

  1. Tried disabling staggered drive spin-up on the controller
  2. Checked that SCT ERC values are correctly supported on each drive
  3. Increased drive timeout values in the LSI 2308 controller BIOS (the old Dell H200 LSI 2008 didn't even have a BIOS to set this)
  4. Switched back to the Dell H200 as the SAS controller connected to the expander

Each drive drop event usually comes with something like "/dev/sdX: i/o error on sector 214501", which would freak me out, thinking a drive is going bad. I would then run a full badblocks test and SMART short/long tests on it, which would pass with a clean bill of health, leaving me scratching my head; in fact, which drive drops seems a bit random.

Going to start testing the solution by @cobrafast and see if I make any headway. Thanks everyone for all the insights.

chinesestunna commented 5 years ago

Has anyone tried the mpt3sas version 25 drivers? They seem to have been merged into the mainline kernel here: https://github.com/torvalds/linux/commit/f6972d7180909787a324b83bf3f2f0686f22286a

Broadcom/Avago also seems to have released a P26 build of the mpt3sas drivers on their site.

chinesestunna commented 5 years ago

Hmmm... I'm a bit stumped here. I've just attempted a P26 kernel module install from the Broadcom/Avago package: https://www.broadcom.com/products/storage/host-bus-adapters/sas-9300-8e#downloads

Here's the install:

root@openmediavault:~/lsidriver/debian/rpms-1# dpkg -i mpt3sas-26.00.00.00-1_Debian8.0.amd64.deb
Selecting previously unselected package mpt3sas.
(Reading database ... 49001 files and directories currently installed.)
Preparing to unpack mpt3sas-26.00.00.00-1_Debian8.0.amd64.deb ...
pre 26.00.00.00
Unpacking mpt3sas (26.00.00.00-1) ...
Setting up mpt3sas (26.00.00.00-1) ...
post 26.00.00.00
post Install Done.

Verifying the install:

root@openmediavault:~/lsidriver/debian/rpms-1# dpkg -s mpt3sas
Package: mpt3sas
Status: install ok installed
Priority: extra
Section: alien
Installed-Size: 4049
Maintainer: root <root@debian80x8664>
Architecture: amd64
Version: 26.00.00.00-1
Description: LSI MPT Fusion drivers for SAS 3.0
 Drivers for (i686, x86_64 and updates) for the
 LSI Corporation MPT Fusion Architecture parts.
 .
 (Converted from a rpm package by alien version 8.92.)

Seems good, we're at version 26, but on reboot:

root@openmediavault:~/lsidriver/debian/rpms-1# modinfo mpt3sas
filename:       /lib/modules/4.16.0-0.bpo.2-amd64/kernel/drivers/scsi/mpt3sas/mpt3sas.ko
alias:          mpt2sas
version:        17.100.00.00
license:        GPL
description:    LSI MPT Fusion SAS 3.0 Device Driver
author:         Avago Technologies <MPT-FusionLinux.pdl@avagotech.com>
srcversion:     CC159339243D2AF95E0DD16

The system still seems to load the version 17 driver from somewhere; the modinfo mpt3sas output has not changed at all.
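A quick way to see which copy of the module the kernel will actually resolve is modinfo -n, and a stale initramfs is a common reason the old version keeps loading. Here is a sketch; the version_ge helper is hypothetical, and the version strings are the ones from this thread.

```shell
#!/bin/sh
# Sketch: compare the loaded driver version against the one you installed.
# version_ge: true if $1 >= $2 in dotted-version order (uses GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n1)" = "$1" ]
}

# On the affected host you would run (as root):
#   modinfo -n mpt3sas            # path to the .ko the kernel will load
#   modinfo -F version mpt3sas    # its version string
#   depmod -a                     # refresh module dependency maps
#   update-initramfs -u           # rebuild so the initramfs picks up the new .ko

version_ge "26.00.00.00" "17.100.00.00" && echo "new driver wins"
version_ge "17.100.00.00" "26.00.00.00" || echo "old driver is stale"
```

If modinfo -n still points into the stock kernel tree after installing the vendor package, the new .ko was installed somewhere that does not take precedence, which matches what the thread observes.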

MyPod-zz commented 5 years ago

"from somewhere" is "from linux-image-amd64", I don't think something else would install under /lib/modules//kernel. You may want to check where the new kmod is being installed and the mechanisms made available to provide and use an updated version without removing it.

kobuki commented 5 years ago

@chinesestunna: did you do an update-initramfs -u after installing the new module?

chinesestunna commented 5 years ago

@MyPod It seems another user had issues installing; they got errors during the install, manually copied the files, and got it to work: https://askubuntu.com/questions/891281/how-to-correctly-install-update-a-deb-3rd-party-driver-on-ubuntu-16-04-2-tls-se Not sure if this is what you're referring to? I have not resorted to that approach, as I didn't get errors installing the .deb package from Broadcom.

@kobuki Yep, sure did, twice, and rebooted twice.

Did a bit more digging. I was curious whether the version 26.00.00.00-1 only applies to mpt3sas, so I dug into the source code of the package; mpt3sas_base.h clearly shows that the mpt2sas component should be version 20.00.03.00. So while not 26.xxxxxx, it definitely shouldn't be 17.xxxxxx either:

/* mpt2sas driver versioning info */
#define MPT2SAS_DRIVER_NAME     "mpt2sas"
#define MPT2SAS_DESCRIPTION "LSI MPT Fusion SAS 2.0 Device Driver"
#define MPT2SAS_DRIVER_VERSION      "20.00.03.00"
#define MPT2SAS_MAJOR_VERSION       20
#define MPT2SAS_MINOR_VERSION       0
#define MPT2SAS_BUILD_VERSION       3
#define MPT2SAS_RELEASE_VERSION     0
MyPod-zz commented 5 years ago

I was expecting some issue of that form, where either the driver doesn't support the kernel you are running or is installed in a directory where it won't have precedence over the kernel-provided one.

chinesestunna commented 5 years ago

Got further. @MyPod I believe you are on the right path: the .deb packages in Broadcom's ZIP file only contain compiled modules for older kernel builds for Debian (3.x.x) and Ubuntu (4.5.xx), while my OpenMediaVault kernel (based on Debian 9) is currently 4.16.0-0.bpo.2-amd64. That may be why it's not working :( I'm trying to get things set up to manually compile mpt3sas.ko for my system now.

splitice commented 5 years ago

@chinesestunna Any progress to report?

I've been running mine with increased timeouts in the BIOS, which provides a slight improvement. Eventually, however, the errors still occur and the drives drop.

chinesestunna commented 5 years ago

Two quick updates:

1. The VM I'm running is OpenMediaVault v4, based on Debian 9. Unfortunately it's running a non-generic kernel and I can't for the life of me get the driver to compile; it's Linux 4.16.0-0.bpo.2-amd64, and ./compile.sh threw errors I couldn't troubleshoot. I spun up a basic Debian 9 build, which had a 4.9 kernel running the P17 module, and there I was able to compile and update to P26 (from Broadcom), just to test the compile.
2. Just as I stopped working on this, OMV updated the kernel to Linux 4.17.0-0.bpo.1-amd64, and I installed it. So far I still see those "device reset: FAILED" dmesg entries, and basically the 2308 controller resets the expander and powers on both arrays (18 drives), but so far no more "CDR read error sector XYZ" on devices like before. Array spin-up also seems much faster... Still on the P17 module, though.

splitice commented 5 years ago

I've compiled the newest driver for my Ubuntu 16.04 (4.15.0-15-generic #16~16.04.1-Ubuntu SMP).

root@SplitNAS:~# modinfo mpt3sas
filename:       /lib/modules/4.15.0-15-generic/kernel/drivers/scsi/mpt3sas/mpt3sas.ko
alias:          mpt2sas
version:        26.00.00.00

I'll keep an eye on it.

crimp42 commented 5 years ago

My recollection was that the issue was not based on the version of mpt3sas (or mpt2sas for that matter) but the kernel.

I think I tested the same version of mptXsas on a 2.x kernel vs 3.x kernel, and the issue occurred with the same version on 3.x but not 2.x.


chinesestunna commented 5 years ago

Yes, I agree that's most likely the main culprit, but we don't have many options at this point. 2.x kernels are EOL, and I noticed your tests were done with up to P20 of mpt2sas, so I figured I'd give it a shot. I've also increased the device timeout to 60 seconds on both the controller and in Linux, so perhaps that's helping? It's only been a few days; I'll report back after a couple of weeks.
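For the Linux-side timeout, the usual way to make it survive reboots and hotplug is a udev rule that writes /sys/block/<dev>/device/timeout. A sketch follows; the rule file name, the 60-second value, and the make_timeout_rule helper are assumptions, and KERNEL=="sd[a-z]" only covers single-letter device names.

```shell
#!/bin/sh
# Sketch: generate a udev rule that sets the SCSI command timeout for sd* disks.
make_timeout_rule() {
    secs="$1"
    printf 'ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/bin/sh -c '\''echo %s > /sys/block/%%k/device/timeout'\''"\n' "$secs"
}

rule=$(make_timeout_rule 60)
echo "$rule"
# On a real system (as root):
#   make_timeout_rule 60 > /etc/udev/rules.d/60-scsi-timeout.rules
#   udevadm control --reload
```

This mirrors what a one-off "echo 60 > /sys/block/sdX/device/timeout" does, but applies it automatically whenever a disk appears.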

crimp42 commented 5 years ago

Yes, definitely give it some time. I do believe a few different things I have tried based on this thread have changed the occurrence rate of this issue. I would say that, compared to a year ago, the issue now shows up in the logs at 1/10 the frequency. But it still occurs.

If I remember correctly, I believe this is what led to a lot fewer of the errors for me: GRUB_CMDLINE_LINUX_DEFAULT="mpt3sas.msix_disable=1"

And we know that if you just don't let the drives fall asleep, the issue never occurs. (Not acceptable to me.)
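For context, the "never let the drives sleep" workaround is typically done with hdparm; a sketch (the device path is a placeholder):

```
# Disable the standby (spindown) timer so the drive never idles down.
sudo hdparm -S 0 /dev/sdX

# On drives that manage power aggressively, disabling APM may also be needed.
sudo hdparm -B 255 /dev/sdX
```

This trades power consumption and drive wear for avoiding the wake-up I/O errors, which is exactly the trade-off rejected above.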


chinesestunna commented 5 years ago

Have you had drive drops from the array at all? My unscientific observation is that there are two types of errors happening: wake-up errors and random read errors.

The wake-up errors are what seem to be causing drive drops for me; interestingly, the read errors haven't really surfaced either since the 4.17 update. Just some anecdotes.

d-helios commented 5 years ago

https://github.com/zfsonlinux/zfs/issues/4713#issuecomment-404237782

I changed the multipath policy and the problem is gone.

blacklist {
}

defaults {
        find_multipaths         "yes"
        user_friendly_names     "no"
        skip_kpartx             "yes"
        features                "0"
        no_path_retry           "1"
        path_grouping_policy    "failover"
        path_selector           "service-time 0"
        uid_attribute           "ID_SERIAL"
        path_checker            "tur"
}

devices {
        device {
            vendor                  "TOSHIBA"
            product                 "PX0[4-5]*"

        }

        device {
            vendor                  "HGST"
            product                 "HUC*"
            rr_min_io               "128"
        }
}

Kernel parameters

BOOT_IMAGE=/boot/vmlinuz-4.15.0-29-generic root=/dev/mapper/znstor8--n1--vg-root ro console=tty1 console=ttyS0,115200 dm_mod.use_blk_mq=y scsi_mod.use_blk_mq=y transparent_hugepage=never processor.max_cstate=1 udev.children-max=32 mpt3sas.msix_disable=1
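After editing `/etc/multipath.conf` as above, the daemon needs to re-read it; the usual sequence (a generic sketch, not taken from this thread) is:

```
sudo multipath -t                 # print the parsed configuration for a sanity check
sudo systemctl reload multipathd  # apply the new config
sudo multipath -ll                # verify the resulting path groups and policies
```
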
kobuki commented 5 years ago

@d-helios what did you have to change, exactly? Would it help even if I'm not using multipath at all?

red-scorp commented 5 years ago

@d-helios please provide more detailed information about your solution. Thanks in advance!

chinesestunna commented 5 years ago

@d-helios did that also fix the issues with random ATA read errors? So far I haven't had drops, but I still get random read errors like:

    [Sun Aug 19 00:53:06 2018] print_req_error: I/O error, dev sde, sector 122132520
    [Sun Aug 19 12:06:08 2018] sd 0:0:0:0: [sda] tag#2695 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
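When the errors are this sporadic, it helps to tally them per device over time instead of eyeballing dmesg. A small hypothetical helper (not part of any tool mentioned in this thread) that counts kernel I/O-error lines:

```python
import re
from collections import Counter

# Matches kernel lines like:
#   print_req_error: I/O error, dev sde, sector 122132520
#   blk_update_request: I/O error, dev sdd, sector 1688054984
IO_ERR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def tally_io_errors(dmesg_lines):
    """Count I/O errors per block device from dmesg-style lines."""
    counts = Counter()
    for line in dmesg_lines:
        m = IO_ERR.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "[Sun Aug 19 00:53:06 2018] print_req_error: I/O error, dev sde, sector 122132520",
    "[ 3647.748388] blk_update_request: I/O error, dev sdd, sector 1688054984",
    "[ 3647.748403] blk_update_request: I/O error, dev sde, sector 3022432880",
]
print(tally_io_errors(sample))  # Counter({'sde': 2, 'sdd': 1})
```

Feeding it `dmesg` output collected over a few weeks makes it easier to tell whether a tuning change (timeouts, msix_disable, multipath policy) actually reduced the error rate.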

red-scorp commented 5 years ago

@chinesestunna try the second kernel argument I mentioned above and see if it helps you.

chinesestunna commented 5 years ago

@red-scorp Thanks, man! Just had a string of errors and a drive drop after a few days of quiet, so it seems my issue is still there. I've added the boot flag you mentioned and will give it a try. None of my errors mention multipath, nor do I have multipath running, so I'm not going to mess with that for now.

chinesestunna commented 5 years ago

Unfortunately, even with mpt3sas.msix_disable=1 I just had another drop :( My system does not have multipath configured, so that shouldn't have an impact; there's not even a multipath.conf file.