openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

the return of "Unaligned write command" errors #10094

Open dd1dd1 opened 4 years ago

dd1dd1 commented 4 years ago

Reporting an unusual situation. I have a ZFS mirror across two 1TB SSDs. It regularly spews "Unaligned write command" errors. From reading reports here and elsewhere, this problem used to exist, was fixed years ago, and is not supposed to happen today. So, a puzzle.

It turns out that the two SSDs report different physical sector size, one reports 512 bytes, one reports 4096 bytes. Same vendor, same model, same firmware. (WTH?!?)

zpool reports the default ashift of 0 (autodetect). zdb reports ashift 12 (correct for 4096-byte sectors).

So everything seems to be correct, but the errors are there.
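
For reference, this is the kind of check involved; a quick way to compare what the drives report against what the pool actually uses (the pool name below is a placeholder):

lsblk -o NAME,MODEL,PHY-SEC,LOG-SEC    # physical vs. logical sector size per drive
zdb -C poolname | grep ashift          # ashift recorded in the pool's cached config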

The "unaligned write command" errors only some from the "4096 bytes" SSD. After these write errors, "zpool scrub" runs without errors ("repaired 0B"). Two other zfs mirror pools on the same machine run without errors (2x10TB and 2x6TB disks, all report 4096 physical sector size).

K.O.

G8EjlKeK7CwVQP2acz2B commented 2 years ago

I have two disks out of 13 that are repeatably plagued with this error -- one WD30EFRX, one a brand new WD30EFZX. The other 11 seem to operate reliably in both resilver and scrub. Go figure...

Currently using dual 88SE9230 PCIe 4-port SATA cards (started with one and the onboard AMD 400 series chipset). I tried changing to a 9207-8i controller -- based on reports of success with lsilogic -- mainly to test whether a different driver would avoid the issue -- and that was a disaster; I could not get Ubuntu 20.04 to even reliably recognize the presence of all the disks.

Tried noncq in the ata driver (libata.force=noncq); no impact on the bug.

I'm sorely tempted to dive into the driver and at least try to log the SATA command history to get some info about what exactly is going south, but I am in the middle of a house remodel and don't have time for all this.

georgewhewell commented 2 years ago

I am also seeing these same errors on an X10SDV. 4x HDD connected to the first 4 SATA ports: no problem. 2x Samsung 870 QVO: errors occur after heavy I/O. They were occurring immediately after installation, so it seems unlikely that I received two defective disks. Things that did not help:

livelace commented 2 years ago

@georgewhewell

Volumes or datasets? Compression on (which type, if on)?

georgewhewell commented 2 years ago

Volumes or datasets? Compression on (which type, if on)?

Just a dataset, no compression:

$ sudo zfs get -o all all fpool/root/Home
fpool/root/Home                                      type                  filesystem                    -         -
fpool/root/Home                                      creation              Fri Apr  2 13:30 2021         -         -
fpool/root/Home                                      used                  2.83T                         -         -
fpool/root/Home                                      available             10.5T                         -         -
fpool/root/Home                                      referenced            2.82T                         -         -
fpool/root/Home                                      compressratio         1.00x                         -         -
fpool/root/Home                                      mounted               yes                           -         -
fpool/root/Home                                      quota                 none                          -         default
fpool/root/Home                                      reservation           none                          -         default
fpool/root/Home                                      recordsize            128K                          -         default
fpool/root/Home                                      mountpoint            legacy                        legacy    received
fpool/root/Home                                      sharenfs              off                           -         default
fpool/root/Home                                      checksum              on                            -         default
fpool/root/Home                                      compression           off                           -         default
fpool/root/Home                                      atime                 on                            -         default
fpool/root/Home                                      devices               on                            -         default
fpool/root/Home                                      exec                  on                            -         default
fpool/root/Home                                      setuid                on                            -         default
fpool/root/Home                                      readonly              off                           -         default
fpool/root/Home                                      zoned                 off                           -         default
fpool/root/Home                                      snapdir               hidden                        -         default
fpool/root/Home                                      aclmode               discard                       -         default
fpool/root/Home                                      aclinherit            restricted                    -         default
fpool/root/Home                                      createtxg             1485157                       -         -
fpool/root/Home                                      canmount              on                            -         default
fpool/root/Home                                      xattr                 on                            -         default
fpool/root/Home                                      copies                1                             -         default
fpool/root/Home                                      version               5                             -         -
fpool/root/Home                                      utf8only              off                           -         -
fpool/root/Home                                      normalization         none                          -         -
fpool/root/Home                                      casesensitivity       sensitive                     -         -
fpool/root/Home                                      vscan                 off                           -         default
fpool/root/Home                                      nbmand                off                           -         default
fpool/root/Home                                      sharesmb              off                           -         default
fpool/root/Home                                      refquota              none                          -         default
fpool/root/Home                                      refreservation        none                          -         default
fpool/root/Home                                      guid                  1183886347832676596           -         -
fpool/root/Home                                      primarycache          all                           -         default
fpool/root/Home                                      secondarycache        all                           -         default
fpool/root/Home                                      usedbysnapshots       16.1G                         -         -
fpool/root/Home                                      usedbydataset         2.82T                         -         -
fpool/root/Home                                      usedbychildren        0B                            -         -
fpool/root/Home                                      usedbyrefreservation  0B                            -         -
fpool/root/Home                                      logbias               latency                       -         default
fpool/root/Home                                      objsetid              100886                        -         -
fpool/root/Home                                      dedup                 off                           -         default
fpool/root/Home                                      mlslabel              none                          -         default
fpool/root/Home                                      sync                  disabled                      disabled  received
fpool/root/Home                                      dnodesize             legacy                        -         default
fpool/root/Home                                      refcompressratio      1.00x                         -         -
fpool/root/Home                                      written               0                             -         -
fpool/root/Home                                      logicalused           2.81T                         -         -
fpool/root/Home                                      logicalreferenced     2.80T                         -         -
fpool/root/Home                                      volmode               default                       -         default
fpool/root/Home                                      filesystem_limit      none                          -         default
fpool/root/Home                                      snapshot_limit        none                          -         default
fpool/root/Home                                      filesystem_count      none                          -         default
fpool/root/Home                                      snapshot_count        none                          -         default
fpool/root/Home                                      snapdev               hidden                        -         default
fpool/root/Home                                      acltype               off                           -         default
fpool/root/Home                                      context               none                          -         default
fpool/root/Home                                      fscontext             none                          -         default
fpool/root/Home                                      defcontext            none                          -         default
fpool/root/Home                                      rootcontext           none                          -         default
fpool/root/Home                                      relatime              on                            -         temporary
fpool/root/Home                                      redundant_metadata    all                           -         default
fpool/root/Home                                      overlay               on                            -         default
fpool/root/Home                                      encryption            off                           -         default
fpool/root/Home                                      keylocation           none                          -         default
fpool/root/Home                                      keyformat             none                          -         default
fpool/root/Home                                      pbkdf2iters           0                             -         default
fpool/root/Home                                      special_small_blocks  0                             -         default
fpool/root/Home                                      nixos:shutdown-time   Fri  3 Jun 12:34:55 BST 2022  -         inherited from fpool
livelace commented 2 years ago

I see. Anyway, I think this is a Samsung problem (something in their specific internals). I have had problems only with this vendor. I use ZFS with Adaptec RAID, Hynix, Seagate, WD, Intel, LVM, LUKS, etc., and have had problems only with Samsung.

stuckj commented 2 years ago

It's not a Samsung problem. Several of us (including me) saw it with other devices. Though Samsung has certainly had SSD firmware issues. I also have an X10SDV @georgewhewell. Updating to the latest proxmox 7 so far has fixed it for me. At least, I haven't seen it since updating a few months ago. I'm running 7.2-4 now (this kernel specifically: Linux 5.13.19-6-pve #1 SMP PVE 5.13.19-15). Not sure if you're using proxmox or not. If not, then perhaps just updating to a later 5.x kernel will help you on whatever distro you're on.

bddali commented 2 years ago

Ran into this issue a couple of days ago on Rocky Linux 8.6 with ZFS 2.1.5.

Migrated data off LVMRAID, created a ZFS raidz1 on the same disks and rsynced the data back. During rsync one of the disks was marked FAULTED by ZFS.

Smartctl didn't indicate any hardware errors. Ran "zpool clear"; during resilver, all disks in the vdev had CKSUM errors.

Checked dmesg, all disks in the vdev reported read or write issues. Seemed unlikely that all drives were going bad at the same time, also unlikely that all SATA cables were faulty.

Dmesg also contained reports of "Unaligned write command" and Google led me to this issue, https://github.com/openzfs/zfs/issues/10094#issuecomment-623603031 had the clue I needed:

echo max_performance | sudo tee /sys/class/scsi_host/host*/link_power_management_policy

I use the tuned profile "balanced" in Rocky Linux; it turns out it sets ALPM to "medium_power", so I changed that to "max_performance".

Restarted rsync, after completion (no more errors) I have run multiple scrubs with zero errors. Disabling link power management seems to have solved the problem for me.
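
For tuned users, a persistent alternative to echoing into sysfs is a small override profile; a minimal sketch, assuming tuned's scsi_host plugin and a made-up profile name "balanced-alpm":

# /etc/tuned/balanced-alpm/tuned.conf
[main]
include=balanced

[scsi_host]
alpm=max_performance

Activate it with: sudo tuned-adm profile balanced-alpm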

EDIT

Hardware used:

AMD Ryzen 5700X CPU
Gigabyte X570S UD motherboard
Kingston KSM32ED8 ECC memory
Western Digital WD Red Plus WD40EFZX disks

artw commented 2 years ago

I got fed up with it and replaced my Samsung SSD 850 PRO 1TB with a WDS200T2B0A-00SM50, same cables and all. Did a full scrub and some random write benchmarks to be sure. And guess what: no more failures. I guess it's something in the Samsung firmware that does not like what ZFS is specifically doing with it, because the drive is fine and takes whatever fio shenanigans I throw at the raw block device. It is now happily serving as an external drive for a PS5.

artw commented 2 years ago

I use the tuned profile "balanced" in Rocky Linux; it turns out it sets ALPM to "medium_power", so I changed that to "max_performance".

Apparently this is a must for ZFS, at least with SATA. This should be documented.

stuckj commented 2 years ago

I use the tuned profile "balanced" in Rocky Linux; it turns out it sets ALPM to "medium_power", so I changed that to "max_performance".

Apparently this is a must for ZFS, at least with SATA. This should be documented.

Agreed, though this still doesn't explain the problem for those of us who did that early on (or don't even have Samsung devices) and saw the same problems on multiple brand-new SSDs simultaneously. Happily, I still haven't seen problems since updating to proxmox 7.1... hoping it's not just luck so far. :-P

ykun92 commented 2 years ago

I have met the same issue. I'm running 3 servers, each with 2x 2TB Crucial MX500 SSDs in RAID0 via mdadm. I am not using ZFS, but the same "Sense: Unaligned write command" error comes up randomly.

The OS is Ubuntu Server 20.04 with kernel 5.15.0-46-generic.

kernel: [269471.003856] ata2.00: exception Emask 0x10 SAct 0x20000000 SErr 0x2c0100 action 0x6 frozen
kernel: [269471.003887] ata2.00: irq_stat 0x08000000, interface fatal error
kernel: [269471.003896] ata2: SError: { UnrecovData CommWake 10B8B BadCRC }
kernel: [269471.003916] ata2.00: failed command: READ FPDMA QUEUED
kernel: [269471.003926] ata2.00: cmd 60/00:e8:00:0d:04/02:00:00:00:00/40 tag 29 ncq dma 262144 in
kernel: [269471.003926]          res 40/00:ec:00:0d:04/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
kernel: [269471.003951] ata2.00: status: { DRDY }
kernel: [269471.003965] ata2: hard resetting link
kernel: [269471.479725] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
kernel: [269471.480710] ata2.00: supports DRM functions and may not be fully accessible
kernel: [269471.482862] ata2.00: supports DRM functions and may not be fully accessible
kernel: [269471.483668] ata2.00: configured for UDMA/133
kernel: [269471.493807] sd 1:0:0:0: [sda] tag#29 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
kernel: [269471.493810] sd 1:0:0:0: [sda] tag#29 Sense Key : Illegal Request [current]
kernel: [269471.493812] sd 1:0:0:0: [sda] tag#29 Add. Sense: Unaligned write command
kernel: [269471.493814] sd 1:0:0:0: [sda] tag#29 CDB: Read(10) 28 00 00 04 0d 00 00 02 00 00
kernel: [269471.493814] blk_update_request: I/O error, dev sda, sector 265472 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0
kernel: [269471.493833] ata2: EH complete

But a strange thing on my servers is that this error ONLY occurs on /dev/sda of each server.

In my environment, every time this error happens, it increases the UDMA_CRC_Error_Count in SMART:

Server    Device    UDMA CRC errors
server-1  /dev/sda  109
          /dev/sdb  0
server-2  /dev/sda  4
          /dev/sdb  0
server-3  /dev/sda  1
          /dev/sdb  0

So it only occurs on /dev/sda (the difference between servers is explained by their running time: server1 >> server2 > server3).
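
The counter is easy to watch directly, assuming the drives expose SMART attribute 199 under this name:

sudo smartctl -A /dev/sda | grep UDMA_CRC_Error_Count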

I have no idea how to solve this problem. I have tried swapping SATA cables; that did not work.

I will try setting link_power_management_policy to max_performance and see the result.

By the way, the following change is not permanent; it will disappear after a reboot:

echo max_performance | sudo tee /sys/class/scsi_host/host*/link_power_management_policy

For a permanent change, add a file under /etc/udev/rules.d, name it something like 60-scsi.rules, and give it content like the following:

KERNEL=="host[0-2]", SUBSYSTEM=="scsi_host", ATTR{link_power_management_policy}="max_performance"
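
To apply the rule without rebooting and confirm it took effect, something like this should work (note the rule above matches only host0-host2; KERNEL=="host*" would cover all hosts):

sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=scsi_host
grep . /sys/class/scsi_host/host*/link_power_management_policy
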
stuckj commented 2 years ago

Do you know what SATA controller(s) your servers use? Given that it's not ZFS at all (and assuming it's not a hardware problem on all three), that seems to point to something in the kernel, such as the controller driver.

ykun92 commented 2 years ago

Here you are

lspci result

$ sudo lspci -v -s 05:00.0
05:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81) (prog-if 01 [AHCI 1.0])
        Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
        Flags: bus master, fast devsel, latency 0, IRQ 38
        Memory at fcd01000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=1/2 Maskable- 64bit+
        Capabilities: [d0] SATA HBA v1.0
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270] Secondary PCI Express
        Capabilities: [400] Data Link Feature <?>
        Capabilities: [410] Physical Layer 16.0 GT/s <?>
        Capabilities: [440] Lane Margining at the Receiver <?>
        Kernel driver in use: ahci
        Kernel modules: ahci

$ sudo lspci -v -s 05:00.1
05:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81) (prog-if 01 [AHCI 1.0])
        Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
        Flags: bus master, fast devsel, latency 0, IRQ 45
        Memory at fcd00000 (32-bit, non-prefetchable) [size=2K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
        Capabilities: [64] Express Endpoint, MSI 00
        Capabilities: [a0] MSI: Enable+ Count=2/2 Maskable- 64bit+
        Capabilities: [d0] SATA HBA v1.0
        Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Kernel driver in use: ahci
        Kernel modules: ahci

I think this is the SATA controller embedded in the Ryzen 5000 mobile series.

And here is some other information:

Server    CPU               Memory                                        SSD                          SSD firmware
server-1  AMD Ryzen 5900HX  Essencore KD4BGSA8C-32N220D DDR4-3200 32GBx2  Crucial CT2000MX500SSD1 x2   M3CR033
server-2  AMD Ryzen 5900HX  Essencore KD4BGSA8C-32N220D DDR4-3200 32GBx2  Crucial CT2000MX500SSD1 x2   M3CR043
server-3  AMD Ryzen 5900HX  Crucial CT32G4SFD832A DDR4-3200 32GBx2        Crucial CT2000MX500SSD1 x2   M3CR045
ykun92 commented 2 years ago

Here are some of my analyses and guesses.

From this thread, I found that these issues have some common points.

So my guess is that there may be some issue in software RAID dealing with ALPM (e.g. not knowing what to do when a disk in the RAID is put to sleep by ALPM).

Anyway, I will check whether changing ALPM to max_performance solves this issue or not.


And the ALPM document from Red Hat has some interesting information.

At the end of that document, it says:

Setting ALPM to min_power or medium_power will automatically disable the "Hot Plug" feature.

So if you are running RAID1, RAID5, etc. and need to hot-swap a failed disk, you need to set ALPM to max_performance anyway.

stuckj commented 2 years ago

I had the problem with max_performance set. For me, the problem only happened on SSDs (which I think was a common theme in this thread), and it went away completely when not using the onboard SATA. It now also seems to have gone away completely after I updated to the latest proxmox kernel. But you're on a newer kernel than what I'm running (the 5.13.19-15 PVE kernel) and still hit the issue.

ykun92 commented 2 years ago

Ummm... it seems many factors cause this error, or you just forgot to make the max_performance setting permanent and rebooted :D

I will watch for some weeks to see whether the max_performance setting solves the problem in my case and report back here.

bjquinn commented 2 years ago

@stuckj I just want to mention that I only had the issue with HDDs and my SSDs worked fine, so I don't think that SSDs are the common factor. This was on half a dozen different servers.

I do agree that the problem completely went away when I abandoned the onboard SATA for an LSI HBA.

I tried a lot of things, but I'm not sure I ever tried the max performance thing. I now have the latest Proxmox, but I have already made the hardware changes with the HBA, and I'm not interested in messing with the hardware to trigger this problem again, so I don't know whether latest Proxmox would solve my problem or not. :)

ykun92 commented 2 years ago

And after a month of monitoring, I am now sure that setting link_power_management_policy to max_performance solved the "Unaligned write command" issue completely, at least in my environment.

I also confirmed the issue comes back when turning link_power_management_policy back to its default value.

Just for your information.

csarn commented 2 years ago

Thanks for that information, I'm setting that right now! If I remember to do so, I can also report back if that fixed it for me. But I got the errors infrequently, so to definitely consider it fixed I'll have to wait a few months. By the way, are there better ways to set the link_power_management_policy than via a systemd oneshot service? That is what I have done.
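
For reference, a minimal sketch of such a oneshot unit (unit name and paths illustrative):

# /etc/systemd/system/sata-alpm.service
[Unit]
Description=Set SATA link power management to max_performance

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'for f in /sys/class/scsi_host/host*/link_power_management_policy; do echo max_performance > "$f"; done'

[Install]
WantedBy=multi-user.target

Enable it with: sudo systemctl daemon-reload && sudo systemctl enable --now sata-alpm.service. The udev rule shown earlier in the thread achieves the same without a unit.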

jay-to-the-dee commented 2 years ago

Also encountered this. I set link_power_management_policy to max_performance as above, as well as setting the drive settings in the gnome-disks GUI to never power down, and disabling NCQ. Doing so has indeed stabilised this issue, showing this is a software configuration issue.

I had already ruled out the SATA controller itself as being the issue, as the same hard drive encountered this same issue, no matter whether it was plugged into the motherboard's SATA ports or the HBA card's ports.

loop-evgeny commented 2 years ago

Seems that https://bugzilla.kernel.org/show_bug.cgi?id=203475 is related: Samsung EVO 860/870 firmware is reported to have issues with NCQ + trim.

The latest comment in that bug mentions that updating the drive firmware fixes the issue (for Samsung EVO 860/870 drives). Is it safe to do the drive firmware update without taking the whole machine down by just offlining one drive at a time with zpool offline?
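
The general pattern would look something like the sketch below; this is not an assertion that a live firmware update is safe for any particular drive, and the pool/device names are placeholders:

sudo zpool offline tank ata-Samsung_SSD_860_EVO_XXXX   # take one mirror member offline
# ...update the firmware on that drive, power-cycling it if the updater requires...
sudo zpool online tank ata-Samsung_SSD_860_EVO_XXXX    # bring it back; ZFS resilvers the delta
sudo zpool status tank                                 # wait for the resilver before the next drive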

Jibun-no-Kage commented 1 year ago

Very informative read, a bit long, but worth the time. I was struggling to isolate some errors I was seeing: smartmontools' smartctl was reporting ICRC ABRT errors for multiple drives. I took the test effort back to the lowest common level and started testing. The interesting thing was that it seemed like a power-level issue or a backplane board issue in my storage frame, but as it turned out it was one of the internal cables from the backplane board (1 of 2) to the eSATA port transition out of the storage frame case, which is an 8-bay eSATA tower.

For those that might be interested, below is the test methodology I used...

WARNING... I DID NOT NEED TO MAINTAIN THE DATA IN THE GIVEN STORAGE FRAME.... The following steps will overwrite existing data, be careful.

I added a port-multiplier PCIe card/adapter; that did not solve the issue. So the mainboard eSATA ports and the PCIe card/adapter both seem fine, but I was still getting random read/write errors. So I disabled NCQ; still errors.

Then I replaced the external cables from the server to the storage frame; still errors.

I crossed the internal SATA cables of the backplane, since it was a split design with two boards, each supporting 4 devices. The issue at first seemed to move from backplane 1 to backplane 2, but as I did a bit more testing, I started getting reports of bad sectors on drives I believed were fine, drives passing various SMART tests via smartmontools.

I pulled some additional drives from my spare parts and swapped 4 drives on backplane 1; now errors were reported on the drives just swapped in. So I swapped in more drives; still errors on backplane 2 as well. I did more SMART tests; the drives seemed fine, and when used on a different system there were no errors.

So I replaced the internal SATA cables as well, and things seemed to stabilize. Then I used 'dd' to do an exhaustive random write to all sectors, first just the drives on backplane 1: no errors. About 2 hours later, still OK. That is a good sign.

Then I did the same test with the same set of now-believed-good drives and the new internal SATA cables, once connected to backplane 1 and once connected to backplane 2. Even better: no errors. So now I knew the external cables and internal cables seemed good. And the backplane boards seem good.

After 2 hours of constant random writes to all 8 drives, still no errors. I will let the test continue for a couple more hours, but when errors occurred before, it only took minutes to about an hour to get 10s to 100s of errors across all 8 drives. This also pushes the power supply in the storage frame, since all 8 drives are racing to slam data to sectors exhaustively. Oh, minor, but I did confirm that write cache was off during the dd tests. You might want to set write cache on or off depending on your use/test case; in my case I need data saved, not performance, so write cache stays off.

I still need to re-enable NCQ to confirm everything is completely legit, just to do that last step of validation. Of course, setting up an mdadm RAID set (say, one RAID 5 set per backplane), or setting up a ZFS pool per backplane or across all 8 drives, would also work as a test scenario, but using 'dd' was easy. And I wanted to make sure the power supply was stable.

Even using fio would be applicable, now that I think about it.
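
A destructive write pass of the kind described might look like this; /dev/sdX is a placeholder, and the command overwrites the entire device:

sudo hdparm -W 0 /dev/sdX    # disable the drive's write cache first, as above
sudo dd if=/dev/urandom of=/dev/sdX bs=1M oflag=direct status=progress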

Why the errors? It seems the internal SATA cables from the backplanes to the external eSATA port sockets just aged badly. The case's internal temperature gets pretty warm, even with fans and venting, and the airflow from the power supply and drives, you guessed it, goes right through the internal SATA cables on its way out. The case has rear fan exhaust but no top exhaust; if it were possible, I would add a vent or, even better, a top exhaust fan.

I hope those who find this find it helpful.

Shellcat-Zero commented 1 year ago

I only started seeing this error after upgrading from Ubuntu 18.04 to 22.04 and rebuilding the pool. I've replaced SATA cables and some disks, and the issue persists. I suspect it has something to do with how non-Solaris ZFS currently handles NCQ. I stumbled across this thread, which quotes this article from 2009:

SATA disks do Native Command Queuing while SAS disks do Tagged Command Queuing, this is an important distinction. Seems like OpenSolaris/Solaris is optimized for the latter with a 32 wide command queue set by default. This completely saturates the SATA disks with IO commands in turn making the system unusable for short periods of time.

Dynamically set the ZFS command queue to 1 to optimize for NCQ:

echo zfs_vdev_max_pending/W0t1 | mdb -kw

And add to /etc/system:

set zfs:zfs_vdev_max_pending=1

Enjoy your OpenSolaris server on cheap SATA disks!

I am not sure how to verify if what he says would be true for my install, how to change that parameter in Ubuntu/Linux, or what the optimal queue depth would be, if anyone has ideas. The OpenZFS docs on command queuing mention that zfs checks for command queue support on Linux with:

hdparm -I /path/to/device | grep Queue

The output for my disks looks like:

        Queue depth: 32
           *    Native Command Queueing (NCQ)
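
On the open question of how to change that parameter on Linux: zfs_vdev_max_pending was a Solaris-era tunable; in OpenZFS on Linux it was superseded by zfs_vdev_max_active and the per-class *_max_active tunables, exposed as module parameters. A sketch (the value 10 is purely illustrative, not a recommendation):

cat /sys/module/zfs/parameters/zfs_vdev_max_active                # current per-vdev cap (default 1000)
echo 10 | sudo tee /sys/module/zfs/parameters/zfs_vdev_max_active # lower it at runtime
echo "options zfs zfs_vdev_max_active=10" | sudo tee /etc/modprobe.d/zfs.conf   # persist across reboots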

I am a bit more convinced that the issue is somehow related to the above, since @bjquinn was able to solve the problem by switching to SAS-capable HBAs, which presumably do SAS-to-SATA translation in his case; it makes me wonder how command queuing is handled with his controllers.

The author of the article also mentioned that his performance issues were mostly attributed to TLER being disabled by default, and the OpenZFS docs on error recovery control mention this as well and recommend writing a script that enables it on every boot with a low value. With the Hitachi/HGST drives that I have, none of them accept values lower than 6.5 seconds, which might be some capability requiring enterprise software from the manufacturer. I would think the capability exists with hdparm (can't find it, if so), but setting ERC with smartctl is done in units of deciseconds, if anyone else needs or wants to set this:

smartctl -l scterc,65,65 /dev/disk
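
The current setting can be read back with the same tool:

sudo smartctl -l scterc /dev/disk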

I'm also starting to suspect the issue may have something to do with enterprise hardware being sold to consumers, where some features come disabled by default given a certain hardware setup, thinking of the choice of controller in @bjquinn's case. I also have some SED drives that require TCG Enterprise to access most of the features (and probably better TLER times), and the SED features are only partially available using hdparm and smartctl.

The Wikipedia page on NCQ is also informative:

NCQ can negatively interfere with the operating system's I/O scheduler, decreasing performance;[8] this has been observed in practice on Linux with RAID-5.[9] There is no mechanism in NCQ for the host to specify any sort of deadlines for an I/O, like how many times a request can be ignored in favor of others. In theory, a queued request can be delayed by the drive an arbitrary amount of time while it is serving other (possibly new) requests under I/O pressure.[8] Since the algorithms used inside drives' firmware for NCQ dispatch ordering are generally not publicly known, this introduces another level of uncertainty for hardware/firmware performance. Tests at Google around 2008 have shown that NCQ can delay an I/O for up to 1–2 seconds. A proposed workaround is for the operating system to artificially starve the NCQ queue sooner in order to satisfy low-latency applications in a timely manner.[10]

On some drives' firmware, such as the WD Raptor circa 2007, read-ahead is disabled when NCQ is enabled, resulting in slower sequential performance.[11]

For the moment I've set my onboard controllers to IDE mode from AHCI, and configured TLER as mentioned above to see if this helps at all.

Shellcat-Zero commented 1 year ago

For the moment I've set my onboard controllers to IDE mode from AHCI, and configured TLER as mentioned above to see if this helps at all.

None of that was helpful. I ended up getting an HBA card as others mentioned here previously, and the errors have been gone now for more than a month. I don't mind the investment too much, but I'd love to know technically why this has become a necessity for ZFS. I never had these issues several LTS-releases ago.

Shellcat-Zero commented 1 year ago

@bjquinn @stuckj Can either of you tell me what you used to flash your HBAs? I'm running Ubuntu 22.04 and I'm having trouble figuring out what utility to use. I have an LSI 9305-16i which has been running great until some recent kernel update bricked it, and I'm hoping a flash will fix it (HBA card currently fails to load its BIOS).

stuckj commented 1 year ago

Yikes. I flashed mine years ago. I don't recall offhand, but I believe I followed a thread on the TrueNAS forums (I was using ESXi + FreeNAS at the time). It may have been this one? https://www.truenas.com/community/threads/detailed-newcomers-guide-to-crossflashing-lsi-9211-9300-9305-9311-9400-94xx-hba-and-variants.55096/page-3

Best of luck.

bjquinn commented 1 year ago

@Shellcat-Zero

This is from my notes, though like @stuckj I haven't tried this in quite some time.

  1. Go to the downloads page for your HBA, i.e. https://www.broadcom.com/products/storage/host-bus-adapters/sas-9300-4i
  2. In the firmware section, find either Installer_P[xx]_for_UEFI or Installer_P[xx]_for_Linux, depending on whether you want to do the update from within Linux (a preinstalled Proxmox, for example) or UEFI.
  3. Extract the sas3flash (or sas3flash.efi for UEFI) binary, and in the case of Linux, make sure to get it for the proper architecture. Copy this to a USB drive (not necessarily bootable, but formatted FAT/FAT32) if you're doing UEFI, or just get it to /usr/src if doing it from within Linux.
  4. Now grab the latest 9300_4i_Package_P[xx]_IR_IT_FW_BIOS_for_MSDOS_Windows file in the firmware section. Open the archive, in the Firmware folder, select the folder with the IT firmware, copy the .bin file to the same place as the flash utility. Do the same for the BIOS .rom file in the sasbios_rel folder.
  5. Boot to UEFI or Linux.
  6. For Linux, run /usr/src/sas3flash -fwall [firmwarefilename.bin] -biosall [biosfilename.rom], or for UEFI run sas3flash.efi -fwall [firmwarefilename.bin] -biosall [biosfilename.rom]. A reboot is probably appropriate at this point. (-fwall and -biosall, as opposed to -f and -b, try to upgrade the firmware/BIOS on ALL attached controllers.)
  7. Confirm that you've updated. Run dmesg | grep mpt and look for a line like the following to see what firmware version you have: [ 1.861810] mpt3sas_cm0: LSISAS3004: FWVersion(16.00.10.00), ChipRevision(0x02), BiosVersion(08.11.00.00)
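
Condensing the Linux path of these notes into commands (the firmware/BIOS filenames are placeholders; take the real ones from your card's download page):

cd /usr/src
./sas3flash -listall                                       # confirm the controller is visible
./sas3flash -fwall SAS9300_4i_IT.bin -biosall mptsas3.rom
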
plantroon commented 1 year ago

This happened to me on a Coreboot Thinkpad T430 with 2x Crucial MX500 (one is in ultrabay) on Debian 11.

Setting link_power_management_policy to max_performance as suggested prevents the error from happening (so far, it's been fine for ~3 days). Strange issue, nonetheless.

This next paragraph is more speculation, as my setup is a bit exotic (a ZFS mirror on a laptop, with one of the drives in the ultrabay): I ran Crucial MX300 disks in this very machine in the same configuration before and didn't have these issues. Also, I noticed that the errors always happened on the MX500 with firmware M3CR033, while the one with M3CR043 didn't malfunction in this way. As far as I know, there's no clear/easy upgrade path to this firmware provided by Crucial.

I really don't like experimenting with SSDs, and if MX300s had been available at the time I was buying these, I'd have gone with them instead. Every one of the recently bought SSDs I've owned had weird issues that required firmware updates: the MX500 has this bug, and the Intel 660p had PCIe passthrough broken until a firmware update. I never had any issue with the MX300 or MX200.

csarn commented 1 year ago

When I ran into the "unaligned write command", I also had an MX500 with firmware M3CR033 running (in a ThinkPad W530). I set link_power_management_policy to max_performance on 2022-10-02, and the problem did not return.

G8EjlKeK7CwVQP2acz2B commented 1 year ago

On Ubuntu 20.04.2 I was unable to achieve stable ZFS operation due to "unaligned" errors and had to move to a 9207-8i LSI HBA. This initially did not work well at all; I think the first such card that I bought had issues, as it was apparently a cheap knockoff. I found a reputable seller who sold me a surplus HP card that is working very well.

Then, ZFS being ZFS, it found some bad sectors and a bad hotswap tray -- which is exactly why I was interested in ZFS in the first place -- but everything is now fixed and my "unaligned" issues are over.

io7m commented 1 year ago

Just another "me too" here.

Three spinning rust drives connected to a consumer-grade motherboard's SATA links:

Model Family:     Seagate BarraCuda 3.5 (CMR)
Device Model:     ST1000DM010-2EP102
Serial Number:    ZN1VFRPD
LU WWN Device Id: 5 000c50 0e41db863
Firmware Version: CC46
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 19 09:16:31 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Model Family:     Toshiba P300 (CMR)
Device Model:     TOSHIBA HDWD110
Serial Number:    92MY48ANS
LU WWN Device Id: 5 000039 fc5e96dcb
Firmware Version: MS2OA9A0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 19 09:17:06 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Model Family:     Western Digital Blue
Device Model:     WDC WD10EZEX-00BBHA0
Serial Number:    WD-WCC6Y7VL8C4H
LU WWN Device Id: 5 0014ee 26a20209f
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 19 09:17:25 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Motherboard:

Base Board Information
    Manufacturer: ASUSTeK COMPUTER INC.
    Product Name: TUF B450-PLUS GAMING

CPU:

model name  : AMD Ryzen 7 3700X 8-Core Processor

Kernel:

Linux services02 6.3.4-201.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Sat May 27 15:08:36 UTC 2023 x86_64 GNU/Linux

All drives consistently pass SMART tests, but a zfs scrub would typically fail on that one Western Digital drive. Errors would be logged in dmesg:

[382950.946984] ata15.00: exception Emask 0x0 SAct 0x1c80000 SErr 0xd0000 action 0x6 frozen
[382950.946992] ata15: SError: { PHYRdyChg CommWake 10B8B }
[382950.946995] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.946997] ata15.00: cmd 61/88:98:e8:13:58/00:00:1c:00:00/40 tag 19 ncq dma 69632 out
                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947004] ata15.00: status: { DRDY }
[382950.947006] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.947007] ata15.00: cmd 61/f8:b0:d0:b6:72/00:00:1f:00:00/40 tag 22 ncq dma 126976 out
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947013] ata15.00: status: { DRDY }
[382950.947014] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.947015] ata15.00: cmd 61/28:b8:40:e6:0b/00:00:5b:00:00/40 tag 23 ncq dma 20480 out
                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947021] ata15.00: status: { DRDY }
[382950.947022] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.947023] ata15.00: cmd 61/08:c0:a0:e4:0b/00:00:5b:00:00/40 tag 24 ncq dma 4096 out
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947029] ata15.00: status: { DRDY }
[382950.947032] ata15: hard resetting link
[382951.406984] ata15: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[382951.409125] ata15.00: configured for UDMA/133
[382951.409147] sd 14:0:0:0: [sdd] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=58s
[382951.409151] sd 14:0:0:0: [sdd] tag#19 Sense Key : Illegal Request [current] 
[382951.409154] sd 14:0:0:0: [sdd] tag#19 Add. Sense: Unaligned write command
[382951.409156] sd 14:0:0:0: [sdd] tag#19 CDB: Write(10) 2a 00 1c 58 13 e8 00 00 88 00
[382951.409158] I/O error, dev sdd, sector 475534312 op 0x1:(WRITE) flags 0x700 phys_seg 10 prio class 2
[382951.409164] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=243472519168 size=69632 flags=40080c80
[382951.409179] sd 14:0:0:0: [sdd] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=58s
[382951.409181] sd 14:0:0:0: [sdd] tag#22 Sense Key : Illegal Request [current] 
[382951.409183] sd 14:0:0:0: [sdd] tag#22 Add. Sense: Unaligned write command
[382951.409185] sd 14:0:0:0: [sdd] tag#22 CDB: Write(10) 2a 00 1f 72 b6 d0 00 00 f8 00
[382951.409186] I/O error, dev sdd, sector 527611600 op 0x1:(WRITE) flags 0x700 phys_seg 30 prio class 2
[382951.409189] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=270136090624 size=126976 flags=40080c80
[382951.409197] sd 14:0:0:0: [sdd] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=31s
[382951.409199] sd 14:0:0:0: [sdd] tag#23 Sense Key : Illegal Request [current] 
[382951.409201] sd 14:0:0:0: [sdd] tag#23 Add. Sense: Unaligned write command
[382951.409203] sd 14:0:0:0: [sdd] tag#23 CDB: Write(10) 2a 00 5b 0b e6 40 00 00 28 00
[382951.409204] I/O error, dev sdd, sector 1527506496 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 2
[382951.409207] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=782082277376 size=20480 flags=180880
[382951.409218] sd 14:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=30s
[382951.409220] sd 14:0:0:0: [sdd] tag#24 Sense Key : Illegal Request [current] 
[382951.409222] sd 14:0:0:0: [sdd] tag#24 Add. Sense: Unaligned write command
[382951.409223] sd 14:0:0:0: [sdd] tag#24 CDB: Write(10) 2a 00 5b 0b e4 a0 00 00 08 00
[382951.409224] I/O error, dev sdd, sector 1527506080 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 2
[382951.409227] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=782082064384 size=4096 flags=180880
[382951.409230] ata15: EH complete

Setting the SATA links to max_performance caused the problem to go away. The first scrub after setting the new power setting found a couple of checksum errors that were corrected.

emwgit commented 1 year ago

Same for me: I've been suffering from this issue for years now with various Samsung EVO SSDs (latest firmware). I tried "everything" (stepwise... until all changes were enabled together)... still not OK. Most changes made it "better for some days or weeks", but eventually the "unaligned write command" returned.

P.S.: Found this information linking the "Unaligned Write Command" to "Zoned Block Devices":

Since I'm not an expert in this domain: can someone in this forum comment on this? Is there a possibility that Samsung EVO SSDs exhibit this zoned-block-device behavior? How does ZFS deal with zoned block devices? From the past, I have in mind that e.g. SMR devices are not suited for ZFS... is this still true? Nevertheless: only SSDs in my case...

stuckj commented 1 year ago

Just another "me too" here.

Three spinning rust drives connected to a consumer-grade motherboard's SATA links:

Model Family:     Seagate BarraCuda 3.5 (CMR)
Device Model:     ST1000DM010-2EP102
Serial Number:    ZN1VFRPD
LU WWN Device Id: 5 000c50 0e41db863
Firmware Version: CC46
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 19 09:16:31 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Model Family:     Toshiba P300 (CMR)
Device Model:     TOSHIBA HDWD110
Serial Number:    92MY48ANS
LU WWN Device Id: 5 000039 fc5e96dcb
Firmware Version: MS2OA9A0
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 19 09:17:06 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Model Family:     Western Digital Blue
Device Model:     WDC WD10EZEX-00BBHA0
Serial Number:    WD-WCC6Y7VL8C4H
LU WWN Device Id: 5 0014ee 26a20209f
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5417
ATA Version is:   ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 19 09:17:25 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Motherboard:

Base Board Information
  Manufacturer: ASUSTeK COMPUTER INC.
  Product Name: TUF B450-PLUS GAMING

CPU:

model name    : AMD Ryzen 7 3700X 8-Core Processor

Kernel:

Linux services02 6.3.4-201.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Sat May 27 15:08:36 UTC 2023 x86_64 GNU/Linux

All drives consistently pass SMART tests, but a zfs scrub would typically fail on that one Western Digital drive consistently. Errors would be logged in dmesg:

[382950.946984] ata15.00: exception Emask 0x0 SAct 0x1c80000 SErr 0xd0000 action 0x6 frozen
[382950.946992] ata15: SError: { PHYRdyChg CommWake 10B8B }
[382950.946995] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.946997] ata15.00: cmd 61/88:98:e8:13:58/00:00:1c:00:00/40 tag 19 ncq dma 69632 out
                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947004] ata15.00: status: { DRDY }
[382950.947006] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.947007] ata15.00: cmd 61/f8:b0:d0:b6:72/00:00:1f:00:00/40 tag 22 ncq dma 126976 out
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947013] ata15.00: status: { DRDY }
[382950.947014] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.947015] ata15.00: cmd 61/28:b8:40:e6:0b/00:00:5b:00:00/40 tag 23 ncq dma 20480 out
                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947021] ata15.00: status: { DRDY }
[382950.947022] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.947023] ata15.00: cmd 61/08:c0:a0:e4:0b/00:00:5b:00:00/40 tag 24 ncq dma 4096 out
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947029] ata15.00: status: { DRDY }
[382950.947032] ata15: hard resetting link
[382951.406984] ata15: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[382951.409125] ata15.00: configured for UDMA/133
[382951.409147] sd 14:0:0:0: [sdd] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=58s
[382951.409151] sd 14:0:0:0: [sdd] tag#19 Sense Key : Illegal Request [current] 
[382951.409154] sd 14:0:0:0: [sdd] tag#19 Add. Sense: Unaligned write command
[382951.409156] sd 14:0:0:0: [sdd] tag#19 CDB: Write(10) 2a 00 1c 58 13 e8 00 00 88 00
[382951.409158] I/O error, dev sdd, sector 475534312 op 0x1:(WRITE) flags 0x700 phys_seg 10 prio class 2
[382951.409164] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=243472519168 size=69632 flags=40080c80
[382951.409179] sd 14:0:0:0: [sdd] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=58s
[382951.409181] sd 14:0:0:0: [sdd] tag#22 Sense Key : Illegal Request [current] 
[382951.409183] sd 14:0:0:0: [sdd] tag#22 Add. Sense: Unaligned write command
[382951.409185] sd 14:0:0:0: [sdd] tag#22 CDB: Write(10) 2a 00 1f 72 b6 d0 00 00 f8 00
[382951.409186] I/O error, dev sdd, sector 527611600 op 0x1:(WRITE) flags 0x700 phys_seg 30 prio class 2
[382951.409189] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=270136090624 size=126976 flags=40080c80
[382951.409197] sd 14:0:0:0: [sdd] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=31s
[382951.409199] sd 14:0:0:0: [sdd] tag#23 Sense Key : Illegal Request [current] 
[382951.409201] sd 14:0:0:0: [sdd] tag#23 Add. Sense: Unaligned write command
[382951.409203] sd 14:0:0:0: [sdd] tag#23 CDB: Write(10) 2a 00 5b 0b e6 40 00 00 28 00
[382951.409204] I/O error, dev sdd, sector 1527506496 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 2
[382951.409207] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=782082277376 size=20480 flags=180880
[382951.409218] sd 14:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=30s
[382951.409220] sd 14:0:0:0: [sdd] tag#24 Sense Key : Illegal Request [current] 
[382951.409222] sd 14:0:0:0: [sdd] tag#24 Add. Sense: Unaligned write command
[382951.409223] sd 14:0:0:0: [sdd] tag#24 CDB: Write(10) 2a 00 5b 0b e4 a0 00 00 08 00
[382951.409224] I/O error, dev sdd, sector 1527506080 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 2
[382951.409227] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=782082064384 size=4096 flags=180880
[382951.409230] ata15: EH complete

Setting the SATA links to max_performance caused the problem to go away. The first scrub after setting the new power setting found a couple of checksum errors that were corrected.

I've definitely run into this issue myself, but I would make sure you don't have a bad HDD first since it can also result in this problem. It's suspicious if it's only happening for one drive.

Passing the smart self-checks doesn't mean the drive is good. If you're always getting errors on the WD drive, I would check the smart attributes for any critical attributes that have bad values. E.g., Raw_Read_Error_Rate, Read_Soft_Error_Rate, Reported_Uncorrect, Hardware_ECC_Recovered, Current_Pending_Sector, or UDMA_CRC_Error_Count.

See, e.g. https://www.thomas-krenn.com/en/wiki/Analyzing_a_Faulty_Hard_Disk_using_Smartctl, for an example of how to diagnose a bad HDD.

Also, look at the power on hours in the smart attributes. Some NAS drives can last 5-7 years, but it's not a guarantee. Cheaper drives generally have a shorter lifetime (though not always). I believe WD Blue is on the cheaper side.

andoo391 commented 1 year ago

Have had the same problem. I have KNOWN GOOD drives that would have a failing SATA link, causing unaligned write errors. The problem was very noticeable when the drives were put in a ZFS pool; non-ZFS write loads didn't seem to kill the link, weirdly. Using the built-in SATA controller on both my B450 motherboards caused the same errors.

None of the fixes in this thread worked, not even the link power management settings. Maybe for Samsung drives there are fixes, but this seems to be its own issue.

FIX: use a third-party HBA. I threw the drives on an LSI 9211 and it has been rock solid for weeks, not even a hiccup. Or use a non-AMD controller. Far too many issues already, and now this.

Motherboards: Gigabyte B450 AORUS M, MSI B450M PRO-M2
Drives: HGST HUS724040ALA640 4TB, Seagate Exos X20 18TB (ST18000NM003D-3DL103)

io7m commented 1 year ago

Passing the smart self-checks doesn't mean the drive is good. If you're always getting errors on the WD drive, I would check the smart attributes for any critical attributes that have bad values. E.g., Raw_Read_Error_Rate, Read_Soft_Error_Rate, Reported_Uncorrect, Hardware_ECC_Recovered, Current_Pending_Sector, or UDMA_CRC_Error_Count.

To be clear, I've still not seen any errors since setting the link power management policy.

SMART attributes look good:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   183   183   021    Pre-fail  Always       -       1808
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2529
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       104
194 Temperature_Celsius     0x0022   111   108   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
stuckj commented 1 year ago

Oh yeah, that looks like a relatively new drive.

lorenz commented 1 year ago

Just FYI the Unaligned Write error is a bug in Linux's libata:

https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/

Basically libata is not implementing the SAT (SCSI-to-ATA) translation correctly and ends up converting ATA errors it shouldn't, producing nonsense. This is still a real error (probably a link reset or similar), but the SCSI conversion is broken. This only applies if SAT is not done by a controller/storage system, but by Linux.

bmilde commented 1 year ago

@lorenz Interesting, that might explain why I've gotten the unaligned write error too.

I've tried everything from updating SSD firmware, maximum link power settings, and disabling trim, to upgrading the Linux kernel and downgrading the port speeds to 3 Gbps. Nothing worked; slower speeds just meant it took longer for the problem to resurface.

What has likely fixed it now is a cable replacement. I noticed that only 2 drives were failing and they had a different brand of cables. The new cables are thicker too, so they probably have much better shielding. Definitely try replacing your SATA cables if you're seeing this error!

Side note: ZFS didn't handle random writes erroring out gracefully; the pool was broken beyond repair and I had to rebuild from backups.

meyergru commented 1 year ago

I am in the same boat here. The system is an X570-based mainboard with FCH controllers plus an ASMedia ASM1166. The drives are 8x WD Ultrastar 18 TB, 6 of which are known good because they ran on an ARECA RAID controller for 2 years with no problems. The drives are in MBP155 cages, which certainly adds another potential point of contact failure. Everything is running under Proxmox 8.0.4 with kernel 6.2.x.

I had ZFS errors galore and tried:

  1. Exchanging SATA cables - no dice.
  2. Suspected that the ASM1166 is problematic. Ordered a JMB585, not yet arrived.
  3. Limiting SATA speed to 3 Gbit/s, since the former ARECA did only SATA II. Not really a viable option for the ASM1166: dmesg first says the speed is limited, but afterwards it is raised to 6 Gbit/s again. It works for the FCH controller, though.
  4. In the meantime, I found that the cage could have been the culprit, as all affected drives were in that cage. Changed the MOLEX splitter cables to direct power supply connections.
  5. After that, everything SEEMED to work, but now I get errors even on the FCH-connected drives.
  6. I have now set link_power_management_policy to max_performance; scrubbing right now and hoping for the best.

Also, I see those errors turning up only after a few hours of time (or lots of data) has passed.

Using the JMB585 could still be no option, even if now drives on the motherboard controller show these errors, because I can probably limit SATA speed with that controller, which was impossible with the ASM1166. I will try that as a last-but-one resort if limiting link power does not resolve this.

I hate the thought of having to use an HBA that consumes more power.

P.S.: The JMB585 can be limited to 3 Gbps. Otherwise, no change; I still get errors on random disks. Have ordered an LSI 9211-8i now. This increasingly points to a real problem in the interaction between libata and ZFS.

P.P.S.: I disabled NCQ and the problem is gone. I did not bother to try the LSI controller. Will follow up with some insights.

meyergru commented 1 year ago

OpenZFS for Linux problem with libata - root cause identified?

Just to reiterate what I wrote about this here: I have a Linux box with 8 WDC 18 TByte SATA drives, 4 of which are connected through the mainboard controllers (AMD FCH variants) and 4 through an ASMedia ASM1166. They form a raidz2 running under Proxmox with a 6.2 kernel. During my nightly backups, the drives would regularly fail, and errors showed up in the logs - more often than not "unaligned write errors".

First thing to note is that one poster in the thread mentioned that "Unaligned write" is a bug in libata, in that "other" errors are mapped onto this one in the SCSI translation code (https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/). Thus, the error message itself is meaningless.

In the thread, several possible remedies were offered, such as:

  1. Faulty SATA cables (I replaced them all, no change, but I admit this could be the problem in some cases)
  2. Faulty disks (Mine were known to be good, and also, errors were randomly distributed among them)
  3. Power saving in the SATA link or the PCI bus (disabling this did not help)
  4. Problematic controllers (Both the FCH and the ASM1166 chips as well as a JMB585 showed the same behaviour)
  5. Limiting SATA speed to SATA 3.0 Gbps or even to 1.5 Gbps (3.0 Gbps did not help, and was not even possible with the ASM1166 as the speed was always reset to 6.0 Gbps, but I could check with FCH and JMB585 controllers)
  6. Disabling NCQ (guess what, this helped! See the sketch after this list.)
  7. Replacing the SATA controllers with an LSI 9211-8i (I guess this would have helped, as others have reported, because it probably does not use NCQ)
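
For reference, a hedged sketch of the two usual ways to disable NCQ (sdX and the port number are placeholders):

# At runtime: forcing the queue depth to 1 effectively disables NCQ for one disk
echo 1 | sudo tee /sys/block/sdX/device/queue_depth
# At boot, via the kernel command line: disable NCQ on all ports...
libata.force=noncq
# ...or only on a single port, e.g. ata3
libata.force=3:noncq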

I am 99% sure that it boils down to a bad interaction between OpenZFS and libata with NCQ enabled, and I have a theory why this is so: when you look at how NCQ works, it is a queue of up to 32 (or, to be exact, 31 for implementation reasons) tasks that can be handed to the disk drive. Those tasks can be handled in any order by the drive hardware, e.g. in order to minimize seek times. Thus, when you give the drive 3 tasks, like "read sectors 1, 42 and 2", the drive might decide to reorder them and read sector 42 last, saving one seek in the process.

Now imagine a time of high I/O pressure, like when I do my nightly backups. OpenZFS has queues of its own which feed the drives, and for each task started, OpenZFS expects a result (though in no particular order). However, when a task returns, it opens up a slot in the NCQ queue, which is immediately filled with another task because of the high I/O pressure. That means sector 42 could potentially never be read at all, provided other tasks keep being prioritized higher by the drive hardware.

I believe this is exactly what is happening: if a task's result is not received within the expected time frame, a timeout with an unspecific error occurs.

This is the result of putting one (or more) quite large queues within OpenZFS before a smaller hardware queue (NCQ).

It explains why both solutions 6 and probably 7 from my list above cure the problem: Without NCQ, every task must first be finished before the next one can be started. It also explains why this problem is not as evident with other filesystems - were this a general problem with libata, it would have been fixed long ago.
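
If the queue-behind-a-queue picture is right, the pressure could also be relieved from the OpenZFS side; a minimal sketch using OpenZFS 2.x module parameters (the value is purely illustrative, not a tested recommendation):

# Aggregate per-vdev limit OpenZFS enforces above the device's NCQ queue
cat /sys/module/zfs/parameters/zfs_vdev_max_active
# Lower the cap so fewer requests can pile up behind the 31-slot NCQ queue
echo 10 | sudo tee /sys/module/zfs/parameters/zfs_vdev_max_active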

I would even guess reducing SATA speed to 1.5 Gbps would help (one guy reported this) - I bet this is simply because the resulting speed of ~150 MByte/s is somewhat lower than what modern hard disks can sustain, such that the disk can always finish tasks before the next one is started, whereas 3 Gbps is still faster than modern spinning rust.

If I am right, two things should be considered:

a. The problem should be analysed and fixed in a better way, like throttling the libata NCQ queue if pressure gets too high, just before timeouts are thrown. This would give the drive time to finish existing tasks.
b. There should be a warning or some kind of automatism to disable NCQ for OpenZFS for the time being.

I also think that the performance impact of disabling NCQ with OpenZFS is probably negligible, because OpenZFS has prioritized queues for different operations anyway.

dd1dd1 commented 1 year ago

(I am the OP, I have some experience with linux kernel drivers, and embedded firmware development)

I like how you wrote it all up, but I doubt you can bring closure to this problem. IMO, the list of "remedies" is basically snake oil; if any of them were "a solution", this bug would have been closed a long time ago.

I think "NCQ timeout" does not explain this problem: (a) if NCQ specs permitted a queued command to be postponed indefinitely (and cause a timeout), it would be a serious bug in the NCQ specs, unlikely, but one has to read them carefully to be sure. if specs permitted it, it would likely have caused trouble in places other than ZFS, people would have complained, specs would have been fixed. of course it is always possible that specific HDD firmware has bugs and can postpone some queued commands indefinitely (and cause a timeout), even if specs do not permit it. (b) if queued command timeout was "it", disabling NCQ would have been "the solution", and the best we can tell, it is not.

I think we now have to wait for the libata bug fix to make it into production kernels; then we will see what the actual error is. "Unaligned write command" never made sense to me, and now we know it is most likely bogus.

K.O.

meyergru commented 1 year ago

I did not imply that NCQ itself allows a command to be left unfinished indefinitely. A command can only be postponed by the hardware in that it may reorder the queued commands in any way it likes. This is just how NCQ works.

Thus, indefinite postponing can only occur if someone "pressures" the queue consistently - the drive is free to reorder new incoming commands and intersperse them with previous ones; as a matter of fact, there is no difference between issuing 32 commands in short succession and issuing a few more only after some have finished. Call that behaviour a design flaw, but I think it exists, and the problem in question surfaces only when some other conditions are met.

And I strongly believe that OpenZFS can cause exactly that situation, especially with write patterns of raidz under high I/O pressure. I doubt that this bug would occur with other filesystems where no such complex patterns from several internal queues ever happen.

As to why the "fixes" worked sometimes (or seemed to have worked): as I said, #6 and #7 both disable NCQ. Reducing the speed to 1.5 Gbps will most likely reduce the I/O pressure enough to make the problem go away, and the other solutions may help people who really do have hardware problems.

Also, so far I have read of nobody who tried disabling NCQ without doing something else alongside (e.g. reducing speed as well). I refrained from disabling NCQ first only because I thought it would hurt performance - which it did not. Thus, my experiments ruled out one potential cause after another, leaving only the disabling of NCQ as the effective cure. I admit that I probably should wait a few more nights before jumping to conclusions; however, these problems were consistent with every setup I tried so far. (P.S.: It has now been three days in a row with no problems occurring.)

Nothing written here nor anything I have tried so far refutes my theory.

I agree there is a slight chance of my WDC drives having a problem with NCQ in the first place - I have seen comments on some Samsung SSDs having that problem with certain firmware revisions. But that would not have gone unnoticed, I bet.

m6100 commented 1 year ago

Just FYI the Unaligned Write error is a bug in Linux's libata:

https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/

Unfortunately, this patch was never applied, and the issue got no further attention after a short discussion. There also seems to have been no other cleanup on this topic; at least I couldn't find anything related in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/drivers/ata/libata-scsi.c

Basically libata is not implementing the SAT (SCSI-to-ATA) translation correctly and ends up converting ATA errors it shouldn't, producing nonsense. This is still a real error (probably a link reset or similar), but the SCSI conversion is broken.

I think you are correct on this. I'm also seeing the error on a system where I put a new 6.0 Gbps hard disk with 512-byte sectors into an older cartridge with old cabling (going to replace this). For this drive, zpool status lists 164 write errors and a degraded state after writing about 100 GB. The hard disk's UDMA_CRC_Error_Count SMART raw value increased from 0 to 3, but it otherwise has no problems. The dmesg output also indicates a prior interface/bus error that is then decoded as an unaligned write on the tagged command:

[18701828.321386] ata4.00: exception Emask 0x10 SAct 0x3c00002 SErr 0x400100 action 0x6 frozen
[18701828.322053] ata4.00: irq_stat 0x08000000, interface fatal error
[18701828.322652] ata4: SError: { UnrecovData Handshk }
[18701828.323256] ata4.00: failed command: WRITE FPDMA QUEUED
[18701828.323885] ata4.00: cmd 61/58:08:b8:da:63/00:00:48:00:00/40 tag 1 ncq dma 45056 out
                           res 40/00:08:b8:da:63/00:00:48:00:00/40 Emask 0x10 (ATA bus error)
[...]
[18701830.479670] sd 4:0:0:0: [sdo] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
[18701830.479679] sd 4:0:0:0: [sdo] tag#1 Sense Key : Illegal Request [current]
[18701830.479687] sd 4:0:0:0: [sdo] tag#1 Add. Sense: Unaligned write command
[18701830.479696] sd 4:0:0:0: [sdo] tag#1 CDB: Write(16) 8a 00 00 00 00 00 48 63 da b8 00 00 00 58 00 00
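
A quick, hedged way to correlate such errors with the link itself is to compare the CRC counter before and after a heavy write (the device name is a placeholder); a raw value that climbs under load points at cabling or connectors rather than the platters:

# SMART attribute 199 counts CRC errors on the SATA link
smartctl -A /dev/sdX | grep -i UDMA_CRC_Error_Count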

lorenz commented 1 year ago

The reason it never got applied is mostly that, as it turns out, this is a deeper architectural issue with libata: there is no valid SCSI error code here. Sadly, I'm not familiar enough with the Linux SCSI midlayer to implement the necessary changes.

CRC errors are not the only type of link error. You are probably losing the SATA link, which causes a reset/retraining cycle, which is one of the known things libata doesn't handle correctly.

ngrigoriev commented 5 months ago

I have just got this problem on a brand new WD Red WD40EFPX. I bought it yesterday to replace the failed mirror drive. Getting these errors while ZFS is resilvering it.

No question of a faulty controller, cable, or anything else hardware-related: the system worked for a very long time, and the component I replaced is the failed disk. The new disk is unlikely to be bad. One way or another, it is related to the new disk's interaction with the old system.

lorenz commented 5 months ago

This is not really a ZFS issue; it's a hardware/firmware issue being handled badly by the Linux kernel's SCSI subsystem. This issue should probably be closed here.

@ngrigoriev These errors are in pretty much all cases hardware/firmware-related. Post the kernel log if you want me to take a look at the issue.
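
For anyone gathering that log, a hedged one-liner that pulls out just the ATA and SCSI error lines (adjust the pattern to taste):

# Kernel messages for the ATA layer and the tagged SCSI commands
journalctl -k | grep -E 'ata[0-9]|Sense|tag#'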

ngrigoriev commented 5 months ago

@lorenz

I was able to finish the resilvering process, but at the very end it started failing again. And this was with "libata.force=3.0 libata.force=noncq"

Right after reboot:

[   74.500345] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[   74.500372] ata3.00: failed command: WRITE DMA EXT
[   74.500391] ata3.00: cmd 35/00:08:e8:28:00/00:00:16:00:00/e0 tag 23 dma 4096 out
                        res 40/00:20:58:6f:05/00:00:00:00:00/e0 Emask 0x4 (timeout)
[   74.500409] ata3.00: status: { DRDY }
[   74.500423] ata3: hard resetting link
[   74.976320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[   74.977298] ata3.00: configured for UDMA/133
[   74.977339] sd 2:0:0:0: [sdb] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=31s
[   74.977347] sd 2:0:0:0: [sdb] tag#23 Sense Key : Illegal Request [current]
[   74.977354] sd 2:0:0:0: [sdb] tag#23 Add. Sense: Unaligned write command
[   74.977363] sd 2:0:0:0: [sdb] tag#23 CDB: Write(16) 8a 00 00 00 00 00 16 00 28 e8 00 00 00 08 00 00
[   74.977373] blk_update_request: I/O error, dev sdb, sector 369109224 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[   74.977403] zio pool=******** vdev=/dev/disk/by-id/ata-WDC_WD40EFPX-68C6CN0_WD-WX12D14P6EH0-part1 error=5 type=2 offset=188982874112 size=4096 flags=180880
[   74.977433] ata3: EH complete

Drive:

Device Model:     WDC WD40EFPX-68C6CN0
Serial Number:    WD-WX12D14P6EH0
LU WWN Device Id: 5 0014ee 216332baf
Firmware Version: 81.00A81
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Sun Jun  9 09:50:07 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0022   111   107   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       8
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

I understand that it is not ZFS's fault directly, but, interestingly enough, it is triggered by ZFS specifically. This machine has 5 HDDs; none of them demonstrated this issue for years. It only started happening with this new WD Red drive that replaced the failed one.

ngrigoriev commented 5 months ago

Tried everything I have read about: all combinations of the options, libata.force (noncq, 3.0, even 1.5). Nothing really worked. After a couple of hours at most, even after a successful resilvering of the entire drive, a bunch of errors would just appear. I also noticed that if I set the speed to 1.5 Gbps, other drives on this controller start getting similar problems.
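
One thing apparently not yet tried: libata.force accepts an optional port (or port.device) ID, so the limit can be pinned to just the new drive instead of the whole controller. A sketch, with the port number as a placeholder (check dmesg for which ataN the drive sits on):

# Kernel command line: limit only port ata5 to 1.5 Gbps and disable NCQ there
libata.force=5:1.5Gbps,5:noncq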

SCSI power settings are forced to max_power for all hosts.

I have a combination of different drives in this home NAS; some are 6 Gbps, some are 3.0.

What I am trying now: I have connected this new drive to the second SATA port on the motherboard instead of the PCIe SATA controller, and I have removed all libata settings. So far so good; keeping fingers crossed. If that does not help, then I am out of options. I have already changed the cable to be sure. Another controller? Well, I have effectively already tried two different ones: ASMedia and the onboard Intel one.

lorenz commented 5 months ago

Basically, what's happening is that your disk does not respond within 7 or 10 s (the Linux ATA command timeout) to a write command the kernel sent. ATA does not have a good way to abort commands (SCSI and NVMe do), so the kernel "aborts" the command by resetting the link. The problem is that the link reset is improperly implemented in Linux, resulting in spurious write errors and bogus error codes: Linux does not automatically retry the outstanding I/O requests, instead failing them with an improper error code, as this situation is not standardized behavior and as such doesn't have a proper error code.

Unless you've hot-plugged the disk, I suspect you either have a cabling issue or one side of the link (the SATA controller or the disk controller) is bad, as we see CRC errors on the link. This would explain the weird timeouts, as the command might have been dropped due to a bad CRC.

ngrigoriev commented 5 months ago

Yes, it seems so, and it only happens under heavy write activity, apparently.

Is there a way to control this timeout? (I understand this is not the place to ask this kind of question :( )
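
For what it's worth, the per-device SCSI command timer is exposed in sysfs; a hedged sketch (sdX is a placeholder, and whether this particular timer is the one firing here is not certain):

# Current command timeout in seconds (the SCSI-layer default is 30)
cat /sys/block/sdX/device/timeout
# Raise it to 60 s; this does not persist across reboots
echo 60 | sudo tee /sys/block/sdX/device/timeout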