Open dd1dd1 opened 4 years ago
I have two disks out of 13 that are repeatably plagued with this error -- one WD30EFRX, one a brand new WD30EFZX. The other 11 seem to operate reliably in both resilver and scrub. Go figure...
Currently using dual 88SE9230 PCIe 4-port SATA cards (started with one card and the onboard AMD 400 series chipset). I tried changing to a 9207-8i controller -- based on reports of success with lsilogic -- mainly to test whether a different driver would avoid the issue -- and that was a disaster; I could not get Ubuntu 20.04 to even reliably recognize the presence of all the disks.
Tried noncq in the libata driver; no impact on the bug.
I'm sorely tempted to dive into the driver and at least try to log the sata command history to get some info about what exactly is going south, but I am in the middle of a house remodel and don't have time for all this.
I am also seeing these same errors on an X10SDV. 4x HDDs connected to the first 4 SATA ports: no problem. 2x Samsung 870 QVO: errors occur after heavy I/O. They were occurring immediately after installation, so it seems unlikely that I received two defective disks. Things that did not help:
@georgewhewell
Volumes/datasets ? Compression on (type, if on) ?
Just a dataset, no compression:
$ sudo zfs get -o "all"
fpool/root/Home type filesystem - -
fpool/root/Home creation Fri Apr 2 13:30 2021 - -
fpool/root/Home used 2.83T - -
fpool/root/Home available 10.5T - -
fpool/root/Home referenced 2.82T - -
fpool/root/Home compressratio 1.00x - -
fpool/root/Home mounted yes - -
fpool/root/Home quota none - default
fpool/root/Home reservation none - default
fpool/root/Home recordsize 128K - default
fpool/root/Home mountpoint legacy legacy received
fpool/root/Home sharenfs off - default
fpool/root/Home checksum on - default
fpool/root/Home compression off - default
fpool/root/Home atime on - default
fpool/root/Home devices on - default
fpool/root/Home exec on - default
fpool/root/Home setuid on - default
fpool/root/Home readonly off - default
fpool/root/Home zoned off - default
fpool/root/Home snapdir hidden - default
fpool/root/Home aclmode discard - default
fpool/root/Home aclinherit restricted - default
fpool/root/Home createtxg 1485157 - -
fpool/root/Home canmount on - default
fpool/root/Home xattr on - default
fpool/root/Home copies 1 - default
fpool/root/Home version 5 - -
fpool/root/Home utf8only off - -
fpool/root/Home normalization none - -
fpool/root/Home casesensitivity sensitive - -
fpool/root/Home vscan off - default
fpool/root/Home nbmand off - default
fpool/root/Home sharesmb off - default
fpool/root/Home refquota none - default
fpool/root/Home refreservation none - default
fpool/root/Home guid 1183886347832676596 - -
fpool/root/Home primarycache all - default
fpool/root/Home secondarycache all - default
fpool/root/Home usedbysnapshots 16.1G - -
fpool/root/Home usedbydataset 2.82T - -
fpool/root/Home usedbychildren 0B - -
fpool/root/Home usedbyrefreservation 0B - -
fpool/root/Home logbias latency - default
fpool/root/Home objsetid 100886 - -
fpool/root/Home dedup off - default
fpool/root/Home mlslabel none - default
fpool/root/Home sync disabled disabled received
fpool/root/Home dnodesize legacy - default
fpool/root/Home refcompressratio 1.00x - -
fpool/root/Home written 0 - -
fpool/root/Home logicalused 2.81T - -
fpool/root/Home logicalreferenced 2.80T - -
fpool/root/Home volmode default - default
fpool/root/Home filesystem_limit none - default
fpool/root/Home snapshot_limit none - default
fpool/root/Home filesystem_count none - default
fpool/root/Home snapshot_count none - default
fpool/root/Home snapdev hidden - default
fpool/root/Home acltype off - default
fpool/root/Home context none - default
fpool/root/Home fscontext none - default
fpool/root/Home defcontext none - default
fpool/root/Home rootcontext none - default
fpool/root/Home relatime on - temporary
fpool/root/Home redundant_metadata all - default
fpool/root/Home overlay on - default
fpool/root/Home encryption off - default
fpool/root/Home keylocation none - default
fpool/root/Home keyformat none - default
fpool/root/Home pbkdf2iters 0 - default
fpool/root/Home special_small_blocks 0 - default
fpool/root/Home nixos:shutdown-time Fri 3 Jun 12:34:55 BST 2022 - inherited from fpool
I see. Anyway, I think this is a Samsung problem (something specific to their internals). I have only had problems with this vendor. I use ZFS with Adaptec RAID, Hynix, Seagate, WD, Intel, LVM, LUKS, etc., and have problems only with Samsung.
It's not a Samsung problem. Several of us (including me) saw it with other devices, though Samsung has certainly had SSD firmware issues. I also have an X10SDV @georgewhewell. Updating to the latest Proxmox 7 has so far fixed it for me; at least, I haven't seen it since updating a few months ago. I'm running 7.2-4 now (this kernel specifically: Linux 5.13.19-6-pve #1 SMP PVE 5.13.19-15). Not sure if you're using Proxmox or not. If not, then perhaps just updating to a later 5.x kernel will help you on whatever distro you're on.
Ran into this issue a couple of days ago on Rocky Linux 8.6 with ZFS 2.1.5.
Migrated data off LVMRAID, created a ZFS raidz1 on the same disks and rsynced the data back. During rsync one of the disks was marked FAULTED by ZFS.
Smartctl didn't indicate any hardware errors. Ran "zpool clear"; during the resilver, all disks in the vdev had CKSUM errors.
Checked dmesg, all disks in the vdev reported read or write issues. Seemed unlikely that all drives were going bad at the same time, also unlikely that all SATA cables were faulty.
Dmesg also contained reports of "Unaligned write command" and Google led me to this issue, https://github.com/openzfs/zfs/issues/10094#issuecomment-623603031 had the clue I needed:
echo max_performance | sudo tee /sys/class/scsi_host/host*/link_power_management_policy
I use the tuned profile "balanced" in Rocky Linux; it turns out that profile sets ALPM to "medium_power", so I changed that to "max_performance".
Restarted rsync, after completion (no more errors) I have run multiple scrubs with zero errors. Disabling link power management seems to have solved the problem for me.
EDIT
Hardware used:
AMD Ryzen 5700X CPU
Gigabyte X570S UD motherboard
Kingston KSM32ED8 ECC memory
Western Digital WD Red Plus WD40EFZX disks
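Since this case came in via tuned: for anyone who wants to keep a tuned profile active rather than relying on a raw echo or a udev rule, here is a hedged sketch of a custom profile that inherits "balanced" but pins ALPM to max_performance. The profile name and path are my own choices; the scsi_host plugin's alpm option is what the stock profiles use.

$ sudo mkdir -p /etc/tuned/balanced-no-alpm
$ sudo tee /etc/tuned/balanced-no-alpm/tuned.conf <<'EOF'
[main]
include=balanced

[scsi_host]
alpm=max_performance
EOF
$ sudo tuned-adm profile balanced-no-alpm

The advantage over echoing into sysfs is that tuned re-applies the setting on boot and whenever profiles are switched.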
I got fed up with it and replaced my Samsung SSD 850 PRO 1TB with a WDS200T2B0A-00SM50, same cables and all. Did a full scrub and some random write benchmarks to be sure. And guess what, no more failures. I guess it's something in the Samsung firmware that does not like what ZFS specifically does with it, because the drive is fine and takes whatever fio shenanigans I throw at the raw block device. It is now happily serving as an external drive for a PS5.
I use the tuned profile "balanced" in Rocky Linux, turns out it sets ALPM to "medium_power", changed that to "max_performance".
Apparently this is a must for ZFS, at least with SATA. This should be documented.
Agreed, though this still doesn't explain the problem for those of us who did that early on (or don't even have Samsung devices) and saw the same problems on multiple brand-new SSDs simultaneously. Happily, I still haven't seen problems since updating to Proxmox 7.1... hoping I haven't just been lucky so far. :-P
I have met the same issue. Running 3 servers, each with 2TB Crucial MX500 SSDs x 2 running in RAID0 via mdadm.
I am not using ZFS, but the same "Sense: Unaligned write command" error comes out randomly. The OS is Ubuntu Server 20.04 with kernel 5.15.0-46-generic.
kernel: [269471.003856] ata2.00: exception Emask 0x10 SAct 0x20000000 SErr 0x2c0100 action 0x6 frozen
kernel: [269471.003887] ata2.00: irq_stat 0x08000000, interface fatal error
kernel: [269471.003896] ata2: SError: { UnrecovData CommWake 10B8B BadCRC }
kernel: [269471.003916] ata2.00: failed command: READ FPDMA QUEUED
kernel: [269471.003926] ata2.00: cmd 60/00:e8:00:0d:04/02:00:00:00:00/40 tag 29 ncq dma 262144 in
kernel: [269471.003926] res 40/00:ec:00:0d:04/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
kernel: [269471.003951] ata2.00: status: { DRDY }
kernel: [269471.003965] ata2: hard resetting link
kernel: [269471.479725] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
kernel: [269471.480710] ata2.00: supports DRM functions and may not be fully accessible
kernel: [269471.482862] ata2.00: supports DRM functions and may not be fully accessible
kernel: [269471.483668] ata2.00: configured for UDMA/133
kernel: [269471.493807] sd 1:0:0:0: [sda] tag#29 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
kernel: [269471.493810] sd 1:0:0:0: [sda] tag#29 Sense Key : Illegal Request [current]
kernel: [269471.493812] sd 1:0:0:0: [sda] tag#29 Add. Sense: Unaligned write command
kernel: [269471.493814] sd 1:0:0:0: [sda] tag#29 CDB: Read(10) 28 00 00 04 0d 00 00 02 00 00
kernel: [269471.493814] blk_update_request: I/O error, dev sda, sector 265472 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0
kernel: [269471.493833] ata2: EH complete
But a strange thing on my servers is that this error ONLY occurs on /dev/sda of each server.
In my environment, every time this error happens, the UDMA_CRC_Error_Count in SMART increases.

| Server | Device | UDMA CRC errors |
|---|---|---|
| server-1 | /dev/sda | 109 |
| server-1 | /dev/sdb | 0 |
| server-2 | /dev/sda | 4 |
| server-2 | /dev/sdb | 0 |
| server-3 | /dev/sda | 1 |
| server-3 | /dev/sdb | 0 |
So it only occurs on /dev/sda (the difference in counts between servers is caused by the running time: server-1 >> server-2 > server-3).
I have no idea how to solve this problem. I have tried swapping SATA cables; that did not work.
I will try setting link_power_management_policy to max_performance and see the result.
By the way, the following change is not permanent; it will be lost after a reboot:
echo max_performance | sudo tee /sys/class/scsi_host/host*/link_power_management_policy
For a permanent change, add a file to /etc/udev/rules.d, name it something like 60-scsi.rules, and give it content like the following:
KERNEL=="host[0-2]", SUBSYSTEM=="scsi_host", ATTR{link_power_management_policy}="max_performance"
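One note on that rule: KERNEL=="host[0-2]" only matches the first three SCSI hosts. On systems with more controllers or ports, a broader match along these lines (my variant, not from the original comment) should cover them all:

KERNEL=="host*", SUBSYSTEM=="scsi_host", ATTR{link_power_management_policy}="max_performance"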
Do you know what SATA controller(s) your servers use? Given that it's not ZFS at all (and assuming it's not a hardware problem on all three), that seems to point to something in the kernel, such as the controller driver.
Here you are
lspci result
$ sudo lspci -v -s 05:00.0
05:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81) (prog-if 01 [AHCI 1.0])
Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
Flags: bus master, fast devsel, latency 0, IRQ 38
Memory at fcd01000 (32-bit, non-prefetchable) [size=2K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [64] Express Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/2 Maskable- 64bit+
Capabilities: [d0] SATA HBA v1.0
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [270] Secondary PCI Express
Capabilities: [400] Data Link Feature <?>
Capabilities: [410] Physical Layer 16.0 GT/s <?>
Capabilities: [440] Lane Margining at the Receiver <?>
Kernel driver in use: ahci
Kernel modules: ahci
$ sudo lspci -v -s 05:00.1
05:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81) (prog-if 01 [AHCI 1.0])
Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode]
Flags: bus master, fast devsel, latency 0, IRQ 45
Memory at fcd00000 (32-bit, non-prefetchable) [size=2K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Capabilities: [64] Express Endpoint, MSI 00
Capabilities: [a0] MSI: Enable+ Count=2/2 Maskable- 64bit+
Capabilities: [d0] SATA HBA v1.0
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Kernel driver in use: ahci
Kernel modules: ahci
I think this is a SATA Controller embedded in Ryzen 5000 mobile series
And here is some other information:

| Server | CPU | Memory | SSD | SSD Firmware |
|---|---|---|---|---|
| server-1 | AMD Ryzen 5900HX | Essencore KD4BGSA8C-32N220D DDR4-3200 32GBx2 | Crucial CT2000MX500SSD1 x2 | M3CR033 |
| server-2 | AMD Ryzen 5900HX | Essencore KD4BGSA8C-32N220D DDR4-3200 32GBx2 | Crucial CT2000MX500SSD1 x2 | M3CR043 |
| server-3 | AMD Ryzen 5900HX | Crucial CT32G4SFD832A DDR4-3200 32GBx2 | Crucial CT2000MX500SSD1 x2 | M3CR045 |
Here are some of my analyses and guesses.
From this thread, I found that these issues have some common points.
So my guess is that there may be some issue with software RAID interacting with ALPM (for example, not knowing what to do when a disk in the RAID is put to sleep by ALPM).
Anyway, I will check whether changing ALPM to max_performance solves this issue or not.
And the ALPM document from Red Hat has some interesting information. At the end of that document, it says:
Setting ALPM to min_power or medium_power will automatically disable the "Hot Plug" feature.
So, if you are running RAID1, RAID5, etc. and need to hot-swap a failed disk, you need to set ALPM to max_performance anyway.
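For reference (my addition, not from the Red Hat document): the policy currently in effect on each host can be checked with:

$ cat /sys/class/scsi_host/host*/link_power_management_policy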
I had the problem with max performance set. For me, the problem only happened on SSDs (which I think was a common theme on this thread). And went away completely when not using the onboard SATA. It now also seemed to go away completely after I updated to the latest proxmox kernel. But, you're on a newer kernel than what I'm running (5.13.19-15 PVE kernel) and hit the issue.
Ummm... it seems many factors cause this error, or you just forgot to make the max_performance setting permanent and rebooted :D
I will monitor for a few weeks to see whether the max_performance setting solves the problem in my case and report back here.
@stuckj I just want to mention that I only had the issue with HDDs and my SSDs worked fine, so I don't think that SSDs are the common factor. This was on half a dozen different servers.
I do agree that the problem completely went away when I abandoned the onboard SATA for an LSI HBA.
I tried a lot of things, but I'm not sure I ever tried the max performance thing. I now have the latest Proxmox, but I have already made the hardware changes with the HBA, and I'm not interested in messing with the hardware to trigger this problem again, so I don't know whether latest Proxmox would solve my problem or not. :)
After a month of monitoring, I am now sure that setting link_power_management_policy to max_performance solved the Unaligned write command issue completely, at least in my environment.
I also confirmed that the issue comes back when link_power_management_policy is turned back to the default value.
Just for your information.
Thanks for that information, I'm setting that right now!
If I remember to do so, I can also report back if that fixed it for me. But I got the errors infrequently, so to definitely consider it fixed I'll have to wait a few months.
By the way, are there better ways to set the link_power_management_policy than via a systemd oneshot service? That is what I have done.
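For anyone curious what the oneshot approach looks like, here is a minimal sketch of such a unit, assuming a file at /etc/systemd/system/sata-alpm.service (the name and path are my own choices; the udev rule posted earlier in the thread is probably the cleaner alternative):

[Unit]
Description=Force SATA link power management to max_performance

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo max_performance | tee /sys/class/scsi_host/host*/link_power_management_policy'

[Install]
WantedBy=multi-user.target

Enable it with: sudo systemctl enable --now sata-alpm.service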
Also encountered this. I set link_power_management_policy to max_performance as above, as well as setting the drive settings in the gnome-disks GUI to never power down and disabling NCQ. Doing so has indeed stabilised this issue, showing this is indeed a software configuration issue.
I had already ruled out the SATA controller itself as being the issue, as the same hard drive encountered this same issue, no matter whether it was plugged into the motherboard's SATA ports or the HBA card's ports.
Seems that https://bugzilla.kernel.org/show_bug.cgi?id=203475 is related: Samsung EVO 860/870 Firmware is reported to have issues with NCQ + trim.
The latest comment in that bug mentions that updating the drive firmware fixes the issue (for Samsung EVO 860/870 drives). Is it safe to do the drive firmware update without taking the whole machine down, by just offlining one drive at a time with zpool offline?
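In principle that is what zpool offline/online is for, provided the vdev has enough redundancy to tolerate one missing member. A hedged sketch of the sequence (pool and device names are placeholders; check zpool status first to confirm the pool only goes DEGRADED, not UNAVAIL, with one disk missing):

$ sudo zpool offline tank ata-Samsung_SSD_870_EVO_XXXXXXXX
(update the drive firmware; power-cycle the drive if the updater requires it)
$ sudo zpool online tank ata-Samsung_SSD_870_EVO_XXXXXXXX
$ sudo zpool status tank    # wait for the resilver to finish before offlining the next drive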
Very informative read; a bit long, but worth the time. I was struggling to isolate some errors I was seeing. Given that smartmontools' smartctl was reporting ICRC ABRT errors for multiple drives, I took the test effort back to the lowest common level and started testing. The interesting thing was that it seemed like a power-level issue or a backplane board issue in my storage frame, but as it turned out it was one of the internal cables from the backplane board (1 of 2) to the eSATA port transition out of the storage frame case, which is an 8-bay eSATA tower.
For those that might be interested, below is the test methodology I used...
WARNING... I DID NOT NEED TO MAINTAIN THE DATA IN THE GIVEN STORAGE FRAME.... The following steps will overwrite existing data, be careful.
I added a multiplex PCIe card/adapter; it did not solve the issue. So the mainboard eSATA ports and the PCIe card/adapter both seem fine, but I was still getting random read/write errors. So I disabled NCQ; still errors.
Then replaced the external cables from server to storage frame, still errors.
Crossed the internal SATA cables of the backplane, since it was a split design with two boards, each supporting 4 devices. The issue at first seemed to move from backplane 1 to backplane 2, but as I did a bit more testing, I started getting reports of bad sectors on drives I believed fine, drives that passed various SMART tests via smartmontools.
I pulled some additional drives from my spare parts and swapped 4 drives on backplane 1; now errors were reported on the drives just swapped in. So I swapped in more drives; still errors on backplane 2 as well. Did more SMART tests; the drives seemed fine, and when used on a different system there were no errors.
So I replaced the internal SATA cables as well, and things seemed to stabilize. Then I used 'dd' to do an exhaustive random write to all sectors, first with just the drives on backplane 1; no errors. About 2 hours later, still OK. That is a good sign.
Then I did the same test with the same set of now-believed-good drives and the new internal SATA cables, first connected to backplane 1 and then to backplane 2; even better, no errors. So now I knew the external cables and internal cables seemed good, and the backplane boards seemed good.
After 2 hours of constant random writes to all 8 drives, still no errors. I will let the test continue for a couple more hours, but when errors occurred before, it only took minutes to about an hour to get 10s to 100s of errors across all 8 drives. This also pushes the power supply on the storage frame, since all 8 drives are racing to slam data to sectors exhaustively. Oh, minor, but I did confirm that write cache was off during the dd tests. You might want to make sure you set write cache on or off depending on your use/test case; in my case, I need data saved, not performance, so write cache stays off.
I still need to enable NCQ, to confirm everything is completely legit. Just to do that last step of validation. Of course, setting up an mdadm RAID set, say 1 RAID 5 set per back plane, or setting up a ZFS pool per back plane or across all 8 drives would also work as a test scenario, but using 'dd' was easy. And I wanted to make sure the power supply was stable.
Even using fio would be applicable, now that I think about it.
Why the errors? It seems the internal SATA cables from the backplanes to the external eSATA port sockets just aged badly. The inside of the case gets pretty warm, even with fans and venting, and the airflow from the power supply and drives, you guessed it, goes right through the internal SATA cables on its way out. The case has rear fan exhaust but no top exhaust; if it were possible, I would add a vent or, even better, a top exhaust fan.
Hope those that find this, find it helpful.
I only started seeing this error after upgrading from Ubuntu 18.04 to 22.04 and rebuilding the pool. I've replaced SATA cables and some disks, and the issue persists. I suspect it has something to do with how non-Solaris ZFS currently handles NCQ. I stumbled across this thread, which quotes this article from 2009:
SATA disks do Native Command Queuing while SAS disks do Tagged Command Queuing, this is an important distinction. Seems like OpenSolaris/Solaris is optimized for the latter with a 32 wide command queue set by default. This completely saturates the SATA disks with IO commands in turn making the system unusable for short periods of time.
Dynamically set the ZFS command queue to 1 to optimize for NCQ:
echo zfs_vdev_max_pending/W0t1 | mdb -kw
And add to /etc/system:
set zfs:zfs_vdev_max_pending=1
Enjoy your OpenSolaris server on cheap SATA disks!
I am not sure how to verify whether what he says would be true for my install, how to change that parameter in Ubuntu/Linux, or what the optimal queue depth would be, if anyone has ideas (see the sketch after the hdparm output below). The OpenZFS docs on command queuing mention that ZFS checks for command queue support on Linux with:
hdparm -I /path/to/device | grep Queue
The output for my disks looks like:
Queue depth: 32
* Native Command Queueing (NCQ)
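On the question above of how to change that queue-depth parameter on Linux: as far as I know, zfs_vdev_max_pending no longer exists in OpenZFS; the per-vdev queue depths are governed by the zfs_vdev_*_max_active module parameters instead. A hedged sketch of inspecting and lowering the async write queue at runtime (the parameter choice and the value are illustrative, not a recommendation from this thread):

$ grep . /sys/module/zfs/parameters/zfs_vdev_*_max_active
$ echo 2 | sudo tee /sys/module/zfs/parameters/zfs_vdev_async_write_max_active

To persist across reboots, the same setting can go into a modprobe option, e.g. "options zfs zfs_vdev_async_write_max_active=2" in /etc/modprobe.d/zfs.conf.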
I am a bit more convinced that the issue is somehow related to the above, since @bjquinn was able to solve the problem by switching to SAS-capable HBAs, which presumably use a SAS-to-SATA adapter in his case, which makes me wonder how command queuing is handled with his controllers.
The author of the article also mentioned that his performance issues were mostly attributed to TLER being disabled by default, and the OpenZFS docs on error recovery control mention this as well and recommend writing a script that enables it on every boot with a low value. With the Hitachi/HGST drives that I have, none of them accept values lower than 6.5 seconds, which might be some capability requiring enterprise software from the manufacturer. I would think the capability exists with hdparm (can't find it, if so), but setting ERC with smartctl is done in units of deciseconds, if anyone else needs or wants to set this:
smartctl -l scterc,65,65 /dev/disk
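To read back the current ERC setting (my addition), the same option without values works:

$ sudo smartctl -l scterc /dev/disk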
I'm also starting to suspect the issue may have something to do with enterprise hardware being sold to consumers, where some features come disabled by default given a certain hardware setup, thinking of the choice of controller in @bjquinn's case. I also have some SED drives that require TCG Enterprise to access most of the features (and probably better TLER times), and the SED features are only partially available using hdparm and smartctl.
The Wikipedia page on NCQ is also informative:
NCQ can negatively interfere with the operating system's I/O scheduler, decreasing performance;[8] this has been observed in practice on Linux with RAID-5.[9] There is no mechanism in NCQ for the host to specify any sort of deadlines for an I/O, like how many times a request can be ignored in favor of others. In theory, a queued request can be delayed by the drive an arbitrary amount of time while it is serving other (possibly new) requests under I/O pressure.[8] Since the algorithms used inside drives' firmware for NCQ dispatch ordering are generally not publicly known, this introduces another level of uncertainty for hardware/firmware performance. Tests at Google around 2008 have shown that NCQ can delay an I/O for up to 1–2 seconds. A proposed workaround is for the operating system to artificially starve the NCQ queue sooner in order to satisfy low-latency applications in a timely manner.[10]
On some drives' firmware, such as the WD Raptor circa 2007, read-ahead is disabled when NCQ is enabled, resulting in slower sequential performance.[11]
For the moment I've set my onboard controllers to IDE mode from AHCI, and configured TLER as mentioned above to see if this helps at all.
None of that was helpful. I ended up getting an HBA card as others mentioned here previously, and the errors have been gone now for more than a month. I don't mind the investment too much, but I'd love to know technically why this has become a necessity for ZFS. I never had these issues several LTS-releases ago.
@bjquinn @stuckj Can either of you tell me what you used to flash your HBAs? I'm running Ubuntu 22.04 and I'm having trouble figuring out what utility to use. I have an LSI 9305-16i which has been running great until some recent kernel update bricked it, and I'm hoping a flash will fix it (HBA card currently fails to load its BIOS).
Yikes. I flashed mine years ago. I don't recall offhand, but I believe I followed a thread on the TrueNAS forums (I was using ESXi + FreeNAS at the time). It may have been this one? https://www.truenas.com/community/threads/detailed-newcomers-guide-to-crossflashing-lsi-9211-9300-9305-9311-9400-94xx-hba-and-variants.55096/page-3
Best of luck.
@Shellcat-Zero
This is from my notes, though like @stuckj I haven't tried this in quite some time.
This happened to me on a Coreboot Thinkpad T430 with 2x Crucial MX500 (one is in ultrabay) on Debian 11.
Setting link_power_management_policy to max_performance as suggested prevents the error from happening (so far, it's been fine for ~3 days). Strange issue, nonetheless.
This next paragraph is more of a speculation, as my setup is a bit exotic (a ZFS mirror on a laptop with one of these drives in the ultrabay): I ran Crucial MX300 disks in this very machine in the same configuration before and didn't have these issues. Also, I noticed that the errors always happened on the MX500 with firmware M3CR033, while the one with M3CR043 didn't malfunction in this way. As far as I know, there's no clear/easy upgrade path to this firmware provided by Crucial.
I really don't like experimenting around with SSDs and if MX300 were available at the time I was buying these, I'd have gone with them instead. Every one of those recently bought SSDs I owned had weird issues that required firmware updates. MX500 has this bug, Intel 660p had PCIe passthrough broken until a firmware update. Never had any issue with MX300 or MX200.
When I ran into the "unaligned write command", I also had an MX500 with firmware M3CR033 running (in a ThinkPad W530). I set link_power_management_policy to max_performance on 2022-10-02, and the problem did not return.
On Ubuntu 20.04.2 I was unable to achieve stable ZFS operation due to the "unaligned" errors and had to move to a 9207-8i LSI HBA. This initially did not work well at all; I think the first such card that I bought had issues, as it was apparently a cheap Asian knockoff. I found a reputable seller who sold me a surplus HP card that is working very well.
Then, ZFS being ZFS, it found some bad sectors and a bad hotswap tray -- which is exactly why I was interested in ZFS in the first place -- but everything is now fixed and my "unaligned" issues are over.
Just another "me too" here.
Three spinning rust drives connected to a consumer-grade motherboard's SATA links:
Model Family: Seagate BarraCuda 3.5 (CMR)
Device Model: ST1000DM010-2EP102
Serial Number: ZN1VFRPD
LU WWN Device Id: 5 000c50 0e41db863
Firmware Version: CC46
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5417
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Jul 19 09:16:31 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Model Family: Toshiba P300 (CMR)
Device Model: TOSHIBA HDWD110
Serial Number: 92MY48ANS
LU WWN Device Id: 5 000039 fc5e96dcb
Firmware Version: MS2OA9A0
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5417
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Jul 19 09:17:06 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Model Family: Western Digital Blue
Device Model: WDC WD10EZEX-00BBHA0
Serial Number: WD-WCC6Y7VL8C4H
LU WWN Device Id: 5 0014ee 26a20209f
Firmware Version: 01.01A01
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5417
ATA Version is: ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Jul 19 09:17:25 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Motherboard:
Base Board Information
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: TUF B450-PLUS GAMING
CPU:
model name : AMD Ryzen 7 3700X 8-Core Processor
Kernel:
Linux services02 6.3.4-201.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Sat May 27 15:08:36 UTC 2023 x86_64 GNU/Linux
All drives consistently pass SMART tests, but a zfs scrub would typically fail on that one Western Digital drive. Errors would be logged in dmesg:
[382950.946984] ata15.00: exception Emask 0x0 SAct 0x1c80000 SErr 0xd0000 action 0x6 frozen
[382950.946992] ata15: SError: { PHYRdyChg CommWake 10B8B }
[382950.946995] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.946997] ata15.00: cmd 61/88:98:e8:13:58/00:00:1c:00:00/40 tag 19 ncq dma 69632 out
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947004] ata15.00: status: { DRDY }
[382950.947006] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.947007] ata15.00: cmd 61/f8:b0:d0:b6:72/00:00:1f:00:00/40 tag 22 ncq dma 126976 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947013] ata15.00: status: { DRDY }
[382950.947014] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.947015] ata15.00: cmd 61/28:b8:40:e6:0b/00:00:5b:00:00/40 tag 23 ncq dma 20480 out
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947021] ata15.00: status: { DRDY }
[382950.947022] ata15.00: failed command: WRITE FPDMA QUEUED
[382950.947023] ata15.00: cmd 61/08:c0:a0:e4:0b/00:00:5b:00:00/40 tag 24 ncq dma 4096 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[382950.947029] ata15.00: status: { DRDY }
[382950.947032] ata15: hard resetting link
[382951.406984] ata15: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[382951.409125] ata15.00: configured for UDMA/133
[382951.409147] sd 14:0:0:0: [sdd] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=58s
[382951.409151] sd 14:0:0:0: [sdd] tag#19 Sense Key : Illegal Request [current]
[382951.409154] sd 14:0:0:0: [sdd] tag#19 Add. Sense: Unaligned write command
[382951.409156] sd 14:0:0:0: [sdd] tag#19 CDB: Write(10) 2a 00 1c 58 13 e8 00 00 88 00
[382951.409158] I/O error, dev sdd, sector 475534312 op 0x1:(WRITE) flags 0x700 phys_seg 10 prio class 2
[382951.409164] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=243472519168 size=69632 flags=40080c80
[382951.409179] sd 14:0:0:0: [sdd] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=58s
[382951.409181] sd 14:0:0:0: [sdd] tag#22 Sense Key : Illegal Request [current]
[382951.409183] sd 14:0:0:0: [sdd] tag#22 Add. Sense: Unaligned write command
[382951.409185] sd 14:0:0:0: [sdd] tag#22 CDB: Write(10) 2a 00 1f 72 b6 d0 00 00 f8 00
[382951.409186] I/O error, dev sdd, sector 527611600 op 0x1:(WRITE) flags 0x700 phys_seg 30 prio class 2
[382951.409189] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=270136090624 size=126976 flags=40080c80
[382951.409197] sd 14:0:0:0: [sdd] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=31s
[382951.409199] sd 14:0:0:0: [sdd] tag#23 Sense Key : Illegal Request [current]
[382951.409201] sd 14:0:0:0: [sdd] tag#23 Add. Sense: Unaligned write command
[382951.409203] sd 14:0:0:0: [sdd] tag#23 CDB: Write(10) 2a 00 5b 0b e6 40 00 00 28 00
[382951.409204] I/O error, dev sdd, sector 1527506496 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 2
[382951.409207] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=782082277376 size=20480 flags=180880
[382951.409218] sd 14:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=30s
[382951.409220] sd 14:0:0:0: [sdd] tag#24 Sense Key : Illegal Request [current]
[382951.409222] sd 14:0:0:0: [sdd] tag#24 Add. Sense: Unaligned write command
[382951.409223] sd 14:0:0:0: [sdd] tag#24 CDB: Write(10) 2a 00 5b 0b e4 a0 00 00 08 00
[382951.409224] I/O error, dev sdd, sector 1527506080 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 2
[382951.409227] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=782082064384 size=4096 flags=180880
[382951.409230] ata15: EH complete
Setting the SATA links to max_performance caused the problem to go away. The first scrub after setting the new power setting found a couple of checksum errors that were corrected.
Same for me: suffering from this issue for years now with various Samsung EVO SSDs (latest FW). Tried "everything" (stepwise ... until all enabled together)... still not OK. Most changes made it "better for some days or weeks", but finally the "unaligned write command" returned.
P.S.: Found this information linking the "Unaligned Write Command" to "Zoned Block Devices":
Since I'm not an expert in this domain: can someone in this forum comment on this? Is there a possibility that Samsung EVO SSDs exhibit this zoned-block-device behavior? How does ZFS deal with zoned block devices? From the past, I have in mind that e.g. SMR devices are not suited for ZFS... is this still true? Nevertheless, it's only SSDs in my case.
Just another "me too" here.
Three spinning rust drives connected to a consumer-grade motherboard's SATA links:
Model Family: Seagate BarraCuda 3.5 (CMR) Device Model: ST1000DM010-2EP102 Serial Number: ZN1VFRPD LU WWN Device Id: 5 000c50 0e41db863 Firmware Version: CC46 User Capacity: 1,000,204,886,016 bytes [1.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Form Factor: 3.5 inches Device is: In smartctl database 7.3/5417 ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Wed Jul 19 09:16:31 2023 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled Model Family: Toshiba P300 (CMR) Device Model: TOSHIBA HDWD110 Serial Number: 92MY48ANS LU WWN Device Id: 5 000039 fc5e96dcb Firmware Version: MS2OA9A0 User Capacity: 1,000,204,886,016 bytes [1.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Form Factor: 3.5 inches Device is: In smartctl database 7.3/5417 ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Wed Jul 19 09:17:06 2023 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled Model Family: Western Digital Blue Device Model: WDC WD10EZEX-00BBHA0 Serial Number: WD-WCC6Y7VL8C4H LU WWN Device Id: 5 0014ee 26a20209f Firmware Version: 01.01A01 User Capacity: 1,000,204,886,016 bytes [1.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Form Factor: 3.5 inches Device is: In smartctl database 7.3/5417 ATA Version is: ACS-3 T13/2161-D revision 3b SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Wed Jul 19 09:17:25 2023 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled
Motherboard:
Base Board Information Manufacturer: ASUSTeK COMPUTER INC. Product Name: TUF B450-PLUS GAMING
CPU:
model name : AMD Ryzen 7 3700X 8-Core Processor
Kernel:
Linux services02 6.3.4-201.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Sat May 27 15:08:36 UTC 2023 x86_64 GNU/Linux
All drives consistently pass SMART tests, but a
zfs scrub
would typically fail on that one Western Digital drive consistently. Errors would be logged in dmesg:[382950.946984] ata15.00: exception Emask 0x0 SAct 0x1c80000 SErr 0xd0000 action 0x6 frozen [382950.946992] ata15: SError: { PHYRdyChg CommWake 10B8B } [382950.946995] ata15.00: failed command: WRITE FPDMA QUEUED [382950.946997] ata15.00: cmd 61/88:98:e8:13:58/00:00:1c:00:00/40 tag 19 ncq dma 69632 out res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [382950.947004] ata15.00: status: { DRDY } [382950.947006] ata15.00: failed command: WRITE FPDMA QUEUED [382950.947007] ata15.00: cmd 61/f8:b0:d0:b6:72/00:00:1f:00:00/40 tag 22 ncq dma 126976 out res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [382950.947013] ata15.00: status: { DRDY } [382950.947014] ata15.00: failed command: WRITE FPDMA QUEUED [382950.947015] ata15.00: cmd 61/28:b8:40:e6:0b/00:00:5b:00:00/40 tag 23 ncq dma 20480 out res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [382950.947021] ata15.00: status: { DRDY } [382950.947022] ata15.00: failed command: WRITE FPDMA QUEUED [382950.947023] ata15.00: cmd 61/08:c0:a0:e4:0b/00:00:5b:00:00/40 tag 24 ncq dma 4096 out res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [382950.947029] ata15.00: status: { DRDY } [382950.947032] ata15: hard resetting link [382951.406984] ata15: SATA link up 6.0 Gbps (SStatus 133 SControl 300) [382951.409125] ata15.00: configured for UDMA/133 [382951.409147] sd 14:0:0:0: [sdd] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=58s [382951.409151] sd 14:0:0:0: [sdd] tag#19 Sense Key : Illegal Request [current] [382951.409154] sd 14:0:0:0: [sdd] tag#19 Add. Sense: Unaligned write command [382951.409156] sd 14:0:0:0: [sdd] tag#19 CDB: Write(10) 2a 00 1c 58 13 e8 00 00 88 00 [382951.409158] I/O error, dev sdd, sector 475534312 op 0x1:(WRITE) flags 0x700 phys_seg 10 prio class 2 [382951.409164] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=243472519168 size=69632 flags=40080c80 [382951.409179] sd 14:0:0:0: [sdd] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=58s [382951.409181] sd 14:0:0:0: [sdd] tag#22 Sense Key : Illegal Request [current] [382951.409183] sd 14:0:0:0: [sdd] tag#22 Add. Sense: Unaligned write command [382951.409185] sd 14:0:0:0: [sdd] tag#22 CDB: Write(10) 2a 00 1f 72 b6 d0 00 00 f8 00 [382951.409186] I/O error, dev sdd, sector 527611600 op 0x1:(WRITE) flags 0x700 phys_seg 30 prio class 2 [382951.409189] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=270136090624 size=126976 flags=40080c80 [382951.409197] sd 14:0:0:0: [sdd] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=31s [382951.409199] sd 14:0:0:0: [sdd] tag#23 Sense Key : Illegal Request [current] [382951.409201] sd 14:0:0:0: [sdd] tag#23 Add. Sense: Unaligned write command [382951.409203] sd 14:0:0:0: [sdd] tag#23 CDB: Write(10) 2a 00 5b 0b e6 40 00 00 28 00 [382951.409204] I/O error, dev sdd, sector 1527506496 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 2 [382951.409207] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=782082277376 size=20480 flags=180880 [382951.409218] sd 14:0:0:0: [sdd] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=30s [382951.409220] sd 14:0:0:0: [sdd] tag#24 Sense Key : Illegal Request [current] [382951.409222] sd 14:0:0:0: [sdd] tag#24 Add. 
Sense: Unaligned write command [382951.409223] sd 14:0:0:0: [sdd] tag#24 CDB: Write(10) 2a 00 5b 0b e4 a0 00 00 08 00 [382951.409224] I/O error, dev sdd, sector 1527506080 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 2 [382951.409227] zio pool=storage vdev=/dev/sdd1 error=5 type=2 offset=782082064384 size=4096 flags=180880 [382951.409230] ata15: EH complete
Setting the SATA links to
max_performance
caused the problem to go away. The first scrub after setting the new power setting found a couple of checksum errors that were corrected.
I've definitely run into this issue myself, but I would make sure you don't have a bad HDD first since it can also result in this problem. It's suspicious if it's only happening for one drive.
Passing the smart self-checks doesn't mean the drive is good. If you're always getting errors on the WD drive, I would check the SMART attributes for any critical attributes that have bad values, e.g. Raw_Read_Error_Rate, Read_Soft_Error_Rate, Reported_Uncorrect, Hardware_ECC_Recovered, Current_Pending_Sector, or UDMA_CRC_Error_Count.
See, e.g. https://www.thomas-krenn.com/en/wiki/Analyzing_a_Faulty_Hard_Disk_using_Smartctl, for an example of how to diagnose a bad HDD.
Also, look at the power on hours in the smart attributes. Some NAS drives can last 5-7 years, but it's not a guarantee. Cheaper drives generally have a shorter lifetime (though not always). I believe WD Blue is on the cheaper side.
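A quick way to pull just those attributes for a suspect drive (my own grep pattern; adjust the device name):

$ sudo smartctl -A /dev/sdd | grep -E 'Raw_Read_Error_Rate|Reported_Uncorrect|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count|Power_On_Hours'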
Have had the same problem. I have KNOWN GOOD drives that would have a failing SATA link, causing unaligned write errors. The problem was very noticeable when the drives were put in a ZFS pool. Non-ZFS write loads didn't seem to kill the link, weirdly. Using the built-in SATA controller on both my B450 motherboards caused the same errors.
None of the fixes in this thread worked, not even the link power management settings. Maybe for Samsung drives there are fixes, but this seems to be its own issue
FIX: Use a 3rd party HBA. I threw them on an LSI 9211 and it has been rock solid for weeks, not even a hiccup. Or, use a non-AMD controller. Far too many issues already, and now this.
Motherboards: Gigabyte B450 AORUS M, MSI B450M PRO-M2
Drives: HGST HUS724040ALA640 4TB, Seagate Exos X20 18TB (ST18000NM003D-3DL103)
Passing the smart self-checks doesn't mean the drive is good. If you're always getting errors on the WD drive, I would check the smart attributes for any critical attributes that have bad values. E.g., Raw_Read_Error_Rate, Read_Soft_Error_Rate, Reported_Uncorrect, Hardware_ECC_Recovered, Current_Pending_Sector, or UDMA_CRC_Error_Count.
To be clear, I've still not seen any errors since setting the link rate.
SMART attributes look good:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 183 183 021 Pre-fail Always - 1808
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 6
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 2529
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 6
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 1
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 104
194 Temperature_Celsius 0x0022 111 108 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
Oh yeah, that looks like a relatively new drive.
Just FYI the Unaligned Write error is a bug in Linux's libata:
https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/
Basically libata is not implementing the SAT (SCSI-to-ATA) translation correctly and ends up converting ATA errors it shouldn't, producing nonsense. This is still a real error (probably a link reset or similar), but the SCSI conversion is broken. This only applies if SAT is not done by a controller/storage system, but by Linux.
@lorenz Interesting, that might explain why I've gotten the unaligned write error too.
I've tried everything from updating ssd firmware, maximum link power settings, to no trim, upgrading the linux kernel, downgrading the port speeds to 3gbps. Nothing worked, slower speed just meant it took longer for the problem to resurface.
What has likely fixed it now is a cable replacement. I noticed that only 2 drives were failing and they had a different brand of cables. The new cables are thicker too, so probably they have much better shielding. Definitely try to replace your sata cable if you're seeing this error!
Side note: ZFS didn't handle the random write errors gracefully; the pool was broken beyond repair and I had to rebuild from backups.
I am in the same boat here. The system is an X570-based mainboard with FCH controllers plus an ASMedia ASM1166. The drives are 8 WD Ultrastar 18 TB, 6 of which are known good because they ran on an Areca RAID controller for 2 years with no problems. The drives are in MBP155 cages, which certainly adds another potential contact-point failure. Everything is running under Proxmox 8.0.4 with kernel 6.2.x.
I had ZFS errors galore and tried:
Also, I see those errors turning up only after a few hours of time (or lots of data) has passed.
Using the JMB585 could still be an option, even though drives on the motherboard controller now show these errors as well, because I can probably limit the SATA speed with that controller, which was impossible with the ASM1166. I will try that as a last-but-one resort if limiting link power does not resolve this.
I hate the thought of having to use a HBA adapter consuming more power.
P.S.: The JMB585 can be limited to 3 Gbps. Otherwise, no change; I still get errors on random disks. I have ordered an LSI 9211-8i now. However, this points to a real problem in the interaction between libata and ZFS.
P.P.S: I disabled NCQ and the problem is gone. I did not bother to try the LSI controller. Will follow up with some insights.
Just to reiterate what I wrote about this here: I have a Linux box with 8 WDC 18 TByte SATA drives, 4 of which are connected through the mainboard controllers (AMD FCH variants) and 4 through an ASMedia ASM1166. They form a raidz2 running under Proxmox with a 6.2 kernel. During my nightly backups, the drives would regularly fail and errors showed up in the logs, more often than not "unaligned write errors".
First thing to note is that one poster in the thread mentioned that the "Unaligned write" is a bug in libata, in that "other" errors are mapped to this one in the scsi translation code (https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/). Thus, the error itself is meaningless.
In the thread, several possible remedies were offered, such as:
I am 99% sure that it boils down to a bad interaction between OpenZFS and libata with NCQ enabled, and I have a theory why this is so: when you look at how NCQ works, it is a queue of up to 32 (or, to be exact, 31 for implementation reasons) tasks that can be given to the disk drive. Those tasks can be handled in any order by the drive hardware, e.g. in order to minimize seek times. Thus, when you give the drive 3 tasks, like "read sectors 1, 42 and 2", the drive might decide to reorder them and read sector 42 last, saving one seek in the process.
Now imagine a time of high I/O pressure, like when I do my nightly backups. OpenZFS has some queues of its own which are then fed to the drives, and for each task started, OpenZFS expects a result (but in no particular order). However, when a task returns, it opens up a slot in the NCQ queue, which is immediately filled with another task because of the high I/O pressure. That means that sector 42 could potentially never be read at all, provided that other tasks are prioritized higher by the drive hardware.
I believe this is exactly what is happening, and if one task's result is not received within the expected time frame, a timeout with an unspecific error occurs.
This is the result of putting one (or more) quite large queues within OpenZFS before a smaller hardware queue (NCQ).
It explains why both solutions 6 and probably 7 from my list above cure the problem: Without NCQ, every task must first be finished before the next one can be started. It also explains why this problem is not as evident with other filesystems - were this a general problem with libata, it would have been fixed long ago.
I would even guess reducing SATA speed to 1.5 Gbps would help (one guy reported this) - I bet this is simply because the resulting speed of ~150 MByte/s is somewhat lower than modern hard disks, such that the disk can always finish tasks before the next one is started, whereas 3 Gpbs is still faster than modern spinning rust.
If I am right, two things should be considered:
a. The problem should be analysed and fixed in a better way, like throttling the libata NCQ queue if pressure gets too high, just before timeouts are thrown. This would give the drive time to finish existing tasks.
b. There should be a warning or some kind of automatism to disable NCQ for OpenZFS for the time being.
I also think that the performance impact of disabling NCQ with OpenZFS is probably negligible, because OpenZFS has prioritized queues for different operations anyway.
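For anyone who wants to test the NCQ theory, the two usual knobs (my summary, not from the comment above; sdX is a placeholder) are the per-device queue depth and the global libata option:

$ echo 1 | sudo tee /sys/block/sdX/device/queue_depth

Setting queue_depth to 1 disables NCQ for that device until reboot; alternatively, booting with libata.force=noncq on the kernel command line disables NCQ globally.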
(I am the OP, I have some experience with linux kernel drivers, and embedded firmware development)
I like how you wrote it all up, but I doubt you can bring closure to this problem. IMO, the list of "remedies" is basically snake oil; if any of these remedies were "a solution", this bug would have been closed a long time ago.
I think "NCQ timeout" does not explain this problem: (a) if NCQ specs permitted a queued command to be postponed indefinitely (and cause a timeout), it would be a serious bug in the NCQ specs, unlikely, but one has to read them carefully to be sure. if specs permitted it, it would likely have caused trouble in places other than ZFS, people would have complained, specs would have been fixed. of course it is always possible that specific HDD firmware has bugs and can postpone some queued commands indefinitely (and cause a timeout), even if specs do not permit it. (b) if queued command timeout was "it", disabling NCQ would have been "the solution", and the best we can tell, it is not.
I think we now have to wait for the libata bug fix to make it into production kernels. Then we will see what the actual error is. "unaligned write command" never made sense to me, and now we know it is most likely bogus.
K.O.
I did not imply that NCQ in itself allows a command to be left unserviced indefinitely. It can only be postponed by the hardware in that it may reorder the commands in any way it likes. This is just how NCQ works.
Thus, an indefinite postponement can only occur if someone "pressures" the queue consistently. Actually, the drive is free to reorder new incoming commands and intersperse them with previous ones; as a matter of fact, there is no difference between issuing 32 commands in short succession and issuing a few more only after some have finished. Call that behaviour a design flaw, but I think it exists, and the problem in question surfaces only when some other conditions are met.
And I strongly believe that OpenZFS can cause exactly that situation, especially with write patterns of raidz under high I/O pressure. I doubt that this bug would occur with other filesystems where no such complex patterns from several internal queues ever happen.
As to why the "fixes" worked sometimes (or seemed to have worked): As I said, #6 and #7 both disable NCQ. Reducing the speed to 1.5 Gbps will most likely reduce the I/O pressure enough to make the problem go away and other solutions may help people who really have hardware problems.
Also, I have read of nobody so far who has disabled NCQ without also doing something else alongside (e.g. reducing the speed as well). I refrained from disabling NCQ first only because I thought it would hurt performance, which it did not. Thus, my experiments ruled out one potential cause after another, leaving only the disabling of NCQ as the effective cure. I admit that I probably should wait a few more nights before jumping to conclusions, however these problems were consistent with every setup I tried so far. (P.S.: It has been three days in a row now with no problems.)
Nothing written here nor anything I have tried so far refutes my theory.
I agree there is a slight chance of my WDC drives having a problem with NCQ in the first place - I have seen comments on some Samsung SSDs having that problem with certain firmware revisions. But that would not have gone unnoticed, I bet.
Just FYI the Unaligned Write error is a bug in Linux's libata:
https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/
Unfortunately, this patch was never applied and the issue got no further attention after a short discussion. There also seems to be no other cleanup having been done on this topic, at least I couldn't find anything related in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/drivers/ata/libata-scsi.c
Basically libata is not implementing the SAT (SCSI-to-ATA) translation correctly and ends up converting ATA errors it shouldn't, producing nonsense. This is still a real error (probably a link reset or similar), but the SCSI conversion is broken.
I think you are correct on this. I'm also seeing the error on a system where I put a new 6.0 Gbps hard disk with 512-byte sectors into an older cartridge with old cabling (which I am going to replace). For this disk, zpool status lists 164 write errors and a degraded state after writing about 100 GB. The hard disk's UDMA_CRC_Error_Count SMART raw value increased from 0 to 3, but it otherwise has no problems. The dmesg info also indicates a prior interface/bus error that is then decoded as an unaligned write on the tagged command:
[18701828.321386] ata4.00: exception Emask 0x10 SAct 0x3c00002 SErr 0x400100 action 0x6 frozen
[18701828.322053] ata4.00: irq_stat 0x08000000, interface fatal error
[18701828.322652] ata4: SError: { UnrecovData Handshk }
[18701828.323256] ata4.00: failed command: WRITE FPDMA QUEUED
[18701828.323885] ata4.00: cmd 61/58:08:b8:da:63/00:00:48:00:00/40 tag 1 ncq dma 45056 out
res 40/00:08:b8:da:63/00:00:48:00:00/40 Emask 0x10 (ATA bus error)
[...]
[18701830.479670] sd 4:0:0:0: [sdo] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
[18701830.479679] sd 4:0:0:0: [sdo] tag#1 Sense Key : Illegal Request [current]
[18701830.479687] sd 4:0:0:0: [sdo] tag#1 Add. Sense: Unaligned write command
[18701830.479696] sd 4:0:0:0: [sdo] tag#1 CDB: Write(16) 8a 00 00 00 00 00 48 63 da b8 00 00 00 58 00 00
The reason it never got applied is mostly because as it turns out this is a deeper architectural issue with libata, there is no valid SCSI error code here. Sadly I'm not familiar enough with Linux SCSI midlayer to implement the necessary changes.
CRC errors are not the only type of link error. You are probably losing the SATA link which causes reset/retraining which is one of the known things libata doesn't handle correctly.
I have just got this problem on a brand new WD Red WD40EFPX. I bought it yesterday to replace the failed mirror drive. Getting these errors while ZFS is resilvering it.
No question about faulty controller, cable or anything else hardware-related. The system worked for a very long time, the failed component I have replaced is the disk. The new disk is unlikely to be bad. One way or another, it is related to the new disk interaction with the old system.
This is not really a ZFS issue, it's a hardware/firmware issue being handled badly by the Linux kernel's SCSI subsystem. This issue should probably be closed here.
@ngrigoriev These errors are in pretty much all cases hardware/firmware-related. Post the kernel log if you want me to take a look at the issue.
@lorenz
I was able to finish the resilvering process, but at the very end it started failing again. And this was with "libata.force=3.0 libata.force=noncq"
Right after reboot:
[ 74.500345] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 74.500372] ata3.00: failed command: WRITE DMA EXT
[ 74.500391] ata3.00: cmd 35/00:08:e8:28:00/00:00:16:00:00/e0 tag 23 dma 4096 out
res 40/00:20:58:6f:05/00:00:00:00:00/e0 Emask 0x4 (timeout)
[ 74.500409] ata3.00: status: { DRDY }
[ 74.500423] ata3: hard resetting link
[ 74.976320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 74.977298] ata3.00: configured for UDMA/133
[ 74.977339] sd 2:0:0:0: [sdb] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=31s
[ 74.977347] sd 2:0:0:0: [sdb] tag#23 Sense Key : Illegal Request [current]
[ 74.977354] sd 2:0:0:0: [sdb] tag#23 Add. Sense: Unaligned write command
[ 74.977363] sd 2:0:0:0: [sdb] tag#23 CDB: Write(16) 8a 00 00 00 00 00 16 00 28 e8 00 00 00 08 00 00
[ 74.977373] blk_update_request: I/O error, dev sdb, sector 369109224 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[ 74.977403] zio pool=******** vdev=/dev/disk/by-id/ata-WDC_WD40EFPX-68C6CN0_WD-WX12D14P6EH0-part1 error=5 type=2 offset=188982874112 size=4096 flags=180880
[ 74.977433] ata3: EH complete
Drive:
Device Model: WDC WD40EFPX-68C6CN0
Serial Number: WD-WX12D14P6EH0
LU WWN Device Id: 5 0014ee 216332baf
Firmware Version: 81.00A81
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is: Sun Jun 9 09:50:07 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 100 253 021 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 1
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 17
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 1
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 0
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 3
194 Temperature_Celsius 0x0022 111 107 000 Old_age Always - 36
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 8
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
I understand that it is not ZFS's fault directly, but, interestingly enough, it is triggered by ZFS specifically. This machine has 5 HDDs. None of them has demonstrated this issue in years of operation. It only started to happen with this new WD Red drive that replaced the failed one.
Tried everything I have read about: all combinations of the options, libata.force (noncq, 3.0, even 1.5). Nothing really worked; at most, after a couple of hours, even after a successful resilvering of the entire drive, a bunch of errors would just appear. I have also noticed that if I set the speed to 1.5 Gbps, the other drives on this controller start getting similar problems.
SCSI link power management is forced to max_performance for all hosts.
I have a combination of different drives in this home NAS, some are 6Gbps, some are 3.0.
What I am trying now: I have connected this new drive to the second SATA port on the motherboard instead of the PCIe SATA controller, and I have removed all libata settings. So far so good, keeping fingers crossed. If that does not help, then I am out of options. I have already changed the cable to be sure. Another controller? Well, I have effectively already tried two different ones: ASMedia and the onboard Intel one.
Basically what's happening is that your disk does not respond within 7 or 10s (the Linux ATA command timeout) to a write command the kernel sent. ATA does not have a good way to abort commands (SCSI and NVMe do), so the kernel "aborts" the command by resetting the link. Problem is the link reset is improperly implemented in Linux, resulting in spurious write errors and bogus error codes. Basically Linux does not automatically retry outstanding I/O requests, instead failing them with an improper error code as this is not standardized behavior and as such doesn't have a proper error code.
Unless you've hot-plugged the disk I suspect you have either a cabling issue or one side of the link (either the SATA controller or the disk controller) is bad as we see CRC errors on the link. This would explain the weird timeouts as the command might have been dropped due to a bad CRC.
Yes, it seems so, and it only happens under heavy write activity, apparently.
Is there a way to control this timeout? (I understand it is not the place to ask this kind of question :( )
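Not an answer from the thread, but for what it's worth: the generic SCSI command timeout is exposed per device in sysfs and can be raised; whether that actually helps with the link-reset behaviour described above is an open question (sdX is a placeholder):

$ cat /sys/block/sdX/device/timeout        # current timeout in seconds, typically 30
$ echo 60 | sudo tee /sys/block/sdX/device/timeout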
Reporting an unusual situation. Have ZFS mirror array across two 1TB SSDs. It regularly spews "Unaligned write command" errors. From reading reports here and elsewhere, this problem used to exist, was fixed years ago, not supposed to happen today. So, a puzzle.
It turns out that the two SSDs report different physical sector size, one reports 512 bytes, one reports 4096 bytes. Same vendor, same model, same firmware. (WTH?!?)
zpool reports the default ashift of 0 (autodetect); zdb reports ashift 12 (correct for 4096-byte sectors).
So everything seems to be correct, but the errors are there.
The "unaligned write command" errors only some from the "4096 bytes" SSD. After these write errors, "zpool scrub" runs without errors ("repaired 0B"). Two other zfs mirror pools on the same machine run without errors (2x10TB and 2x6TB disks, all report 4096 physical sector size).
K.O.
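For anyone wanting to run the same checks on their own pool, something like the following should expose the sector-size mismatch and the ashift in use (a sketch; "poolname" is a placeholder):

$ lsblk -o NAME,MODEL,PHY-SEC,LOG-SEC        # physical vs. logical sector size per disk
$ zpool get ashift poolname                  # pool-level ashift property (0 = autodetect)
$ sudo zdb -C poolname | grep ashift         # ashift actually recorded in the vdev labels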