OpenZFS for Linux interaction problem with NCQ - potential data loss

meyergru commented 1 year ago

System information

Linux x64 box: Proxmox 8.04 | kernel 6.2.16-12-pve | x64 | OpenZFS zfs-2.1.12-pve1 | 8x WDC connected through SATA

Describe the problem you're observing

There is an old issue that partly relates to this, but I think it has not been classified as a bug, and what is worse, it is a bug that leads to data destruction.

Just to reiterate what I wrote about this here (https://github.com/openzfs/zfs/issues/10094#issuecomment-1707993156): I have a Linux box with 8 WDC 18 TByte SATA drives, 4 of which are connected through the mainboard controllers (AMD FCH variants) and 4 through an ASMEDIA ASM1166. They form a raidz2 running under Proxmox with a 6.2 kernel. During my nightly backups, the drives would regularly fail (sometimes "degraded" and sometimes "failed") and errors showed up in the system log, more often than not "unaligned write" errors.

The first thing to note is that one poster in the thread mentioned that the "Unaligned write" message is caused by a bug in libata, in that "other" errors are mapped to this one in the SCSI translation code (https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/). Thus, the actual error message is meaningless.

In the old issue, several possible remedies were offered, such as:

1. Faulty SATA cables (I replaced them all, no change, but I admit this could be the problem in some cases)
2. Faulty disks (Mine were known to be good, and also, errors were randomly distributed among them)
3. Power saving in the SATA link or the PCI bus (disabling this did not help)
4. Problematic controllers (Both the FCH and the ASM1166 chips as well as a JMB585 showed the same behaviour)
5. Limiting SATA speed to SATA 3.0 Gbps or even to 1.5 Gbps (3.0 Gbps did not help, and was not even possible with the ASM1166 as the speed was always reset to 6.0 Gbps, but I could check with FCH and JMB585 controllers)
6. Disabling NCQ (guess what, this helped!)
7. Replacing the SATA controllers with an LSI 9211-8i (I guess this would have helped, as others have reported, because it probably does not use NCQ)
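For anyone who wants to see which drives currently run with a command queue, here is a minimal sketch that reads the per-disk queue_depth attribute from sysfs. The function name ncq_status and its optional directory argument are made up for this example. A depth of 1 is treated as "NCQ off", which matches how the workaround later in this thread disables it. On non-ATA devices a depth above 1 does not necessarily mean NCQ.

```shell
#!/usr/bin/env bash
# ncq_status [SYSFS_BLOCK_DIR]
# Prints one line per sd* disk: "<disk> <queue_depth> <ncq-on|ncq-off>".
# SYSFS_BLOCK_DIR defaults to /sys/block; the parameter only exists so the
# function can be tried against a scratch directory (assumption of this sketch).
ncq_status() {
  local root=${1:-/sys/block} dev depth
  for dev in "$root"/sd*; do
    # Skip when the glob matched nothing or the attribute is missing.
    [ -r "$dev/device/queue_depth" ] || continue
    depth=$(cat "$dev/device/queue_depth")
    if [ "$depth" -gt 1 ]; then
      echo "${dev##*/} $depth ncq-on"
    else
      echo "${dev##*/} $depth ncq-off"
    fi
  done
}
```

Calling ncq_status without arguments inspects the real /sys/block and needs no root privileges, since it only reads.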

I am 99% sure that it boils down to a bad interaction between OpenZFS and libata with NCQ enabled, and I have a theory why this is so: When you look at how NCQ works, it is a queue of up to 32 (or, to be exact, 31 for implementation reasons) tasks that can be given to the disk drive. Those tasks can be handled in any order by the drive hardware, e.g. in order to minimize seek times. Thus, when you give the drive 3 tasks, like "read sectors 1, 42 and 2", the drive might decide to reorder them and read sector 42 last, thus saving one seek operation in the process.

Now imagine a time of high I/O pressure, like when I do my nightly backups. OpenZFS has some queues of its own which are then given to the drives and for each task started, OpenZFS expects a result (but in no particular order). However, when a task returns, it opens up a slot in the NCQ queue, which is immediately filled with another task because of the high I/O pressure. That means that the sector 42 could potentially never be read at all, provided that other tasks are prioritized higher by the drive hardware.

I believe this is exactly what is happening: if one task result is not received within the expected time frame, a timeout or an unspecific error occurs, which is then reported as "unaligned write".

IMHO, this is the result of putting one (or more) queues within OpenZFS in front of a smaller hardware queue (i.e. NCQ).

It explains why both solutions 6 and probably 7 from my list above cure the problem: Without NCQ, every task must first be finished before the next one can be started. It also explains why this problem is not as evident with other filesystems - were this a general problem with libata, it would have been fixed long ago.

I would even guess that reducing SATA speed to 1.5 Gbps would help (one guy reported this). I bet this is simply because the resulting speed of ~150 MByte/s is somewhat lower than what modern hard disks can deliver, such that the disk can always finish tasks before the next one is started, whereas 3 Gbps is still faster than modern spinning rust.
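The ~150 MByte/s figure follows from the 8b/10b line encoding that SATA uses: every 10 bits on the wire carry 8 bits of data. A small sketch of the arithmetic (the function name is made up for this example):

```shell
#!/usr/bin/env bash
# sata_payload_mb_per_s LINE_RATE_MBIT
# SATA uses 8b/10b encoding: 10 line bits carry 8 data bits, so
# payload bytes/s = line bits/s * 8/10 / 8 = line bits/s / 10.
sata_payload_mb_per_s() {
  echo $(( $1 * 8 / 10 / 8 ))
}

for rate in 1500 3000 6000; do
  echo "$rate Mbit/s line rate -> $(sata_payload_mb_per_s "$rate") MByte/s payload"
done
```

This prints 150, 300 and 600 MByte/s for the three SATA generations. These are upper bounds on the link, before protocol overhead.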

If I am right, two things should be considered:

a. The problem should be analysed and fixed in a better way than just disabling NCQ, like throttling the libata NCQ queue if pressure gets too high, just before errors are thrown. This would give the drive time to finish existing tasks.
b. There should be a warning or some kind of automatic mechanism to disable NCQ for OpenZFS for the time being.

I also think that the performance impact of disabling NCQ with OpenZFS is probably negligible, because OpenZFS has prioritized queues for different operations anyway.

Describe how to reproduce the problem

Create a raidz2, copy a large number of files to it, preferably from a fast source like an NVMe disk.

Include any warning/errors/backtraces from the system logs

Irrelevant because of another bug in the libata/scsi abstraction layer, see: https://lore.kernel.org/all/20230623181908.2032764-1-lorenz@brun.one/

mabod commented 1 year ago

I am wondering how this is related to the IO scheduler. Have you tested this with mq-deadline, kyber or bfq?
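For reference, the active scheduler of a disk is the bracketed entry in /sys/block/<disk>/queue/scheduler. The following sketch extracts it; the function names and the optional directory argument are made up for this example.

```shell
#!/usr/bin/env bash
# active_scheduler "LINE"
# Extracts the bracketed entry from the content of
# /sys/block/<disk>/queue/scheduler, e.g. "none [mq-deadline] kyber bfq".
active_scheduler() {
  local line=$1
  case $line in
    *\[*\]*) line=${line#*\[}; echo "${line%%\]*}" ;;
    *) return 1 ;;
  esac
}

# list_schedulers [SYSFS_BLOCK_DIR]
# Prints "<disk> <active scheduler>" for every sd* disk. The directory
# argument defaults to /sys/block and only exists to allow a dry run.
list_schedulers() {
  local root=${1:-/sys/block} dev
  for dev in "$root"/sd*; do
    [ -r "$dev/queue/scheduler" ] || continue
    echo "${dev##*/} $(active_scheduler "$(cat "$dev/queue/scheduler")")"
  done
}
```

Switching a scheduler for a test is done by writing one of the listed names to the same file as root, for example echo bfq > /sys/block/sdX/queue/scheduler.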

amotin commented 1 year ago

Do you have any evidence of command timeouts in your tests?

Last year I was specifically testing different HDDs for behavior under a mix of sequential and random reads. I saw that disks indeed prioritize sequential reads to stay more efficient and reduce the number of head seeks. But on all HDDs I tested I also saw a hard deadline between 1 and 4 seconds, depending on the model, after which the firmware broke the linear I/O pattern and went on to execute the random I/Os. So there should be no timeouts from that as long as the HDD firmware is sane.
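The difference between the starvation scenario from the original report and a firmware with a hard deadline can be illustrated with a toy model. This is only a sketch under strong assumptions: the "drive" always serves the queued request closest to the head, a freed slot is refilled at once with another nearby request, and time is counted in served requests rather than seconds. It says nothing about how real firmware schedules. The function name and all numbers are made up.

```shell
#!/usr/bin/env bash
# simulate_ncq DEPTH STEPS [DEADLINE]
# Toy model, not real firmware: the queue always holds DEPTH requests. One of
# them targets a far-away sector; the others sit just ahead of the head.
# Each step the "drive" serves the request closest to the head, and the freed
# slot is refilled immediately with another nearby request (high I/O pressure).
# With DEADLINE > 0, a request that has waited DEADLINE steps is served first.
simulate_ncq() {
  local depth=$1 steps=$2 deadline=${3:-0}
  local far=1000000 head=0 next=1 step i best dist bestdist
  local -a sector=() age=()
  sector[0]=$far; age[0]=0
  for ((i = 1; i < depth; i++)); do
    sector[i]=$((next++)); age[i]=0
  done
  for ((step = 1; step <= steps; step++)); do
    best=-1; bestdist=-1
    for ((i = 0; i < depth; i++)); do
      if ((deadline > 0 && age[i] >= deadline)); then
        best=$i; break            # an overdue request wins over seek distance
      fi
      dist=$((sector[i] - head))
      ((dist < 0)) && dist=$((-dist))
      if ((bestdist < 0 || dist < bestdist)); then
        best=$i; bestdist=$dist
      fi
    done
    if ((sector[best] == far)); then
      echo "far request served at step $step"
      return 0
    fi
    head=${sector[best]}
    sector[best]=$((next++)); age[best]=0   # refill the freed slot
    for ((i = 0; i < depth; i++)); do
      ((i != best)) && age[i]=$((age[i] + 1))
    done
  done
  echo "far request still waiting after $steps steps"
}

simulate_ncq 31 1000      # pure shortest-seek-first
simulate_ncq 31 1000 50   # same load, with a deadline of 50 steps
```

In this model the first call reports that the far request is still waiting after 1000 steps, while the second call serves it at step 51. Whether a real drive behaves like the first or the second case is exactly the open question in this thread.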

As a result of that investigation I actually made an improvement to the ZFS I/O scheduler to explicitly delay low-priority I/Os if high-priority ones are not completing for too long: https://github.com/openzfs/zfs/pull/11166 . Obviously it works only in one direction, but it should still reduce the chances of the starvation scenarios you are describing.

On top of that, at least the FreeBSD ATA/SCSI stack explicitly injects non-queued commands every half command timeout interval. It forces a disk queue flush in case the drive does not do it right. Supposedly this happened on some old SCSI disks. I am not sure it is really needed these days, but it does not do much harm, so it is still there. I don't know if Linux has any similar mechanism, but it could.

In any case, the command timeouts are not a ZFS problem, but a problem of the disk driver that implements them. ZFS itself would "happily" wait forever if it just has no other choice. And disabling NCQ is a bad idea, since only the HDD's firmware can schedule multiple I/Os more efficiently by knowing the disk's internal physical characteristics.

meyergru commented 1 year ago

> I am wondering how this is related to the IO scheduler. Have you tested this with mq-deadline, kyber or bfq?

No, that is a production system, so I am glad to have it working again by disabling NCQ.

> Do you have any evidence of command timeouts in your tests?

No, as I wrote, the error messages are unspecific in that they are in an "else" branch which catches whatever is not handled specifically.

> On top of that, at least the FreeBSD ATA/SCSI stack explicitly injects non-queued commands every half command timeout interval. It forces a disk queue flush in case the drive does not do it right. Supposedly this happened on some old SCSI disks. I am not sure it is really needed these days, but it does not do much harm, so it is still there. I don't know if Linux has any similar mechanism, but it could.

I do not know if that exists, but I agree that it should. And I admit it could be that the drive firmware does not set a hard deadline. I cannot investigate because I only have one type of drive.

> In any case, the command timeouts are not a ZFS problem, but a problem of the disk driver that implements them. ZFS itself would "happily" wait forever if it just has no other choice. And disabling NCQ is a bad idea, since only the HDD's firmware can schedule multiple I/Os more efficiently by knowing the disk's internal physical characteristics.

Probably. However, I would argue that the behaviour of the underlying drivers is just what it is, and OpenZFS is potentially making assumptions about how the driver "should" behave. Those assumptions probably hold for FreeBSD (which it originally was designed for), as you say, but probably not for Linux. That is why I titled the defect to reflect the interaction between OpenZFS and NCQ on Linux.

And as for the "bad idea": I would rather have a reliable array than an optimized one for the time being. But you are correct, the way to go is to fix the problem even with NCQ turned on.

mabod commented 1 year ago

Back to my question: Is this somehow influenced by the IO scheduler? I assume you are using "none". Would it make any difference if you use mq-deadline or bfq?

meyergru commented 1 year ago

> Back to my question: Is this somehow influenced by the IO scheduler? I assume you are using "none". Would it make any difference if you use mq-deadline or bfq?

I do not know if changing it would help (it is mq-deadline now), and as I said: This being a production system with over 60 TByte worth of data, I am not going to experiment on it. Every time those errors occur, I have to scrub the whole array for > 24 hours and hope that no files are corrupted after this (been there, done that). The experiments took me the last three weeks until I found that disabling NCQ would have been the fix in the first place, and I bought two now useless SATA controllers along the way.

spixx commented 1 year ago

I would like to point out that I am experiencing a similar issue. I do not run this in production (homelab), so I might be able to assist. When I set libata.force=noncq in my kernel boot line, it works "flawlessly" (running on Proxmox with an ASMedia controller).
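To check whether that option is actually in effect after a reboot, one can look at the kernel command line. The sketch below uses a made-up helper name, has_noncq. It assumes the documented libata.force syntax of a comma-separated list whose entries may carry a port prefix such as 1:noncq.

```shell
#!/usr/bin/env bash
# has_noncq "CMDLINE"
# Succeeds if the given kernel command line contains a libata.force= option
# whose comma-separated list includes "noncq", with or without a port prefix.
has_noncq() {
  local word
  # Intentionally unquoted: split the command line into words.
  for word in $1; do
    case $word in
      libata.force=*)
        case ",${word#libata.force=}," in
          *[,:]noncq,*) return 0 ;;
        esac ;;
    esac
  done
  return 1
}
```

On a running system this would be called as has_noncq "$(cat /proc/cmdline)" && echo "noncq is set". On Proxmox the option is made persistent either in /etc/default/grub followed by update-grub, or in /etc/kernel/cmdline followed by proxmox-boot-tool refresh, depending on the boot loader in use.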

ashleyw-gh commented 1 year ago

Just a comment: I've been running OpenZFS for years, but we recently switched to Linux RAID (using md), and after disabling NCQ our throughput went up between 5- and 10-fold on a Veeam Active Full job (this is a 22-disk RAID 10 of 8 TB Toshiba drives, spinning rust). So I don't believe this issue is specific to OpenZFS; it looks like a more general issue. In our case we have a cron job with an @reboot task that runs this script at boot time to make sure NCQ is disabled for all our drives. Sadly, I currently don't have access to spare hardware to reproduce the issue on OpenZFS.

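# Run as root. Disables NCQ by limiting each drive's queue depth to 1.
for drive in sd{b..x}; do
  depth_file=/sys/block/$drive/device/queue_depth
  # Skip drive letters that do not exist on this system.
  [ -e "$depth_file" ] || continue
  queue_depth=$(cat "$depth_file")
  if [ "$queue_depth" != "1" ]; then
    echo "disabling NCQ for $drive"
    echo 1 > "$depth_file"
  else
    echo "NCQ already disabled for $drive"
  fi
done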

richardelling commented 10 months ago

FWIW, NCQ has a long, sordid history of breakage. So it is not surprising we continue to find more. Clearly there are other integration points in the Linux stack that cause problems. However, it is safe to change the sysfs queue_depth on the fly. You might consider using a udev rule instead of a systemd solution, because it would also handle the hot-plug case and you can restrict it to ATA drives. Yes, I do mean to imply that native SCSI is better than ATA; NCQ is just one area where ATA sucks rocks.
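A udev rule along those lines could look like the sketch below, wrapped in a small shell function so it can be written to a file. This is untested and rests on assumptions: that udev has already set ID_BUS to "ata" for SATA disks when the rule runs (hence a late rule file name), and that the file name 99-disable-ncq.rules, which is made up here, suits your system.

```shell
#!/usr/bin/env bash
# print_noncq_udev_rule
# Emits a udev rule that sets queue_depth to 1 for whole disks that udev
# identifies as ATA. Assumption: ENV{ID_BUS} is populated by earlier rules.
print_noncq_udev_rule() {
  cat <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_BUS}=="ata", ATTR{device/queue_depth}="1"
EOF
}
```

Installing it would mean running, as root, print_noncq_udev_rule > /etc/udev/rules.d/99-disable-ncq.rules, then udevadm control --reload and udevadm trigger --subsystem-match=block. Afterwards, check /sys/block/sdX/device/queue_depth to confirm that the rule took effect.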

meyergru commented 3 months ago

It has been a while since the original problem turned up. In my case, it was definitely caused neither by a hardware issue with the drives or cabling nor by a driver / chipset problem, see also this issue:

https://github.com/openzfs/zfs/issues/10094

However, I think that the original problem was within OpenZFS alone and may be fixed by now:

What I found is that a hint in this pull request: https://github.com/openzfs/zfs/pull/15414 leads to this pull request: https://github.com/openzfs/zfs/pull/15588. The latter has some interesting notes which could explain the errors reported here completely and which are not too far off my suspicions about I/O pressure causing this.

In the end, the pull request that was finally accepted is https://github.com/openzfs/zfs/pull/16032, and it is contained in OpenZFS 2.2.4. To verify that the fix is present in your Linux installation, you can run cat /sys/module/zfs/parameters/zfs_vdev_disk_classic. If this is present and has the value 1, the fix is present and you can probably safely remove libata.force=noncq. I did this on my Proxmox 8.2.4 installation, which now has OpenZFS 2.2.4 under the hood, instead of 2.1.12 in late 2023.

If you still experience problems, their root cause may be something different than the underlying OpenZFS problem from 2020.

Therefore I close this issue now.

robn commented 3 months ago

Not quite: zfs_vdev_disk_classic=1 is to use the "classic" version, that is, the same code that has existed since forever. Set it to 0 to use the new submission method.

(I have no opinion on this particular issue; just pointing out the inverted option).
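To make the meaning of the two values explicit, here is a small sketch that reads the parameter and prints which code path it selects, following the description above (1 = classic, 0 = new submission method). The function name and the optional file argument are made up for this example.

```shell
#!/usr/bin/env bash
# describe_vdev_disk_mode [PARAM_FILE]
# Reads zfs_vdev_disk_classic and reports the selected submission code.
# PARAM_FILE defaults to the real sysfs path; it is a parameter only so the
# function can be tried against a scratch file.
describe_vdev_disk_mode() {
  local file=${1:-/sys/module/zfs/parameters/zfs_vdev_disk_classic} value
  if [ ! -r "$file" ]; then
    echo "parameter not present (module not loaded or it predates the option)"
    return 1
  fi
  value=$(cat "$file")
  case $value in
    1) echo "classic submission code in use" ;;
    0) echo "new submission method in use" ;;
    *) echo "unexpected value: $value" ;;
  esac
}
```

So the mere presence of the file shows that the installed module contains the new code, while the value shows which of the two paths is active.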

mabod commented 3 months ago

@meyergru : I am trying to understand what the status quo is on your side.

I understand that you are running with zfs_vdev_disk_classic=1. Is that correct?

And you have NCQ enabled and no issue with zfs 2.2.4. Is that correct?

meyergru commented 3 months ago

Correct. What I have seen is that none of the Proxmox machines I installed in the last few months have experienced the problem, despite the fact I had not used noncq on those. All of these were Intel instead of AMD, so I first thought the problem was AMD only.

However, after finding that something had changed in OpenZFS 2.2.4 and that all of the unaffected machines had been installed with newer Proxmox versions, I turned NCQ back on (i.e. removed noncq) for my own (updated) machine, and it seems fine now.

I cannot say whether this was caused by parts of the patch that are used even when zfs_vdev_disk_classic=1 or by other improvements in OpenZFS. All I can say is that OpenZFS 2.2.4 with default parameters (i.e. NCQ enabled and the default "zfs_vdev_disk_classic=1") now works for me where it did not previously. After all, there is also a new kernel (I think 6.8.12 instead of 6.2-something).
