openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

high iowait and nothing running with zstd with recordsize=1M #13432

Open justinpryzby opened 2 years ago

justinpryzby commented 2 years ago
| Type | Version/Name |
| --- | --- |
| Distribution Name | CentOS |
| Distribution Version | 7 |
| Kernel Version | 3.10.0-1160.62.1.el7 |
| Architecture | x86_64 |
| OpenZFS Version | 2.1.4-1 |

Problem started several weeks ago, probably due to changes in our application. Postgresql and ZFS are running on a KVM/QEMU VM on hardware we control. Postgresql is writing to a ZFS tablespace, loading data with INSERT. We improved the speed of our loader, so the INSERTs are now happening ~2x as fast. Our largest customer writes ~42GB of data per day (before compression) and has now seen this issue a handful of times. On the tablespace housing the large table:

zfs/cdr_table  compressratio  9.28x    -
zfs/cdr_table  compression    zstd-12  local
zfs/cdr_table  recordsize     1M       inherited from zfs
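For reference, a property listing like the one above is what `zfs get` prints for the named properties; a sketch using only the dataset name from this report:

```sh
# Show the compression settings and achieved ratio for the tablespace dataset
zfs get compressratio,compression,recordsize zfs/cdr_table
```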

top shows nothing using the CPU, but the load average is high: load average: 23,09, 23,09, 22,86. And iowait is very high: 37,5 wa

In the past, I've had to forcibly shut down the VM - postgres will not shut down cleanly, and various processes are clearly "stuck". I'm currently unable to strace the 2 processes that are inserting into the large table - they're stuck in kernel space.

[pryzbyj@database ~]$ ps -O wchan========================= 28758 1794
  PID ========================  S TTY      TIME COMMAND
 1794 futex_wait_queue_me       S ?        00:00:04 postgres: telsasoft ts 192.168.122.13(53954) INSERT
28758 dbuf_destroy              D ?        00:00:13 postgres: telsasoft ts 192.168.122.13(53600) INSERT
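As a general technique for processes stuck in the D state like the one above, the kernel stack can usually be read from procfs; a hedged sketch (the PIDs come from the ps output above, root is typically required, and /proc/PID/stack depends on the kernel's stack-tracing support):

```sh
# Show which kernel function each process of interest is sleeping in
ps -o pid,wchan:30,stat,cmd -p 1794,28758

# Dump the full kernel stack of the D-state INSERT backend (needs root)
sudo cat /proc/28758/stack
```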

Maybe this is important?

 2102 zil_commit_impl           D ?        07:41:04 postgres: checkpointer

There are many processes stuck in:

 1794 futex_wait_queue_me       S ?        00:00:04 postgres: telsasoft ts 192.168.122.13(53954) INSERT

The first badness I see in the logs the last couple of times this happened looks like:

[463031.727764] INFO: task dbuf_evict:620 blocked for more than 120 seconds.

[111965.770913] INFO: task txg_sync:1825 blocked for more than 120 seconds.
[111965.770916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[111947.111520] INFO: rcu_sched detected stalls on CPUs/tasks:
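When hung-task messages like these show up, one common way to capture every blocked task at once (rather than waiting for the 120-second watchdog) is the magic sysrq interface; a sketch, assuming root access and that the kernel log is the desired destination:

```sh
# Enable all sysrq functions for this boot
echo 1 | sudo tee /proc/sys/kernel/sysrq

# Dump stack traces of every task in uninterruptible (D) sleep to the kernel log
echo w | sudo tee /proc/sysrq-trigger
dmesg | tail -n 200
```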

My logs show that I had to forcibly reboot on Apr 21, Apr 27 and now on May 6.

One recent change that I made was to move postgres' WAL onto ZFS, using compression=lz4 and recordsize=8k. As an experiment, I'm going to move it back to ext4 now. It seems like a ZFS issue in any case. dmesg-zfs-hang4.log
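For context, a dedicated WAL dataset along those lines would typically be created roughly as follows; this is only a sketch, and the dataset name and paths are hypothetical rather than taken from this report:

```sh
# 8k records to match PostgreSQL's 8 kB WAL page writes, cheap lz4 compression,
# and no atime updates on a write-mostly directory
zfs create -o recordsize=8k -o compression=lz4 -o atime=off \
    -o mountpoint=/var/lib/pgsql/pg_wal zfs/pg_wal

# With the cluster stopped, move pg_wal onto the new dataset and symlink it back
# mv $PGDATA/pg_wal/* /var/lib/pgsql/pg_wal/ && rmdir $PGDATA/pg_wal
# ln -s /var/lib/pgsql/pg_wal $PGDATA/pg_wal
```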

rincebrain commented 2 years ago

See #13409 for some useful pointers and ideas, I think.

In general, I would not suggest going over zstd-9 for just about any reason, and usually not even that high - not even for interactive side effects, just because the gains compared to the costs are usually absolutely trivial.
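If the level is dialed back on an existing dataset, only newly written blocks use the new setting; data already on disk keeps whatever level it was written with. A sketch, reusing the dataset name from the report above:

```sh
# Future writes use a cheaper zstd level; existing blocks are left as-is
zfs set compression=zstd-3 zfs/cdr_table

# Confirm the active setting and the overall ratio achieved so far
zfs get compression,compressratio zfs/cdr_table
```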

justinpryzby commented 2 years ago

I saw that issue, but I don't think there are any usable ideas there. In my case, the processes are stuck for minutes or hours. Any suggestions on what other information to collect if it continues to happen?
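For what it's worth, the ZFS-side state that is often captured while a hang like this is in progress includes the internal debug log, the pool's transaction-group history, and per-vdev latency; a hedged sketch, where the pool name zfs is inferred from the dataset names quoted earlier:

```sh
# ZFS internal debug ring buffer (verbose logging can be enabled first)
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_dbgmsg_enable
cat /proc/spl/kstat/zfs/dbgmsg

# Transaction group history - shows whether txg_sync is still making progress
cat /proc/spl/kstat/zfs/zfs/txgs

# Pool health and per-vdev latency while the stall is ongoing
zpool status -v
zpool iostat -vly 5 3
```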

zstd-9 may be a good rule of thumb, but the vast majority of our data is strictly read-only and compresses well, so it makes sense to put extra effort into its initial write, even if it'll only be 10% smaller.

rincebrain commented 2 years ago

Well, stuck for minutes on end might mean something's broken.

Those stuck processes the NMI watchdog is complaining about are lspci trying to reach out and read the PCI(e) bus and getting stuck, and it would be pretty remarkable if ZFS broke that.

This seems to be a VM - are you, perhaps, doing PCIe passthrough of something into the VM?

justinpryzby commented 2 years ago

No, we're not using PCI passthrough. Other than storage, the guest has a virtio NIC. The storage is a fibre SAN on the host, with LVM on top, the LVs being exposed to the QEMU guest as a block device (I realize this is not considered to be an ideal configuration for ZFS).

Evidently, lspci is being run by bash: /etc/profile.d/qt-graphicssystem.sh:

if ( /usr/sbin/lspci 2>/dev/null | grep -qi "VGA compatible controller: Cirrus Logic GD 5446" ) ; then ...

[pryzbyj@database ~]$ rpm -qf /etc/profile.d/qt-graphicssystem.sh
qt-settings-19-23.12.el7.centos.noarch

rincebrain commented 2 years ago

I might be suspicious of something in the PCIe stack going wrong, then, if it's hanging - I have in the past had to switch away from using virtio NICs when they would randomly hang or stop delivering packets with no warning or way to resolve short of restarting the VM.


jittygitty commented 2 years ago

@justinpryzby You said you used lz4 before zstd; were there any similar issues when using lz4?

Personally, I think lz4 may be a 'safer' bet until zstd's sweet spot is ironed out for various use cases; there are some PRs and patches by @rincebrain and others that I think are meant to improve zstd. I've seen others have issues with zstd, so again, how did lz4 compare?

I'm guessing you don't just use zfs/zpool instead of LVM because you need other filesystems on top of LVM and a ZVOL can't meet your needs, or for some other reason?

justinpryzby commented 2 years ago

The NIC works fine, and I was able to log into the server via SSH during the issue.

This is still an issue for us, requiring us to forcibly power the VM off and on every couple of weeks.


justinpryzby commented 1 year ago

@stalebot: mentioning that this is still an issue on this server, now running zfs 2.1.6 and 3.10.0-1160.76.1.el7.x86_64. BTW, I'm not sure that recordsize/compression are actually related, but I mentioned them because they're set to non-default values.

justinpryzby commented 1 year ago

Data point: since upgrading to kernel-lt-5.4.223-1.el7.elrepo.x86_64, the issue hasn't recurred in 45 days. Before, it occurred more than once per week. It's possible that the issue started after upgrading the kernel. I'm not sure of the exact date/kernel, but before March 2022, this server was running kernel-3.10.0-1160.45.1.el7.x86_64 and zfs-2.0.6-1.el7.x86_64 with no issues.

Another data point: I have zfs 2.1.7 running on a couple dozen CentOS 7 servers, and this is the only place I've seen this issue. This is also our largest server, loading 2x-3x more data than anywhere else. There's one 5 TiB pool, plus 2 other pools for secondary data.

justinpryzby commented 1 year ago

Maybe related to this, but who knows, because in 40 minutes I wasn't able to find even a single line of source code for Red Hat's kernel.

2022-02-16 - Rado Vrbovsky rvrbovsk@redhat.com [3.10.0-1160.59.1.el7]

justinpryzby commented 1 year ago

Related to #9130?