
very poor server responsiveness, maybe caused by mem leak #12223

Closed: justinpryzby closed this issue 3 years ago

justinpryzby commented 3 years ago

System information

Type | Version/Name
--- | ---
Distribution Name | CentOS
Distribution Version | 7
Linux Kernel | 3.10.0-1127.19.1.el7
Architecture | x86_64
ZFS Version | zfs-2.0.4-1
SPL Version | 2.0.4-1

The server is a VM where zfs was upgraded to 2.0.0 in January and since upgraded to 2.0.4. It has a 128 GB zpool using compress=zstd. It has several times shown abysmal performance: high load, swapping while RAM was still available, and terrible interactivity, including on the non-zfs filesystems. I have logs showing things like psql -c "SELECT version()" taking 30 seconds, high fsync times, timeouts from sudo authentication, and timeouts during postgres protocol negotiation. The vdev is an LVM LV (I know this configuration is discouraged), and non-zfs filesystems share the same LVM PV.

The issue seems to be resolved by "zpool export". We have numerous other servers running zfs 2.0.4 with no issues. The difference may be that this server runs several instances of our application: zfs is used as a postgres tablespace, and multiple instances mean we may have several times as many processes simultaneously inserting new data. The loaders are intended to be staggered, but once the server bogs down, all bets are off. It's possible we had 20-30 loaders running at once, which could be both a cause of the problem and a trigger for additional, continuing problems. There are also more Nagios monitoring checks here, so there may be a race condition or other issue with concurrent access; or the problem may simply be more visible because we run ~15 instances of some checks on this server.
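For reference, the mitigation amounts to cycling the pool, which drops its ARC and other in-kernel state. A minimal sketch, assuming a pool named "tank" (the actual pool name isn't given in the report):

```sh
# Stop anything using the pool first (postgres tablespace, loaders, monitoring).
sudo zpool export tank   # unmounts datasets and detaches the pool, releasing its cached state
sudo zpool import tank   # re-imports the pool; datasets remount automatically
```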

I imagine the issue will recur within 2 months, so I'm asking in advance what diagnostics I can collect when it does.
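(Not from the original report, but a sketch of the kind of data that seems worth capturing during an incident, using the standard OpenZFS kstats and generic Linux tools as shipped with zfs 2.0 on Linux:)

```sh
# ARC size vs. its limits: 'size' pinned near 'c_max' while the box is
# swapping would point at ARC/memory-reclaim interaction.
grep -wE 'size|c|c_max|c_min' /proc/spl/kstat/zfs/arcstats

# Summarized view of the same kstats (the arc_summary tool ships with zfs).
arc_summary | head -40

# Overall memory picture: is swap in use while MemAvailable is still high?
grep -E 'MemTotal|MemAvailable|SwapTotal|SwapFree' /proc/meminfo

# Kernel slab usage; unusually large zio/dnode/dbuf caches can indicate
# leaked or unreclaimed ZFS memory.
sudo slabtop -o | head -25

# Per-second memory/swap/IO activity while the problem is happening.
vmstat 1 10
```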

justinpryzby commented 3 years ago

This seems to be resolved/mitigated by limiting the ARC size; I'll re-open if not. echo $((4*1024*1024*1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max
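(Not part of the original comment, but for completeness: writing to the /sys parameter only lasts until the module is reloaded or the host reboots. The usual way to make the limit persistent is a modprobe option, with the value in bytes; 4 GiB shown here to match the command above:)

```sh
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=4294967296
```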