openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

OpenZFS 2.2 + `overlayfs` / Docker: slow unmounts (`umount2`) #15581

Open · problame opened 8 months ago

problame commented 8 months ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | |
| Distribution Version | |
| Kernel Version | 6.5.11 |
| Architecture | x86_64 |
| OpenZFS Version | 2.2.1 |

Describe the problem you're observing

overlayfs is slow to unmount on top of ZFS, i.e., the unmount takes multiple txg syncs. My use case has the overlayfs lowerdir, upperdir, and workdir all on the same ZFS dataset; let's focus on that case in this issue to keep things simple.
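
For reference, the single-dataset layout described above can be reproduced outside of Docker with something along these lines (a sketch; the pool/dataset name tank/overlaytest and the paths are made-up examples):

# All of lowerdir, upperdir, and workdir live on one ZFS dataset.
zfs create -o mountpoint=/overlaytest tank/overlaytest
mkdir -p /overlaytest/{lower,upper,work,merged}
mount -t overlay overlay \
  -o lowerdir=/overlaytest/lower,upperdir=/overlaytest/upper,workdir=/overlaytest/work \
  /overlaytest/merged
touch /overlaytest/merged/somefile        # dirty the upperdir
time umount /overlaytest/merged           # this unmount is what is intermittently slow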

Why Is This Relevant?

Slow overlayfs unmounts slow down Docker when starting and stopping containers for docker build and docker run. Other container runtimes that use overlayfs are equally affected, as are other uses of overlayfs.

It's a shame because OpenZFS 2.2 added support for whiteouts and, with it, performant support for Docker's overlay2 driver.

NB: pre-OpenZFS 2.2, the only practical option for a CoW Docker graph driver on top of ZFS was Docker's "ZFS Storage Driver", which is built on top of ZFS clones.

Describe how to reproduce the problem

Setup
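
A minimal sketch of an environment matching the description above (the dataset name tank/docker and the daemon.json path are assumptions, not details from this issue):

# /var/lib/docker on its own ZFS dataset, Docker using the overlay2 driver
# (overlay2 on ZFS relies on the whiteout support added in OpenZFS 2.2).
zfs create -o mountpoint=/var/lib/docker tank/docker
cat > /etc/docker/daemon.json <<'EOF'
{ "storage-driver": "overlay2" }
EOF
systemctl restart docker
docker info | grep -i 'storage driver'    # should report overlay2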

Repro (run it a few times; sometimes it's fast, sometimes it takes multiple seconds):

docker create hello-world

Include any warning/errors/backtraces from the system logs

N/A

Initial Analysis

I initially hunted this down to the unmount system call using

strace -T -e umount2  -f -p "$(pgrep dockerd)"
strace: Process 308501 attached with 13 threads
[pid 308508] umount2("/var/lib/docker/overlay2/acdb91b717aec8ddc06ab191a87b16137b96a03882c9434c2a4e6013df073a19-init/merged", MNT_DETACH) = 0 <0.997613>
[pid 308508] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=308501, si_uid=0} ---
[pid 308508] umount2("/var/lib/docker/overlay2/acdb91b717aec8ddc06ab191a87b16137b96a03882c9434c2a4e6013df073a19/merged", MNT_DETACH) = 0 <5.113929>
[pid 308508] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=308501, si_uid=0} ---

Ignore the SIGURG; it's Go runtime machinery that doesn't matter here (dockerd is written in Go). The second unmount took 5.113929 seconds of wall-clock time.

I then used bpftrace to narrow it down further, to sync_inodes_sb (the positional parameters are the kernel function to trace, $1, and the minimum duration in milliseconds to report, $2):

bpftrace -e '
kfunc:$1 { if (@start[tid] != 0) { printf("recursion!!!\n"); } @start[tid] = nsecs; }
kretfunc:$1 /@start[tid] != 0/ {
  $elapsed = (uint64) nsecs - @start[tid];
  $millisecond = (uint64) 1 * 1000 * 1000;
  if ($elapsed/$millisecond >= $2) {
    printf("%s: %d %s %llums\n", probe, pid, comm, $elapsed/$millisecond);
  }
  delete(@start[tid]);
}' sync_inodes_sb 0
Attaching 2 probes...
kretfunc:vmlinux:sync_inodes_sb: 308501 dockerd 0ms
kretfunc:vmlinux:sync_inodes_sb: 308501 dockerd 139ms
kretfunc:vmlinux:sync_inodes_sb: 308501 dockerd 0ms
kretfunc:vmlinux:sync_inodes_sb: 308501 dockerd 5113ms

Workaround

With sync=always set on the dataset that hosts /var/lib/docker, the reproducer is always fast, obviously at the expense of all I/O now being synchronous.
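
Concretely, that amounts to something like (the dataset name tank/docker is an example):

# Workaround: make all writes on the Docker dataset synchronous.
zfs set sync=always tank/docker
# Revert once done testing:
zfs set sync=standard tank/docker         # or: zfs inherit sync tank/docker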

I got that idea from reading the writeback code in Linux mainline and zfs_vnops_os.c's various inspections of wbc->sync_mode (WB_SYNC_NONE vs. other values).

I'm not too familiar with the writeback code but am happy to learn & test.
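
To watch the txg syncs mentioned above while the reproducer runs, something like the following works (pool name tank is an example):

# Tail the per-pool txg history exposed by OpenZFS on Linux.
watch -n 0.5 'tail -n 5 /proc/spl/kstat/zfs/tank/txgs'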

allanjude commented 8 months ago

Do you know how much pending data is waiting around to be flushed when you trigger the unmount?

problame commented 8 months ago

> Do you know how much pending data is waiting around to be flushed when you trigger the unmount?

It's an insignificant amount.

Here's a video of the reproducer with zpool iostat and iostat output.

https://github.com/openzfs/zfs/assets/956573/5231aa52-1c8b-4f3b-98c5-3c884f096b52
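
The monitoring in the video boils down to something like the following (the exact flags are a guess, not taken from the recording):

# Pool-level and device-level I/O at one-second intervals while the repro runs:
zpool iostat -v 1
iostat -x 1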

deviantintegral commented 8 months ago

I came across this from the thread at https://forum.proxmox.com/threads/proxmox-8-1-zfs-2-2-docker-in-privileged-container.137128/.

I've been testing the new overlay2 features and my environment is a little different.

I wasn't able to reproduce this originally, but it seems to trigger more often as time goes on. Perhaps related to the number of containers created and deleted?


I ran docker system prune -af after each test run to make sure things were as clean as possible. I'll say in my case, the "slow" runs are only a second or so.
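
A sketch of a timing loop for this kind of test (hypothetical, not the exact procedure used here):

#!/bin/sh
# Time repeated docker create calls and clean up between runs.
for i in $(seq 1 20); do
  start=$(date +%s.%N)
  cid=$(docker create hello-world)
  end=$(date +%s.%N)
  docker rm "$cid" >/dev/null
  printf 'run %d: %.3f s\n' "$i" "$(echo "$end - $start" | bc)"
done
docker system prune -af >/dev/null        # as in the runs above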

jkerdreux-imt commented 4 months ago

I ran into the same issue. The timing of docker create on ZFS is not consistent:

jkx > deb12 > ~ >> time docker create hello-world
7b1f0e289479d57a29034db5805f6dbee535479683d642e202b1c4ab51150816

Executed in  282.36 millis    fish         external
   usr time   50.31 millis  380.00 micros   49.93 millis
   sys time   11.53 millis  197.00 micros   11.33 millis

jkx > deb12 > ~ >> time docker create hello-world
61dd9256aaf1090c5792aae774b2b55d42213f295251b6435024288c39b38918

Executed in    2.19 secs      fish         external
   usr time   45.60 millis  458.00 micros   45.15 millis
   sys time   19.57 millis  239.00 micros   19.33 millis

jkx > deb12 > ~ >> time docker create hello-world
ddc11ff70270e7cde18bc5a9780f799f47cf79892a84a09cc89ed69d6dcaed15

Executed in    1.14 secs      fish         external
   usr time   46.20 millis  449.00 micros   45.75 millis
   sys time   19.04 millis  233.00 micros   18.81 millis

Exactly the same test (same host, kernel, OS, ...) on plain ext4 is much more consistent and faster:

jkx > deb12 > ~ >> time docker create hello-world
234fe1ea92831464b19d352344c0bb59ca6dff36ddfafc9d62dd9153ee744fed

Executed in   79.44 millis    fish         external
   usr time   40.44 millis  322.00 micros   40.11 millis
   sys time   18.63 millis  213.00 micros   18.42 millis

jkx > deb12 > ~ >> time docker create hello-world
e63b446d612cc36189b21964ec411bacff28a1e61ccc49b65a013350e0f527ee

Executed in   76.91 millis    fish         external
   usr time   50.32 millis  300.00 micros   50.02 millis
   sys time    9.12 millis  202.00 micros    8.91 millis

jkx > deb12 > ~ >> time docker create hello-world
8f55e8b6481fbd3eeb13a2c12c5de2dbd3c564c57c21bd353b88e72202af1536

Executed in   75.72 millis    fish         external
   usr time   52.05 millis  301.00 micros   51.75 millis
   sys time    6.61 millis  203.00 micros    6.40 millis