siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev

Bad latency reading large files from mounted volumes #6553

Closed evandeaubl closed 1 year ago

evandeaubl commented 2 years ago

Bug Report

Description

When I attempt to read large files (multiple GB) in volumes mounted into pods on a Talos cluster, the time to read just one byte out of the file is insanely long, and seems to scale with the size of the file. I've watched transfer metrics, and it looks like the entire file is getting read in at open time (!?!). None of the readahead settings I know of look out of whack (read_ahead_kb setting for the device is 128KB, filesystem readaheads look okay, but probably aren't relevant in light of the info in the next paragraph).
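For reference, this is roughly how those readahead settings can be inspected from a node shell or a privileged debug pod (a sketch, not from the original report; the device name vdb for the QEMU extra disk is an assumption):

# Block-device readahead in KiB (128 is the kernel default).
cat /sys/block/vdb/queue/read_ahead_kb
# Or list the value for every block device at once.
grep . /sys/block/*/queue/read_ahead_kb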

So far, it does not matter what filesystem is mounted, or what CSI the mount is using; I have replicated using RBD, CephFS, geesefs S3, OpenEBS lvm-localpv, and plain old built-in local volumes. hostPath mounts and local volume mounts on the system disk are the only mounts where I haven't been able to replicate this behavior.

I have not been able to replicate this issue when I installed the same version of k3s or k8s via kubeadm in the same environment, which is why I'm thinking this is a Talos issue.

Logs

I haven't found anything in the logs that seems relevant, although when I run the second dd from the reproducer below under strace, the long hang occurs during the openat() call that opens the file from the mounted volume. Let me know if there are any logs you would like me to provide, but hopefully the reproducer is simple enough that you can replicate it in your environment.
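For anyone who wants to reproduce that trace, something along these lines works (a sketch, not from the original report; strace is not in the stock alpine image, and the pod may need the SYS_PTRACE capability or privileged mode for ptrace to work):

# Inside the pod from the reproducer below.
apk add --no-cache strace
# -T appends the time spent in each syscall; the hang shows up in the openat() of /data/zerofile.
strace -T -e trace=openat,read dd if=/data/zerofile of=/dev/null bs=1K count=1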

Environment

Reproducer

  1. Spin up a QEMU cluster using sudo --preserve-env=HOME talosctl cluster create --provisioner qemu --extra-disks 1 --extra-disks-size 20480.
  2. Deploy the PV/PVC/pod from the attached manifests.

Attachments: pvc.yaml.txt, alpine.yaml.txt

  3. Run kubectl exec -it pod/alpine -- /bin/sh
  4. Run dd if=/dev/zero of=/data/zerofile bs=4M count=4096 to create a 16GB file.
  5. Run time dd if=/data/zerofile of=/dev/null bs=1K count=1 to read the first KB of that file.
  6. Observe that the time to run the second dd is very long.

Sample reproducer output (on my dev laptop, with the QEMU VMs running on NVMe):

[evan@nitrogen test]$ kubectl exec -it pod/alpine -- /bin/sh
/ # dd if=/dev/zero of=/data/zerofile bs=4M count=4096
4096+0 records in
4096+0 records out
/ # time dd if=/data/zerofile of=/dev/null bs=1K count=1
1+0 records in
1+0 records out
real    0m 46.57s
user    0m 0.00s
sys 0m 46.50s
/ # 
smira commented 1 year ago

I can reproduce this issue, and it doesn't seem to be anything Talos-specific at the moment; it's a regular mount on the host.

It's also interesting that it only happens the first time; if you run the command again, it succeeds immediately. It might be something ext4-related. Needs more digging.
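Not from the original comment, but a quick experiment to separate page-cache effects from one-time filesystem work is to drop the caches between runs (needs root; from inside the pod this requires a privileged container):

# Slow the first time, immediate on the second run.
time dd if=/data/zerofile of=/dev/null bs=1K count=1
time dd if=/data/zerofile of=/dev/null bs=1K count=1
# Flush dirty data and drop the page cache, dentries and inodes,
# then check whether the slow first open comes back.
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/data/zerofile of=/dev/null bs=1K count=1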

smira commented 1 year ago

At the time when it "hangs", it's blocked in the Linux kernel:

(screenshot of the process blocked in the kernel omitted)
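Not from the original thread, but a similar view can be captured without a debugger by dumping the stuck task's kernel stack from procfs (run as root on the node, e.g. via a privileged pod; assumes the hung dd is the only dd process running):

# While the dd is hung, grab its PID and dump its in-kernel stack.
PID=$(pidof dd)
cat /proc/$PID/stack
# The task state (D = uninterruptible sleep, R = busy in the kernel) is in /proc/$PID/status.
grep State /proc/$PID/status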

smira commented 1 year ago

Re-doing your reproducer with fsType: xfs in the PV "fixes" the problem:

# time dd if=/data/zerofile of=/dev/null bs=1K count=1
1+0 records in
1+0 records out
real    0m 0.00s
user    0m 0.00s
sys 0m 0.00s

So I think it's something ext4-specific coupled with the slow I/O performance of the QEMU volume (talosctl cluster create was never optimized for performance; with .qcow volumes it would probably be much better).
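Not part of the original comment, but a quick way to confirm which filesystem the volume actually ended up with inside the pod (assuming it is mounted at /data as in the reproducer):

# Either of these shows the backing device and filesystem type (ext4 vs xfs).
mount | grep /data
grep /data /proc/mounts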

smira commented 1 year ago

okay, I know what this is, and thanks for reporting this bug!