hpe-storage / csi-driver

A Container Storage Interface (CSI) driver from HPE
https://scod.hpedev.io
Apache License 2.0

performance issue on openshift #357

Open saad0805 opened 10 months ago

saad0805 commented 10 months ago

The Red Hat default value for read_ahead_kb is 64 KB.

For all persistent volumes using the CSI driver we see that the value is 32 MB, and this is causing degraded I/O for our applications on the OCP cluster.

Is this defined somewhere? Why is the Red Hat default value overwritten?

We are using the 2.3.0 operator on Red Hat OpenShift.

datamattsson commented 10 months ago

Thanks for raising this. I've examined our procedures: we do change read_ahead_kb for Nimble devices, but we set it to 128 KB, not 32 MB. I can't find any other traces of read_ahead_kb being touched.

saad0805 commented 10 months ago

Thanks, that is weird then.

When we were on OCP 4.10 / CSI operator 2.2, read_ahead_kb was at 128K.

Since the upgrade of the CSI driver to 2.3 and OCP to 4.12, read_ahead_kb has been changed to 32M:

cat /sys/class/block/dm-*/queue/read_ahead_kb

32768
32768
32768
32768
32768
32768

We tried manually changing the value for a persistent volume:

echo 4096 > /sys/block/dm-27/queue/read_ahead_kb

and then dropped the page cache:

echo 1 > /proc/sys/vm/drop_caches

But when we restart the pod, the value is overwritten and comes back to 32M.

For info, we are using HPE 3PAR as storage. The worker nodes are on Synergy.

All local disks are at the default Red Hat value of 4096K:

cat /sys/class/block/sda/queue/read_ahead_kb

4096

We have other servers in the same Synergy frame with RHEL 7, as well as virtual machines on VMware with RHEL 7 and RHEL 8 using volumes on the same storage; those are at 4096K as well.

From your point of view, if the CSI driver is not setting this, is it happening at the Kubernetes, Red Hat kernel, storage, or server level?

StorageClass:

parameters:
  accessProtocol: fc
  allowMutations: compression, hostSeesVLUN
  compression: "true"
  cpg: XXXX
  csi.storage.k8s.io/controller-expand-secret-name: primera3par-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: kube-system
  csi.storage.k8s.io/controller-publish-secret-name: primera3par-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: kube-system
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/node-publish-secret-name: primera3par-secret
  csi.storage.k8s.io/node-publish-secret-namespace: kube-system
  csi.storage.k8s.io/node-stage-secret-name: primera3par-secret
  csi.storage.k8s.io/node-stage-secret-namespace: kube-system
  csi.storage.k8s.io/provisioner-secret-name: primera3par-secret
  csi.storage.k8s.io/provisioner-secret-namespace: kube-system
  hostSeesVLUN: "true"
  provisioning_type: dedup

CSI driver spec:

spec:
  csp:
    affinity: {}
    labels: {}
    nodeSelector: {}
    tolerations: []
  logLevel: info
  node:
    affinity: {}
    labels: {}
    nodeSelector: {}
    tolerations: []
  disable:
    alletra6000: true
    alletra9000: false
    nimble: true
    primera: false
  disableNodeConformance: false
  iscsi:
    chapPassword: ''
    chapUser: ''
  imagePullPolicy: IfNotPresent
  disableNodeGetVolumeStats: false
  controller:
    affinity: {}
    labels: {}
    nodeSelector: {}
    tolerations: []
  registry: quay.io
  kubeletRootDir: /var/lib/kubelet/

multipath.conf:

devices {
    device {
        product "VV"
        features "0"
        prio alua
        path_selector "round-robin 0"
        rr_weight "uniform"
        path_grouping_policy group_by_prio
        no_path_retry 18
        hardware_handler "1 alua"
        path_checker tur
        detect_prio yes
        rr_min_io_rq 1
        fast_io_fail_tmo 10
        dev_loss_tmo infinity
        vendor "3PARdata"
        failback immediate
    }
}

datamattsson commented 10 months ago

It could be some udev rule that sets it. Have you grep'd around in /etc/udev/rules.d/?
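
For what it's worth, packaged rules live outside /etc as well, so a hedged sketch of a broader search across the standard udev rule directories (nothing here is specific to this cluster) could be:

# look for anything setting read_ahead_kb in every standard udev rules directory
grep -r read_ahead_kb /etc/udev/rules.d /run/udev/rules.d /usr/lib/udev/rules.d 2>/dev/null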

saad0805 commented 10 months ago

In fact, I have grepped around in /etc, but it's nowhere to be found.

datamattsson commented 10 months ago

What does Red Hat have to say about the matter? Are they pointing at the CSI driver?

datamattsson commented 10 months ago

This is what I'm seeing on OCP 4.13 with HPE CSI Driver v2.4.0-beta:

$ cat /sys/class/block/dm-*/queue/read_ahead_kb
128

I am using Nimble in this particular case, though, which is where we set it to 128.

datamattsson commented 10 months ago

This is what appears on a Primera, same OCP and CSI driver etc.

$ cat /sys/class/block/dm-*/queue/read_ahead_kb
8160

datamattsson commented 9 months ago

I've determined we can't do anything from the CSI driver's perspective. Custom udev rules need to be created for 3PAR devices on the worker nodes.

Create the below file at /etc/udev/rules.d/99-3par-tune.rules and run udevadm control --reload-rules. Also run udevadm trigger if you have attached devices.

##
# Copyright 2023 Hewlett Packard Enterprise Development LP.
#
##

ACTION!="add|change", GOTO="3par_tuning_end"
SUBSYSTEM!="block", GOTO="3par_tuning_end"
KERNEL!="sd*|dm-*", GOTO="3par_tuning_end"
KERNEL=="dm-*", ENV{DM_UUID}!="mpath-360002ac*", GOTO="3par_tuning_end"
ENV{DEVTYPE}=="partition", GOTO="3par_tuning_end"

# Please uncomment the lines beginning with ATTR to enable these rules
# and run "udevadm control --reload-rules" and "udevadm trigger" to apply for all 3PAR devices.

# set max_sectors_kb to max_hw_sectors_kb.
#ATTR{queue/max_sectors_kb}="4096"
# set read_ahead_kb to 64
ATTR{queue/read_ahead_kb}="64"
# set nr_requests to 512.
#ATTR{queue/nr_requests}="512"
# set scheduler to noop.
#ATTR{queue/scheduler}="noop"
# disable add_random.
#ATTR{queue/add_random}="0"
# disable rotational.
#ATTR{queue/rotational}="0"
# set rq_affinity to 2.
#ATTR{queue/rq_affinity}="2"

LABEL="3par_tuning_end"

saad0805 commented 9 months ago

Thank you Michael for your help. After exchanging with HPE and Red Hat support, we finally found out that the issue was a change in how the Linux kernel calculates read_ahead_kb, introduced with RHEL 8.5 GA. So yes, the only way is to apply a custom udev rule?

What we are still not sure about is whether this fix will hold when a pod is restarted or scheduled onto another worker node. We already tried this with the TuneD operator, but the value comes back to 32M as soon as we restart a pod.

We will test this and update you.

datamattsson commented 9 months ago

I think the udev rule needs to be injected and enabled by a MachineConfig. I've not made one of these myself before, but here's the documentation on how to do it: https://docs.openshift.com/container-platform/4.12/post_installation_configuration/machine-configuration-tasks.html

Edit: The udev rule will be injected on all worker nodes and udev will intercept all 3PAR devices. Pod restarts won't affect the effective values set by udev.
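
For reference, a minimal sketch of what such a MachineConfig could look like, assuming the worker pool; the object name, the Ignition version, and the base64 placeholder are illustrative and not taken from this thread:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-3par-tune   # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/udev/rules.d/99-3par-tune.rules
          mode: 420            # 0644 in octal
          overwrite: true
          contents:
            # replace the placeholder with e.g. the output of: base64 -w0 99-3par-tune.rules
            source: data:text/plain;charset=utf-8;base64,<BASE64_OF_99-3par-tune.rules>

Applying it with oc apply -f rolls the file out through the Machine Config Operator, which typically drains and reboots the worker nodes one at a time.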