scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

`perftune.py` should allow rollback / revert the changes it made #2350

Open gdubicki opened 4 months ago

gdubicki commented 4 months ago

Installation details
Scylla version (or git commit hash): 5.4.9
Cluster size: 7 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu 22.04.4 LTS (GNU/Linux 5.15.0-1059-gke x86_64)

Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): GKE, v1.29.5-gke.1192000
Hardware: n2d-standard-32, min. CPU platform: AMD Milan
Disks (SSD/HDD, count): 8 x local SSD

We ran perftune.py on our cluster for the first time, and after the changes were applied our Scylla read and write latencies jumped (the change was applied shortly before 21:00):

[Screenshot 2024-07-13 at 10:59: read/write latency graph showing the jump after the change]

So far, the only way we have found to completely revert the changes is to restart the Scylla nodes, but that is a long and painful procedure.

I think that, especially as there are quite a lot of known issues with this tool (e.g. https://github.com/scylladb/scylladb/issues/14873, https://github.com/scylladb/scylladb/issues/10600, https://github.com/scylladb/seastar/issues/1297, https://github.com/scylladb/seastar/issues/1698, https://github.com/scylladb/seastar/issues/1008 and maybe more), perftune.py should implement a feature to revert to the defaults.

gdubicki commented 4 months ago

At a minimum, the script should save the settings before tuning, to allow a manual revert. While the sysctl and disk settings are easy to revert, the IRQ ones are hard.
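For example, a rough manual backup could be taken before running perftune.py (a sketch only; the backup location and globbed paths are illustrative, and this is not a feature perftune.py itself provides):

# Run on the host before perftune.py; snapshot the tunables it may touch.
BACKUP=/var/tmp/pre-perftune-$(date +%F-%H%M%S)
mkdir -p "$BACKUP"
for f in /proc/irq/*/smp_affinity \
         /sys/class/net/*/queues/rx-*/rps_cpus \
         /sys/class/net/*/queues/rx-*/rps_flow_cnt \
         /sys/class/net/*/queues/tx-*/xps_cpus \
         /sys/block/nvme*/queue/scheduler \
         /sys/block/nvme*/queue/nomerges; do
  [ -r "$f" ] && printf '%s %s\n' "$f" "$(cat "$f")" >> "$BACKUP/values.txt"
done
# ...plus the sysctls it changes
sysctl net.core.rps_sock_flow_entries net.core.somaxconn net.ipv4.tcp_max_syn_backlog > "$BACKUP/sysctl.txt"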

mykaul commented 4 months ago

@vladzcloudius - thoughts? @gdubicki - I think most of the issues mentioned above are benign or less relevant, but I am interested in the specifics of what was configured by perftune on your setup that caused it to slow down.

gdubicki commented 4 months ago

Thanks for the quick response, @mykaul!

Our Scylla configuration for the Scylla Operator is as follows:

datacenter: XXX
racks:
- name: YYY
  scyllaConfig: "scylla-config"
  scyllaAgentConfig: "scylla-agent-config"
  members: 7
  storage:
    storageClassName: local-raid-disks
    capacity: 2200G # this is only the initial size, the actual is 3000G now (see https://github.com/scylladb/scylla-operator/issues/402)
  agentResources:
    # requests and limits here need to be equal to make Scylla have Guaranteed QoS class
    requests:
      cpu: 150m
      memory: 768M
    limits:
      cpu: 150m
      memory: 768M
  resources:
    # requests and limits here need to be equal to make Scylla have Guaranteed QoS class
    limits:
      cpu: 31
      memory: 108Gi
    requests:
      cpu: 31
      memory: 108Gi

The output with the changes it applied on one of the nodes looked like this:

$ kubectl logs perftune-containers-89cc03d2-b076-4c41-9877-4c9e985fbd28-xnh5x -n scylla-operator-node-tuning
irqbalance is not running
No non-NVMe disks to tune
Setting NVMe disks: nvme0n1, nvme0n3, nvme0n2, nvme0n4, nvme0n6, nvme0n5, nvme0n8, nvme0n7...
Setting mask 00000001 in /proc/irq/30/smp_affinity
Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n1/queue/scheduler
Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n1/queue/nomerges
Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n3/queue/scheduler
Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n3/queue/nomerges
Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n2/queue/scheduler
Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n2/queue/nomerges
Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n4/queue/scheduler
Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n4/queue/nomerges
Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n6/queue/scheduler
Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n6/queue/nomerges
Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n5/queue/scheduler
Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n5/queue/nomerges
Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n8/queue/scheduler
Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n8/queue/nomerges
Writing 'none' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n7/queue/scheduler
Writing '2' to /sys/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n7/queue/nomerges
Setting a physical interface ens5...
Distributing all IRQs
Setting mask 00000001 in /proc/irq/68/smp_affinity
Setting mask 00000001 in /proc/irq/53/smp_affinity
Setting mask 00000001 in /proc/irq/49/smp_affinity
Setting mask 00000001 in /proc/irq/40/smp_affinity
Setting mask 00000001 in /proc/irq/43/smp_affinity
Setting mask 00000001 in /proc/irq/61/smp_affinity
Setting mask 00000001 in /proc/irq/63/smp_affinity
Setting mask 00000001 in /proc/irq/73/smp_affinity
Setting mask 00000001 in /proc/irq/89/smp_affinity
Setting mask 00000001 in /proc/irq/59/smp_affinity
Setting mask 00000001 in /proc/irq/77/smp_affinity
Setting mask 00000001 in /proc/irq/80/smp_affinity
Setting mask 00000001 in /proc/irq/44/smp_affinity
Setting mask 00000001 in /proc/irq/56/smp_affinity
Setting mask 00000001 in /proc/irq/66/smp_affinity
Setting mask 00000001 in /proc/irq/46/smp_affinity
Setting mask 00000001 in /proc/irq/93/smp_affinity
Setting mask 00000001 in /proc/irq/48/smp_affinity
Setting mask 00000001 in /proc/irq/84/smp_affinity
Setting mask 00000001 in /proc/irq/94/smp_affinity
Setting mask 00000001 in /proc/irq/60/smp_affinity
Setting mask 00000001 in /proc/irq/91/smp_affinity
Setting mask 00000001 in /proc/irq/70/smp_affinity
Setting mask 00000001 in /proc/irq/79/smp_affinity
Setting mask 00000001 in /proc/irq/83/smp_affinity
Setting mask 00000001 in /proc/irq/41/smp_affinity
Setting mask 00000001 in /proc/irq/64/smp_affinity
Setting mask 00000001 in /proc/irq/95/smp_affinity
Setting mask 00000001 in /proc/irq/65/smp_affinity
Setting mask 00000001 in /proc/irq/67/smp_affinity
Setting mask 00000001 in /proc/irq/37/smp_affinity
Setting mask 00000001 in /proc/irq/75/smp_affinity
Setting mask 00000001 in /proc/irq/74/smp_affinity
Setting mask 00000001 in /proc/irq/57/smp_affinity
Setting mask 00000001 in /proc/irq/86/smp_affinity
Setting mask 00000001 in /proc/irq/78/smp_affinity
Setting mask 00000001 in /proc/irq/45/smp_affinity
Setting mask 00000001 in /proc/irq/88/smp_affinity
Setting mask 00000001 in /proc/irq/47/smp_affinity
Setting mask 00000001 in /proc/irq/85/smp_affinity
Setting mask 00000001 in /proc/irq/42/smp_affinity
Setting mask 00000001 in /proc/irq/32/smp_affinity
Setting mask 00000001 in /proc/irq/51/smp_affinity
Setting mask 00000001 in /proc/irq/69/smp_affinity
Setting mask 00000001 in /proc/irq/71/smp_affinity
Setting mask 00000001 in /proc/irq/76/smp_affinity
Setting mask 00000001 in /proc/irq/92/smp_affinity
Setting mask 00000001 in /proc/irq/34/smp_affinity
Setting mask 00000001 in /proc/irq/81/smp_affinity
Setting mask 00000001 in /proc/irq/55/smp_affinity
Setting mask 00000001 in /proc/irq/82/smp_affinity
Setting mask 00000001 in /proc/irq/87/smp_affinity
Setting mask 00000001 in /proc/irq/54/smp_affinity
Setting mask 00000001 in /proc/irq/52/smp_affinity
Setting mask 00000001 in /proc/irq/72/smp_affinity
Setting mask 00000001 in /proc/irq/90/smp_affinity
Setting mask 00000001 in /proc/irq/31/smp_affinity
Setting mask 00000001 in /proc/irq/33/smp_affinity
Setting mask 00000001 in /proc/irq/62/smp_affinity
Setting mask 00000001 in /proc/irq/35/smp_affinity
Setting mask 00000001 in /proc/irq/39/smp_affinity
Setting mask 00000001 in /proc/irq/36/smp_affinity
Setting mask 00000001 in /proc/irq/38/smp_affinity
Setting mask 00000001 in /proc/irq/58/smp_affinity
Setting mask 00000001 in /proc/irq/50/smp_affinity
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-13/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-9/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-31/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-21/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-11/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-7/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-5/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-28/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-18/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-3/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-26/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-16/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-1/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-24/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-14/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-22/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-12/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-8/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-30/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-20/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-10/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-6/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-29/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-19/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-4/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-27/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-17/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-2/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-25/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-15/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-0/rps_cpus
Setting mask fffffffe in /sys/class/net/ens5/queues/rx-23/rps_cpus
Setting net.core.rps_sock_flow_entries to 32768
Setting limit 1024 in /sys/class/net/ens5/queues/rx-13/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-9/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-31/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-21/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-11/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-7/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-5/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-28/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-18/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-3/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-26/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-16/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-1/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-24/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-14/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-22/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-12/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-8/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-30/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-20/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-10/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-6/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-29/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-19/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-4/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-27/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-17/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-2/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-25/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-15/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-0/rps_flow_cnt
Setting limit 1024 in /sys/class/net/ens5/queues/rx-23/rps_flow_cnt
Trying to enable ntuple filtering HW offload for ens5...not supported
Setting mask 00000001 in /sys/class/net/ens5/queues/tx-6/xps_cpus
Setting mask 00010000 in /sys/class/net/ens5/queues/tx-22/xps_cpus
Setting mask 00000002 in /sys/class/net/ens5/queues/tx-12/xps_cpus
Setting mask 00020000 in /sys/class/net/ens5/queues/tx-4/xps_cpus
Setting mask 00000004 in /sys/class/net/ens5/queues/tx-30/xps_cpus
Setting mask 00040000 in /sys/class/net/ens5/queues/tx-20/xps_cpus
Setting mask 00000008 in /sys/class/net/ens5/queues/tx-10/xps_cpus
Setting mask 00080000 in /sys/class/net/ens5/queues/tx-2/xps_cpus
Setting mask 00000010 in /sys/class/net/ens5/queues/tx-29/xps_cpus
Setting mask 00100000 in /sys/class/net/ens5/queues/tx-19/xps_cpus
Setting mask 00000020 in /sys/class/net/ens5/queues/tx-0/xps_cpus
Setting mask 00200000 in /sys/class/net/ens5/queues/tx-27/xps_cpus
Setting mask 00000040 in /sys/class/net/ens5/queues/tx-17/xps_cpus
Setting mask 00400000 in /sys/class/net/ens5/queues/tx-9/xps_cpus
Setting mask 00000080 in /sys/class/net/ens5/queues/tx-25/xps_cpus
Setting mask 00800000 in /sys/class/net/ens5/queues/tx-15/xps_cpus
Setting mask 00000100 in /sys/class/net/ens5/queues/tx-7/xps_cpus
Setting mask 01000000 in /sys/class/net/ens5/queues/tx-23/xps_cpus
Setting mask 00000200 in /sys/class/net/ens5/queues/tx-13/xps_cpus
Setting mask 02000000 in /sys/class/net/ens5/queues/tx-5/xps_cpus
Setting mask 00000400 in /sys/class/net/ens5/queues/tx-31/xps_cpus
Setting mask 04000000 in /sys/class/net/ens5/queues/tx-21/xps_cpus
Setting mask 00000800 in /sys/class/net/ens5/queues/tx-11/xps_cpus
Setting mask 08000000 in /sys/class/net/ens5/queues/tx-3/xps_cpus
Setting mask 00001000 in /sys/class/net/ens5/queues/tx-1/xps_cpus
Setting mask 10000000 in /sys/class/net/ens5/queues/tx-28/xps_cpus
Setting mask 00002000 in /sys/class/net/ens5/queues/tx-18/xps_cpus
Setting mask 20000000 in /sys/class/net/ens5/queues/tx-26/xps_cpus
Setting mask 00004000 in /sys/class/net/ens5/queues/tx-16/xps_cpus
Setting mask 40000000 in /sys/class/net/ens5/queues/tx-8/xps_cpus
Setting mask 00008000 in /sys/class/net/ens5/queues/tx-24/xps_cpus
Setting mask 80000000 in /sys/class/net/ens5/queues/tx-14/xps_cpus
Writing '4096' to /proc/sys/net/core/somaxconn
Writing '4096' to /proc/sys/net/ipv4/tcp_max_syn_backlog

We managed to revert the disk and sysctl settings on all nodes, but that alone didn't help.

We reverted all the settings on some nodes by rebooting them.

Then we wanted to revert the masks on all the nodes, but we realized that we didn't know what the settings had been before, as on the rebooted nodes they are not simply set to defaults. So ultimately we reverted everything by rebooting all the nodes.
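For reference, a rough revert sketch (run on the host; the values below are common kernel defaults, not necessarily what these nodes had before perftune ran, and the interface name ens5 is taken from the log above):

# Disable RPS again (00000000 / 0 are the usual kernel defaults)
echo 0 > /proc/sys/net/core/rps_sock_flow_entries
for q in /sys/class/net/ens5/queues/rx-*; do
  echo 00000000 > "$q/rps_cpus"
  echo 0 > "$q/rps_flow_cnt"
done
# IRQ affinities: the stock default is usually "all CPUs"; alternatively start
# irqbalance and let it spread the IRQs again. xps_cpus defaults are
# driver-dependent and are not reset here.
for irq in /proc/irq/[0-9]*; do
  cat /proc/irq/default_smp_affinity > "$irq/smp_affinity" 2>/dev/null || true
done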

vladzcloudius commented 4 months ago

I think that, especially as there are quite a lot of known issues with this tool (e.g. https://github.com/scylladb/scylladb/issues/14873, https://github.com/scylladb/scylladb/issues/10600, https://github.com/scylladb/seastar/issues/1297, https://github.com/scylladb/seastar/issues/1698, https://github.com/scylladb/seastar/issues/1008 and maybe more), perftune.py should implement a feature to revert to the defaults.

The only still open GH issue out of the above that is related to perftune.py is a Documentation one.

perftune.py is supposed to be quite trustworthy - especially if you use the version from the seastar master branch.

I'm not aware of any open bug related to perftune.py at the moment.

As to your request to revert the tuning: this would require backing up the configuration of all values it tunes.

This is a nice feature when you are experimenting. However, in production you should either use perftune.py or not use it. And there is a very easy way to tell Scylla not to apply the perftune.py tweaks if you are confident that is what you want: set the following fields in /etc/default/scylla-server:

SET_NIC_AND_DISKS=no
SET_CLOCKSOURCE=no
DISABLE_WRITEBACK_CACHE=no
gdubicki commented 4 months ago

I think that, especially as there are quite a lot of known issues with this tool (e.g. scylladb/scylladb#14873, scylladb/scylladb#10600, #1297, #1698, #1008 and maybe more), perftune.py should implement a feature to revert to the defaults.

The only still open GH issue out of the above that is related to perftune.py is a Documentation one.

perftune.py is supposed to be quite trustworthy - especially if you use the version from the seastar master branch.

I'm not aware of any open bug related to perftune.py at the moment.

Respectfully, this issue is related to perftune.

It may not be very clearly visible on the screenshot, but our average write times have increased from ~500ms to ~3000ms (about 6x higher), while the 95th percentile has increased from ~2500ms to ~17500ms (about 7x higher)!

The read times have been affected as well, although less painfully.

As to your request to revert the tuning: this would require backing up the configuration of all values it tunes.

This is a nice feature when you are experimenting. However, in production you should either use perftune.py or not use it. And there is a very easy way to tell Scylla not to apply the perftune.py tweaks if you are confident that is what you want: set the following fields in /etc/default/scylla-server:

SET_NIC_AND_DISKS=no
SET_CLOCKSOURCE=no
DISABLE_WRITEBACK_CACHE=no

Well, we did use it and it broke our performance.

Then it was very hard to revert the changes, because with local SSDs on GKE the node restarts caused the Scylla nodes to fall into a restart loop. We had to trick them into thinking they were replacing themselves so they would start without bootstrapping as new nodes. That didn't work for one node, which did bootstrap, and that took more than 10 hours.

Overall we spent 3 days reverting the optimisations, so I think there is a need for a revert feature.

We would be happy to help with this by providing some PRs, but we would probably need some guidance, maybe over Slack?

gdubicki commented 4 months ago

perftune.py is supposed to be quite trustworthy - especially if you use the version from the seastar master branch.

We used the version bundled with Scylla 5.4.9.

mykaul commented 4 months ago

It may not be very clearly visible on the screenshot, but our average write times have increased from ~500ms to ~3000ms (about 6x higher), while the 95th percentile has increased from ~2500ms to ~17500ms (about 7x higher)!

This should get its own issue (in Scylla) and we can look at it there, if we understand what changes were made (which I assume is doable, since you reverted them).

gdubicki commented 4 months ago

It may not be very clearly visible on the screenshot, but our average write times have increased from ~500ms to ~3000ms (about 6x higher), while the 95th percentile has increased from ~2500ms to ~17500ms (about 7x higher)!

This should get its own issue (in Scylla) and we can look at it there, (...)

Sure, I can open an issue in https://github.com/scylladb/scylladb, if that's a more appropriate place.

(...) if we understand what changes were made (which I assume is doable, since you reverted them).

Well, from the perftune logs (above) we know what the settings were after the change, but we don't know exactly what they were before. That's the point.
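For what it's worth, the set of files that perftune touched can at least be enumerated from its log, even though the prior values are not in it (a hypothetical one-liner; the pod name is a placeholder):

kubectl logs <perftune-pod> -n scylla-operator-node-tuning | grep -Eo '(/proc|/sys)[^ ]+' | sort -u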

vladzcloudius commented 4 months ago

I think that, especially as there are quite a lot of known issues with this tool (e.g. scylladb/scylladb#14873, scylladb/scylladb#10600, #1297, #1698, #1008 and maybe more), perftune.py should implement a feature to revert to the defaults.

The only still open GH issue out of the above that is related to perftune.py is a Documentation one. perftune.py is supposed to be quite trustworthy - especially if you use the version from the seastar master branch. I'm not aware of any open bug related to perftune.py at the moment.

Respectfully, this issue is related to perftune.

It may not be very clearly visible on the screenshot, but our average write times have increased from ~500ms to ~3000ms (about 6x higher), while the 95th percentile has increased from ~2500ms to ~17500ms (about 7x higher)!

@gdubicki you need to keep in mind that you should only (!!) use perftune.py in conjunction with the corresponding Scylla CPU pinning.

Was all of that the case?

The read times have been affected as well, although less painfully.

As to your request to revert the tuning: this would require backing up the configuration of all values it tunes. This is a nice feature when you are experimenting. However, in production you should either use perftune.py or not use it. And there is a very easy way to tell Scylla not to apply the perftune.py tweaks if you are confident that is what you want: set the following fields in /etc/default/scylla-server:

SET_NIC_AND_DISKS=no
SET_CLOCKSOURCE=no
DISABLE_WRITEBACK_CACHE=no

Well, we did use it and it broke our performance.

Then it was very hard to revert the changes, because with local SSDs on GKE the node restarts caused the Scylla nodes to fall into a restart loop. We had to trick them into thinking they were replacing themselves so they would start without bootstrapping as new nodes. That didn't work for one node, which did bootstrap, and that took more than 10 hours.

Overall we spent 3 days reverting the optimisations, so I think there is a need for a revert feature.

We would be happy to help with this by providing some PRs, but we would probably need some guidance, maybe over Slack?

gdubicki commented 4 months ago

I think that, especially as there are quite a lot of known issues with this tool (e.g. scylladb/scylladb#14873, scylladb/scylladb#10600, #1297, #1698, #1008 and maybe more), perftune.py should implement a feature to revert to the defaults.

The only still open GH issue out of the above that is related to perftune.py is a Documentation one. perftune.py is supposed to be quite trustworthy - especially if you use the version from the seastar master branch. I'm not aware of any open bug related to perftune.py at the moment.

Respectfully, this issue is related to perftune. It may not be very clearly visible on the screenshot, but our average write times have increased from ~500ms to ~3000ms (about 6x higher), while the 95th percentile has increased from ~2500ms to ~17500ms (about 7x higher)!

@gdubicki you need to keep in mind that you should only (!!) use perftune.py in conjunction with the corresponding Scylla CPU pinning.

If you mean using the static CPU manager policy with Guaranteed QoS class, then we did that. See our config in this comment.

But maybe it was wrong to allocate 31 cores of a 32-core machine for Scylla? Should we leave some cores free here? 🤔

  • You should also keep in mind that perftune.py needs to run on the Hypervisor - not from the POD.

We ran perftune using the Scylla Operator (v1.13.0), so it is done in whatever way the operator does it.

  • On top of that you need to remember that you must make sure that Scylla PODs CPUs are never allowed to run on the so called IRQ CPUs - the ones perftune.py pins IRQs affinities to: in your case it was CPU0.

Was all of that the case?

I don't know, to be frank.

We have just configured Scylla as in this comment, on an n2d-standard-32 node, and enabled the performance tuning that uses perftune.

mykaul commented 4 months ago

I think you need to use 'cpuset' to ensure the pods get a static CPU assignment.

gdubicki commented 4 months ago

It might be interesting that this was logged by Scylla when starting on the restarted nodes:

[Screenshot 2024-07-16 at 17:14: Scylla startup log showing disk measurement results]

The first measurement looks about right for a GCP node with 8 local NVMe SSDs, but all the other results are very bad.

Note that this is from the restarts to disable the perftune optimizations.

...however, the values ultimately written to the config file look roughly like this on all the nodes:

root@...:/# cat /etc/scylla.d/io_properties.yaml
disks:
  - mountpoint: /var/lib/scylla
    read_iops: 721497
    read_bandwidth: 2950537984
    write_iops: 401104
    write_bandwidth: 1623555072

...except one, which has substantially lower values for writes:

root@...-hrjn:/# cat /etc/scylla.d/io_properties.yaml
disks:
  - mountpoint: /var/lib/scylla
    read_iops: 682532
    read_bandwidth: 2951263744
    write_iops: 39928
    write_bandwidth: 759449856

...but I suppose it's a measurement error.

gdubicki commented 4 months ago

I think you need to use 'cpuset' to ensure the pods get a static CPU assignment.

According to this doc, this is done automatically, and for our nodes it is currently set like this:

root@...:/# cat /etc/scylla.d/cpuset.conf
# DO NO EDIT
# This file should be automatically configure by scylla_cpuset_setup
#
# CPUSET="--cpuset 0 --smp 1"
CPUSET="--cpuset 1-31 "
mykaul commented 4 months ago

The different io_properties.yaml values are interesting. Either there is some issue, or you got a lemon. That happens :-/

gdubicki commented 4 months ago

What bugs me is whether it is right to assign 31 cores of a 32-core machine to the Scylla pod. Shouldn't I leave a bit more free for other workloads? (But note that these are dedicated nodes for Scylla; the only other workloads are other Scylla pods, Datadog and kube-system.)

mykaul commented 4 months ago

What bugs me is whether it is right to assign 31 cores of a 32-core machine to the Scylla pod. Shouldn't I leave a bit more free for other workloads? (But note that these are dedicated nodes for Scylla; the only other workloads are other Scylla pods, Datadog and kube-system.)

You are asking the wrong question - the question is how many cores you should dedicate to network IRQ handling vs. Scylla cores. That's a ratio you need to ensure is reasonable. Scylla can work on fewer cores - it's up to you how many you wish to have. With very few cores, we don't even use dedicated cores for network processing. That's what perftune does (among other things). Specifically, 31 out of 32 doesn't make sense to me. More should go to networking.

vladzcloudius commented 4 months ago

What bugs me is whether it is right to assign 31 cores of a 32-core machine to the Scylla pod.

Indeed, this doesn't look correct. Regardless of whether you have HT enabled or disabled, perftune.py was supposed to allocate 2 CPUs for IRQs.
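For illustration only (this layout is an assumption, not taken from the logs above): with 2 CPUs reserved for IRQs on a 32-vCPU node, cpuset.conf would exclude both of them, e.g.:

# Hypothetical /etc/scylla.d/cpuset.conf if CPU0 and CPU1 were the IRQ CPUs
CPUSET="--cpuset 2-31"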

May I see the content of /etc/scylla.d/perftune.yaml?

We ran perftune using the Scylla Operator (v1.13.0), so it is done in whatever way the operator does it.

The page above is a bit unclear about what needs to be run where, so allow me to reiterate:

1) perftune.py must configure the VM's resources. I'm not a K8S specialist, but usually you can't change host-level OS configuration from inside a container. Hence you should run perftune.py manually on the host VM itself.

2) If you want to achieve the maximum performance using perftune.py, the container must be pinned to the corresponding host CPUs, as mentioned by @mykaul above, and it should never be allowed to run on the "IRQ CPUs". In the configuration above I don't see where the POD is forbidden to run on CPU0 - I only see that you tell it to use 31 CPUs - but which ones will be used in this case? I'm not sure it's safe to assume it will be 1-31; I'd assume it will more likely be 0-30.

3) When running (1) on the host, pay attention to which CPUs are configured as "IRQ CPUs" - you can use the --get-irq-cpu-mask perftune.py parameter to print the corresponding CPU mask. Then make sure to pin your POD away from those CPUs and to use the corresponding value in the cpuset.conf you referenced above (see the sketch below).
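For example (illustrative only - the invocation and printed mask are not output from this cluster, and the exact output format may differ):

$ sudo perftune.py --nic ens5 --get-irq-cpu-mask
00000001

A mask of 00000001 would mean CPU0 is the IRQ CPU, so cpuset.conf should pin Scylla to the remaining CPUs, e.g. CPUSET="--cpuset 1-31".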

Let me know if there are more questions I can help you with, @gdubicki.

gdubicki commented 4 months ago

Thanks @vladzcloudius!

My I see the content of /etc/scylla.d/perftune.yaml?

I don't have a copy of it from before reverting the tuning, and now this file does not exist on my Scylla nodes.

The page above is a bit unclear what needs to be run where but allow me to re-iterate: (...)

I guess we would need to ask the Scylla Operator team whether it is done this way. cc @tnozicka