redhat-performance / tuned

Tuning Profile Delivery Mechanism for Linux
GNU General Public License v2.0
755 stars 171 forks source link

Tuning of individual interrupts #580

Closed adriaan42 closed 1 month ago

adriaan42 commented 7 months ago

To optimize latencies in realtime applications, we have a need to direct certain interrupts to certain cpus running our realtime processes. In addition, the cpu allocation happens dynamically (e.g., applications deployed as containers).

There's two parts to this:

This PoC

Example, on a 4-core machine, with an irq section in the profile:

[irq]
affinity=0

And then dynamically changing the affinity of irq16:

# busctl call com.redhat.tuned /Tuned com.redhat.tuned.control instance_create ssa{ss} irq myinstance 1 affinity 2
bs true "OK"
# busctl call com.redhat.tuned /Tuned com.redhat.tuned.control instance_acquire_devices ss irq16 myinstance
(bs) true "OK"

And in the log:

2023-12-08 15:16:05,638 INFO     tuned.daemon.controller: Created dynamic instance 'myinstance' of plugin 'irq'
2023-12-08 15:16:29,263 INFO     tuned.daemon.controller: Moving devices '{'irq16'}' from instance 'irq' to instance 'myinstance'.
2023-12-08 15:16:29,263 INFO     tuned.plugins.plugin_irq: _instance_unapply_dynamic(irq, irq16)
2023-12-08 15:16:29,264 INFO     tuned.plugins.plugin_irq: Setting SMP affinity of IRQ 16 to '0000000f'
2023-12-08 15:16:29,264 INFO     tuned.plugins.hotplug: instance myinstance: adding new device irq16
2023-12-08 15:16:29,265 INFO     tuned.plugins.plugin_irq: _instance_apply_dynamic(myinstance, irq16)
2023-12-08 15:16:29,265 INFO     tuned.plugins.plugin_irq: Setting SMP affinity of IRQ 16 to '00000004'

Open questions

adriaan42 commented 6 months ago

@yarda Any thoughts on this? Do you think a feature like this could be accepted?

yarda commented 5 months ago

/packit build

yarda commented 5 months ago

Hi @adriaan42 this is interesting RFE.

Open questions

* What happens to the irq affinity feature of the `scheduler` plugin? For backwards-compatibility I guess we'll need to keep the feature and a switch to be able to tell `scheduler` to not touch interrupt affinities in case we want to use the new plugin?

We are already trying to address this https://issues.redhat.com/browse/RHEL-21923

* Can we move devices between plugin instances without running "unapply"? In my example above the interrupt affinity briefly gets restored to 0xf before the new desired value is applied...

This is worth investigation and I was already thinking about it, because the "unapply" can be costly and can cause various interference. Generally, we need it, because one instance can change things which the other instance doesn't touch and wouldn't be then rolled back. Maybe we could handle it by more intelligent e.g. "switch" method that would combine "apply" and "unapply". I.e. it could compare the options and roll back only those which are not handled by the second instance. For those which are handled by both instances the second instance could just take the saved previous values from the first instance.

* What happens when TuneD is restarted? Currently, any dynamic instances would be lost. But currently, also any device assignments that have been made using `instance_aquire_devices` are lost, right?

Yes, I think it's OK. Usually it maybe useful to start-over with the clear stock profile. If it is required to preserve the state, I think we could add API call to e.g. dump the profile into /run/tuned and make the /run/tuned preference over /etc/tuned.

yarda commented 5 months ago

Please fix centos-7 CI failures. We are going to drop python 2.7 support once RHEL-7 is retired.

...
Compiling /builddir/build/BUILDROOT/tuned-2.21.0-1.20240118103851136139.pr580.33.ga7d8364.el7.x86_64/usr/lib/python2.7/site-packages/tuned/plugins/plugin_irq.py ...
  File "/usr/lib/python2.7/site-packages/tuned/plugins/plugin_irq.py", line 33
    log.info(f"_instance_init({instance.name})")
                                              ^
SyntaxError: invalid syntax
...
yarda commented 5 months ago

For better backward compatibility I would drop 7799dd2b4bafcbc23f677145ef5f9266ec5657df and rely on fix from the https://issues.redhat.com/browse/RHEL-21923 and would just change the TuneD profiles where needed.

yarda commented 5 months ago

Maybe the instance_create/destroy could be more generic, i.e.:

Or maybe for simple PoC just allow destroying instances with no devices and do not distinguish between statically and dynamically created instances, i.e. if there are no devices it can be destroyed.

adriaan42 commented 5 months ago

PR updates

Some open points

And one more thing

The goal of this is to be able to tune a device (usually a NIC) to be used exclusively and with low latency by a realtime application. The IRQ affinity is just one part of that. There is also:

This maybe goes beyond the scope if an "irq plugin", but logically it could fit here. But I haven't thought it through, especially the conflicts with the scheduler plugin. Maybe you have an opinion/idea?

zacikpa commented 5 months ago

So far I tested the new irq_process option in the scheduler plugin and it seems to work nice. When doing the final rebase/squash, you should probably add Resolves: RHEL-21923 into the commit message.

I have removed the _created_dynamically flag and respective checks, so also non-dynamically-created instances can be deleted. Should there be rules here? E.g., what happens if all instances of a plugin are deleted?

I can't think of a reason why deleting the last plugin instance should be disallowed. Technically, we could also permit creating instances of inactive plugins via the create_instance command (though I'm not sure whether that has a real-life use case).

E.g., if we say when an instance with assigned devices is deleted, we just automatically hotplug the devices to other instances -> What happens if there is more than one suitable instance?

IIUC the hotplug implementation currently assigns new devices to suitable plugin instances based on their priority option, as sorted here, so maybe that code could be re-used (and possibly overrided in the future if we want custom behavior in specific plugins). Although to be able to set the priority of newly created instances, the code would need to be refactored a bit.

yarda commented 5 months ago

Fedora rawhide CI failure is caused by regression in the kernel: https://bugzilla.redhat.com/show_bug.cgi?id=2262526

yarda commented 5 months ago

@adriaan42 please fix the CodeQL found problems.

yarda commented 5 months ago

IMHO: dee5d6f63aab46427ade7415babf3c5ef1488c52 - it's OK for merge now aad02c76e152bed9aea7a34846a013278db973f8 - tighten the policy (per inline comment) and it's probably OK 17b528c0d2065ff902d6ceb00407569c2a8f2c05 - this still needs some work, see comments

yarda commented 5 months ago

Some open points

  • Deleting instances: I have removed the _created_dynamically flag and respective checks, so also non-dynamically-created instances can be deleted. Should there be rules here? E.g., what happens if all instances of a plugin are deleted?

As @zacikpa wrote I don't think we need any limitation here. Currently, at runtime I think you won't be able to achieve it, because through the API you can only move devices between instances, thus there will be always at least one instance with some devices which cannot be removed. If you startup with empty instances then you will be able to delete all instances, but I think it shouldn't cause any problem.

  • How do we want to handle (physical) hotplugging of devices, which leads to adding/removal of IRQs in the system? Do we need to periodically scan /proc/irq and adapt the devices known to the plugin?

This is usually done by udev. It seems udev can parse the irqs:

udevadm info /sys/kernel/irq/59
P: /kernel/irq/59
M: 59

But I don't know whether it support hotplug for irqs. If yes we are probably done just by switching to udev enumeration. If not, we probably cannot support hotplug of irqs without periodic rescans and in such case I wouldn't cope with it for now.

  • Moving devices between plugin instances without running "unapply": That is certainly interesting, and I have some ideas, but I think that's for a future PR.

OK, I agree.

  • Preserving state across daemon restarts: In general, I think a daemon restart (without changing the profile) should not have an impact on current tuning, so instances and attached devices should be persistent. But this is a bigger topic, and won't be addressed in this PR.

I agree.

And one more thing

The goal of this is to be able to tune a device (usually a NIC) to be used exclusively and with low latency by a realtime application. The IRQ affinity is just one part of that. There is also:

  • IRQ threads (kernel threads irq/*). The kernel automatically migrates them when the IRQ affinity is changed, so we don't need to touch their affinities manually, and in case we need to set their priority I believe we can do that with the scheduler plugin.

  • NAPI threads (kernel threads napi/*, created for a NIC, depending on the setting in /sys/class/net/%iface%/threaded). Those are created SCHED_OTHER, and we can use the scheduler plugin to give them a suitable policy and priority. Their default affinity is "all cores", and currently we could use the scheduler plugin to set a static affinity for them. But we currently can't dynamically set their affinity.

  • Maybe other/future kernel threads that we'd also like to migrate dynamically.

This maybe goes beyond the scope if an "irq plugin", but logically it could fit here. But I haven't thought it through, especially the conflicts with the scheduler plugin. Maybe you have an opinion/idea?

I think we can duplicate the functionality in plugins if necessary and cover the limitations in documentation. Otherwise I don't have more ideas at the moment.

yarda commented 5 months ago

IMHO: dee5d6f - it's OK for merge now aad02c7 - tighten the policy (per inline comment) and it's probably OK 17b528c - this still needs some work, see comments

Maybe could you split the PR? The first commit is ready, the second almost ready and we would like to have them in the upcoming release which will happen very soon. The last commit (irq plugin) may take some more time.

adriaan42 commented 5 months ago

IMHO: dee5d6f - it's OK for merge now aad02c7 - tighten the policy (per inline comment) and it's probably OK 17b528c - this still needs some work, see comments

Maybe could you split the PR? The first commit is ready, the second almost ready and we would like to have them in the upcoming release which will happen very soon. The last commit (irq plugin) may take some more time.

Yes, the commits are pretty independent and I'm happy to split the PR. I agree that the first one is ready. For the second one, I'm not sure: this currently does not consider hotplugging, and with hotplugging we might need additional parameters for the instance_create call, like a devices_udev_regex and an instance priority. It looks like the priority can be passed as part of the options, but I'm not sure about the udev regex. That seems to be a separate argument to the instance creation.

I'll open independent PRs and we can discuss there...

yarda commented 5 months ago

IMHO: dee5d6f - it's OK for merge now aad02c7 - tighten the policy (per inline comment) and it's probably OK 17b528c - this still needs some work, see comments

Maybe could you split the PR? The first commit is ready, the second almost ready and we would like to have them in the upcoming release which will happen very soon. The last commit (irq plugin) may take some more time.

Yes, the commits are pretty independent and I'm happy to split the PR. I agree that the first one is ready. For the second one, I'm not sure: this currently does not consider hotplugging, and with hotplugging we might need additional parameters for the instance_create call, like a devices_udev_regex and an instance priority. It looks like the priority can be passed as part of the options, but I'm not sure about the udev regex. That seems to be a separate argument to the instance creation.

I'll open independent PRs and we can discuss there...

Thanks. Regarding the second PR I thought it's good enough for the basic functionality and could be improved later. But if you want to do it right +1 from me.

adriaan42 commented 5 months ago

Thanks. Regarding the second PR I thought it's good enough for the basic functionality and could be improved later. But if you want to do it right +1 from me.

I was worried that we might need to change the dbus interface later to support hotplugging. But I had another look, and I believe we could still pass all required argument in options via dbus, and can then extract them just like the Unit constructor (https://github.com/redhat-performance/tuned/blob/master/tuned/profiles/unit.py#L12).

So if it's ok for you, we can already use it and refine later. I opened #596, with the updated policies, and a NOTE in the code about the current shortcomings.

yarda commented 5 months ago

Thanks. Regarding the second PR I thought it's good enough for the basic functionality and could be improved later. But if you want to do it right +1 from me.

I was worried that we might need to change the dbus interface later to support hotplugging. But I had another look, and I believe we could still pass all required argument in options via dbus, and can then extract them just like the Unit constructor (https://github.com/redhat-performance/tuned/blob/master/tuned/profiles/unit.py#L12).

So if it's ok for you, we can already use it and refine later. I opened #596, with the updated policies, and a NOTE in the code about the current shortcomings.

Thanks for the split. Currently, only the first commit is important for the upcoming release. The second commit is nice to have and I think it can wait, so I will keep it open.

yarda commented 5 months ago
  • How do we want to handle (physical) hotplugging of devices, which leads to adding/removal of IRQs in the system? Do we need to periodically scan /proc/irq and adapt the devices known to the plugin?

This is usually done by udev. It seems udev can parse the irqs:

udevadm info /sys/kernel/irq/59
P: /kernel/irq/59
M: 59

But I don't know whether it support hotplug for irqs. If yes we are probably done just by switching to udev enumeration. If not, we probably cannot support hotplug of irqs without periodic rescans and in such case I wouldn't cope with it for now.

Hmm, I am afraid this will not work with udev, I tried simple reproducer:


import pyudev

udev_context = pyudev.Context() for d in udev_context.list_devices(): print(d)


Unfortunately, it seems it doesn't know the IRQs.
adriaan42 commented 3 months ago

Another PR update:

I'll try to summarize the previous discussion and open topics:

yarda commented 1 month ago

Sorry for the delay, but hopefully it's not urgent, although it's very nice RFE.

* Determine the "first instance": The way as implemented in the `cpu` plugin does not work (basically the same issue I mentioned in [cpu: add disable_thread_siblings option #366 (comment)](https://github.com/redhat-performance/tuned/pull/366#issuecomment-1738633828)):

  * I'd like to be able to have a configuration like this:
    ```ini
    [irq_default]
    type=irq
    affinity=0
    default_affinity=0
    devices=*
    [irq_special]
    type=irq
    priority=-1
    affinity=2
    devices=irq42
    ```
  * So there is a default instance that should handle global/static tuning and all devices not matching any other instance, and then dedicated instance for exceptions.

You can define the priority also with the order change, e.g.:

[irq_special]
type=irq
affinity=2
devices=irq42

[irq_default]
type=irq
affinity=0
default_affinity=0
devices=*
  * The current implementation in the CPU plugin (which is the only other plugin with a "first instance") always takes the instance with the highest priority (lowest value) as "first instance", but by doing that, this instance is started first, and automatically claims all free devices.
  * In my current proposal here, I determine the "first instance" by name (`instance._is_main_instance = instance.name == instance.plugin.name`), but I'm open to other suggestions, e.g. an explicit option (`first_instance=True`)...

Do we need the first instance here? Although your approach with the first instance being the instance with the plugin name works, it's a new TuneD paradigm. What about using special device name, e.g. "DEFAULT", then:

[irq_default]
type=irq
affinity=0
devices=DEFAULT

I.e. handle the default SMP affinity as a special IRQ, then it could be matched by the glob '*', so you could write:

[irq_default]
type=irq
affinity=0

To cover all remaining IRQs including the default setting (implicit devices=*). In case you don't want the default SMP, which I think you usually want:

[irq_nondefault]
type=irq
affinity=0
devices=!DEFAULT

Then I think you could also use the calc parameter in the affinity which could simplify things. What do you think?

adriaan42 commented 1 month ago

Do we need the first instance here? Although your approach with the first instance being the instance with the plugin name works, it's a new TuneD paradigm. What about using special device name, e.g. "DEFAULT", then:

Good idea! Yes, that works and actually simplifies the code.

Then I think you could also use the calc parameter in the affinity which could simplify things. What do you think?

I'm not entirely sure how to handle the special values ignore/calc present in the scheduler plugin:

yarda commented 1 month ago
* We no longer need `ignore`. An empty affinity list means "don't touch".

OK.

* `calc` means "derive affinity from `isolated_cores`". We don't have that variable in the new plugin, and the user now needs to calculate the desired set explicitly. I think that's ok, and there already are functions available to do this (e.g. `cpulist_invert`).

OK, it's always good to use what's currently available.

* `calc` also means "intersect the desired affinity with the current one". I assume that's so TuneD does not undo affinity settings made by someone else? Is that still needed? If so, then we could introduce a parameter `mode` which can be `set`, to always apply the affinity value (default), or `intersect` to calculate the new value based on the current one.

I think mode is OK, I think it's the most flexible approach.

  ```ini
  [scheduler]
  isolated_cores=${isolcpus}
  default_irq_smp_affinity=calc
  ```
  would then become
  ```ini
  [irq]
  affinity=${f:cpulist_invert:${isolcpus}}
  devices=!default
  [irqdefault]
  affinity=${f:cpulist_invert:${isolcpus}}
  devices=default
  mode=intersect
  ```

I think this can be even simplified to:

[irqdefault]
affinity=${f:cpulist_invert:${isolcpus}}
devices=default
mode=intersect
[irq]
affinity=${f:cpulist_invert:${isolcpus}}

Maybe I would use uppercase name i.e. DEFAULT, as it is some special (virtual) device. IIRC some special values were previously defined in uppercase in TuneD and IMHO there is also lower probability of the name conflict with some existing device name. But if you like the lowercase variant more, I am not against.

adriaan42 commented 1 month ago

Pushed updated implementation:

yarda commented 1 month ago

Thanks, LGTM.