openshift / cluster-node-tuning-operator

Manage node-level tuning by orchestrating the tuned daemon.
Apache License 2.0

CNF-11099: set intel_pstate driver to automatic as default #950

Closed sabbir-47 closed 7 months ago

sabbir-47 commented 9 months ago

What?

Set the CPUFreq driver mode based on hardware generation: intel_pstate=active for Ice Lake and newer processors, and intel_pstate disabled for older processor generations.

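Why?

- For FlexRAN (and FlexRAN-like applications), the hardware vendor (I//) recommends using the intel_pstate CPUFreq driver in active mode with HWP enabled on Ice Lake and later generations. The majority of Telco RAN DU deployments will be on Ice Lake or newer generation hardware.
- We have one RAN customer who wants intel_pstate to be active by default. [RFE link](https://issues.redhat.com/browse/RFE-4138)

How?

We introduced a function in [tuned](https://github.com/redhat-performance/tuned/blob/master/tuned/profiles/functions/function_intel_recommended_pstate.py) that returns the appropriate CPUFreq driver mode based on hardware generation, and we invoke it in the custom tuned profile for OpenShift in the NTO. We keep intel_pstate active for the HardwareTuning case, because users may want to tune CPU frequencies on older-generation processors.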

[variables]

automatic_pstate=${f:intel_recommended_pstate}
.........
.........

{{if .PerPodPowerManagement}}
cmdline_pstate=+intel_pstate=passive
{{else if .HardwareTuning}}
cmdline_pstate=+intel_pstate=active
{{else}}
cmdline_pstate=+intel_pstate=${automatic_pstate}
{{end}}

This updates assets/performanceprofile/tuned/openshift-node-performance so that the profile is rendered with the appropriate intel_pstate value.
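As an illustration of how the three branches above resolve, here is a minimal Go sketch that renders the same conditional with text/template. The `profileData` struct, the `renderPstate` helper, and the string substitution that stands in for TuneD expanding `${automatic_pstate}` are all hypothetical; they are not taken from the operator's code.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// profileData carries the two template fields used in the snippet above.
// The struct itself is a hypothetical stand-in for the operator's data.
type profileData struct {
	PerPodPowerManagement bool
	HardwareTuning        bool
}

// pstateTemplate reproduces the conditional from the PR description.
const pstateTemplate = `{{if .PerPodPowerManagement}}cmdline_pstate=+intel_pstate=passive
{{else if .HardwareTuning}}cmdline_pstate=+intel_pstate=active
{{else}}cmdline_pstate=+intel_pstate=${automatic_pstate}
{{end}}`

// renderPstate renders the template, then replaces the TuneD variable with
// automaticPstate. The replacement only imitates what TuneD would do at
// runtime when it evaluates ${f:intel_recommended_pstate}.
func renderPstate(d profileData, automaticPstate string) (string, error) {
	tmpl, err := template.New("pstate").Parse(pstateTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, d); err != nil {
		return "", err
	}
	out := strings.ReplaceAll(sb.String(), "${automatic_pstate}", automaticPstate)
	return strings.TrimSpace(out), nil
}

func main() {
	cases := []profileData{
		{PerPodPowerManagement: true},
		{HardwareTuning: true},
		{}, // neither hint set: the hardware-dependent default applies
	}
	for _, c := range cases {
		line, err := renderPstate(c, "disable")
		if err != nil {
			panic(err)
		}
		fmt.Printf("%+v -> %s\n", c, line)
	}
}
```

Only the last case depends on the hardware: the first two branches pin the mode regardless of what the tuned function would return.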

Performance impact on the system

We internally ran KPI tests (oslat, cyclictest, CPU utilization, and RFC 2544) to determine whether activating intel_pstate on Ice Lake and Sapphire Rapids servers causes any performance variance. We found no indication of performance degradation.

Can it be overridden by user?

Users can always override this kernel argument with a Tuned object, for example to disable intel_pstate:

apiVersion: tuned.openshift.io/v1
kind: Tuned
........
spec:
    profile:
    - data: |
        ....
        [bootloader]
        cmdline_pstate=intel_pstate=disable

/cc @MarSik @yanirq @jmencak @bartwensley

openshift-ci-robot commented 9 months ago

@sabbir-47: This pull request references CNF-11099 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/950): > Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci[bot] commented 9 months ago

Hi @sabbir-47. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-ci-robot commented 9 months ago

@sabbir-47: This pull request references CNF-11099 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/950):

>## What?
>Set intel_pstate=active as default to enable CPUFreq driver instead of enabling acpi-cpufreq driver.
>
>## Why?
>
>- For FlexRAN (and FlexRAN-like applications), the hardware vendor(I//) recommends to use the intel_pstate CPUFreq driver in active mode with HWP enabled on Ice Lake and later generations. The majority of Telco RAN DU deployments will be on the Ice Lake or newer generation hardware.
>- We have One RAN customer (E//) who wants the intel pstate to be active as default. [RFE link](https://issues.redhat.com/browse/RFE-4138)
>
>## How?
>We don't update any provided tuneD profile. But we modify the custom tuned profile for Openshift in the NTO operator. So irrespective to `realTime=true` or `realTime=false`, the intel pstate will be set to active, whereas previously it would disable the intel_pstate. So the new logic would be:
>
>```
>{{if .PerPodPowerManagement}}
>cmdline_pstate=+intel_pstate=passive
>{{else}}
>cmdline_pstate=+intel_pstate=active
>```
>It will update the `assets/performanceprofile/tuned/openshift-node-performance` and render the profile with intel_pstate=active
>
>## Can it be overridden by user?
>User can always override this kernel configuration with tuned, for example if they want to disable intel_pstate:
>
>```
>apiVersion: tuned.openshift.io/v1
>kind: Tuned
>........
>spec:
>    profile:
>    - data: |
>        ....
>        [bootloader]
>        cmdline_pstate=intel_pstate=disable
>
>```
>
>/cc @MarSik @yanirq @jmencak @bartwensley

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
sabbir-47 commented 9 months ago

/assign @MarSik @yanirq @jmencak @bartwensley

MarSik commented 9 months ago

I am a bit worried about the assumptions here, because it is a backwards incompatible change.

> The majority of Telco RAN DU deployments will be on the Ice Lake or newer generation hardware.

I still see many reports and labs where older generations are in use.

Also I seem to remember we disabled p-states and c-states for the most sensitive real time cases in the past as the power management features were introducing latency. Is that fixed on Ice Lake and newer cpus?

Switching the knob is fine, but let's make sure all the pieces fit together for all cases and old deployments we support here.

Maybe we should just use the cpu matching tuned provides and limit this to new enough cpus? I wonder if https://github.com/redhat-performance/tuned/commit/8d9cd00387426ed0cf220f920be4f9e185f61e12 lets us do what we need or some other functionality should be introduced.

sabbir-47 commented 9 months ago

> I am a bit worried about the assumptions here, because it is a backwards incompatible change.
>
> > The majority of Telco RAN DU deployments will be on the Ice Lake or newer generation hardware.
>
> I still see many reports and labs where older generations are in use.

Yes, we still use older servers in many places.

> Also I seem to remember we disabled p-states and c-states for the most sensitive real time cases in the past as the power management features were introducing latency. Is that fixed on Ice Lake and newer cpus?

According to @joemario:

> It was common practice to disable pstate for latency-sensitive applications, since it introduced jitter around RHEL-7.4. In recent years, the jitter issues in the intel_pstate driver have been fixed, and good latency results have been observed with it enabled.

> Switching the knob is fine, but let's make sure all the pieces fit together for all cases and old deployments we support here.
>
> Maybe we should just use the cpu matching tuned provides and limit this to new enough cpus? I wonder if redhat-performance/tuned@8d9cd00 lets us do what we need or some other functionality should be introduced.

Do we have a way to determine the processor generation from /proc/cpuinfo, lscpu, or elsewhere? My understanding is that names such as Cascade Lake, Ice Lake, and Sapphire Rapids are marketing terms. Please let me know if there is a way to distinguish between processor generations. Adding @bartwensley for more information.
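To make the question concrete, one possible approach is sketched below in Go: read the `vendor_id`, `cpu family`, and `model` fields from the first processor block of /proc/cpuinfo and compare the model number against a fixed list. The `recommendedPstate` helper and the `activeModels` list are hypothetical, and treating exactly these family 6 models (numbers as listed in the Linux kernel's intel-family.h header) as "Ice Lake or newer" is an assumption of this sketch. It does not claim to mirror how the tuned function decides.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// activeModels lists family 6 model numbers assumed to mean "Ice Lake or
// newer" for this sketch. The list is illustrative and would need to be
// maintained as new processors appear.
var activeModels = map[int]bool{
	0x6A: true, // Ice Lake server
	0x6C: true, // Ice Lake D
	0x8F: true, // Sapphire Rapids
}

// recommendedPstate inspects the first processor block of a /proc/cpuinfo
// dump and returns "active" for a listed Intel model, otherwise "disable".
func recommendedPstate(cpuinfo string) string {
	vendor := ""
	family, model := -1, -1
	sc := bufio.NewScanner(strings.NewReader(cpuinfo))
	for sc.Scan() {
		line := sc.Text()
		if strings.TrimSpace(line) == "" {
			break // a blank line ends the first processor block
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key, val = strings.TrimSpace(key), strings.TrimSpace(val)
		switch key {
		case "vendor_id":
			vendor = val
		case "cpu family":
			if n, err := strconv.Atoi(val); err == nil {
				family = n
			}
		case "model": // exact match, so "model name" is not picked up here
			if n, err := strconv.Atoi(val); err == nil {
				model = n
			}
		}
	}
	if vendor == "GenuineIntel" && family == 6 && activeModels[model] {
		return "active"
	}
	return "disable"
}

func main() {
	// A trimmed, made-up cpuinfo excerpt; a real caller could pass the
	// contents of /proc/cpuinfo instead.
	sample := "processor\t: 0\nvendor_id\t: GenuineIntel\ncpu family\t: 6\nmodel\t\t: 143\n"
	fmt.Println(recommendedPstate(sample))
}
```

The hard-coded list is the weak point of this approach: every new processor generation needs a new entry.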

bartwensley commented 9 months ago

Good questions @MarSik - I'll try to summarize:

> Also I seem to remember we disabled p-states and c-states for the most sensitive real time cases in the past as the power management features were introducing latency. Is that fixed on Ice Lake and newer cpus?

The latency issues have been fixed on Intel processors of the Ice Lake and newer generations. The recommendation from Intel is to use intel_pstate=active on these generations.

> Maybe we should just use the cpu matching tuned provides and limit this to new enough cpus? I wonder if redhat-performance/tuned@8d9cd00 lets us do what we need or some other functionality should be introduced.

My understanding is that there is no reliable way of determining the processor generation at runtime, so I do not believe this is an option. The above change only looks at /proc/cpuinfo, which contains only the model number, so matching all current and future Intel processor model names would require an unmaintainable regex.

The proposed change would only be an issue for a user that:

  1. Has a Cascade Lake (or older) processor.
  2. Is applying the openshift-node-performance performance profile.
  3. Is running a latency-sensitive application.

In that case, the user would be required to override the intel_pstate to set it to disabled as described in the PR description above. This would have to be done as part of new deployments or when upgrading to the version of the NTO that contains this change. We would want to add a release note for this.

The alternative is to withdraw this PR and require all users running latency sensitive applications on Ice Lake (or newer) processors to override the intel_pstate to set it to active.

Given that the Ice Lake generation was launched in 2019 and that most users will be using these processors to run low latency applications, I would prefer to make this change to the default intel_pstate configuration, but I agree this could be a backwards incompatible change in some cases and would require the user to make a configuration change.

We need consensus here before making this change.

/cc @browsell

MarSik commented 7 months ago

/ok-to-test

openshift-ci-robot commented 7 months ago

@sabbir-47: This pull request references CNF-11099 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/950):

>## What?
>Set CPUFreq driver mode based on hardware generation which will set intel_pstate=active for IceLake and newer processors while it will disable the pstate for older generation of processors.
>
>## Why?
>
>- For FlexRAN (and FlexRAN-like applications), the hardware vendor(I//) recommends to use the intel_pstate CPUFreq driver in active mode with HWP enabled on Ice Lake and later generations. The majority of Telco RAN DU deployments will be on the Ice Lake or newer generation hardware.
>- We have One RAN customer who wants the intel pstate to be active as default. [RFE link](https://issues.redhat.com/browse/RFE-4138)
>
>## How?
>We introduced a function in [tuned](https://github.com/redhat-performance/tuned/blob/master/tuned/profiles/functions/function_intel_recommended_pstate.py) which returns appropriate CPUFreq driver mode based on hardware generation. Then we invoke the function in the custom tuned profile for Openshift in the NTO operator. We still keep pstate to active for HardwareTuning case,because user may want to tune cpu frequencies in older generation processors.
>
>```
>[variables]
>
>automatic_pstate=${f:intel_recommended_pstate}
>.........
>.........
>
>{{if .PerPodPowerManagement}}
>cmdline_pstate=+intel_pstate=passive
>{{else if .HardwareTuning}}
>cmdline_pstate=+intel_pstate=active
>{{else}}
>cmdline_pstate=+intel_pstate=${automatic_pstate}
>{{end}}
>```
>It will update the `assets/performanceprofile/tuned/openshift-node-performance` and render the profile with appropriate intel_pstate
>
>## Performance impact on the system
>We internally ran KPI tests, i.e. oslat, cyclicTest, cpu utilization and RFC2544 to identify if activating pstate in IceLake and Sapphire Rapids processor servers cause any performance variance. We found no indication of performance degradation.
>
>- IceLake: [KPI test Result](http://ocp-far-edge-vran-deployment-kpi.hosts.prod.psi.rdu2.redhat.com/backend/api/v1/reports/file/html/5a6d9b92-141b-4e78-97f8-53c6fc9b7c65)
>- Sapphire Rapids: [KPI test Result](http://ocp-far-edge-vran-deployment-kpi.hosts.prod.psi.rdu2.redhat.com/backend/api/v1/reports/file/html/c8ad5389-6772-4d4a-9860-3d786fa1ee5b)
>
>## Can it be overridden by user?
>User can always override this kernel configuration with tuned, for example if they want to disable intel_pstate:
>
>```
>apiVersion: tuned.openshift.io/v1
>kind: Tuned
>........
>spec:
>    profile:
>    - data: |
>        ....
>        [bootloader]
>        cmdline_pstate=intel_pstate=disable
>
>```
>
>/cc @MarSik @yanirq @jmencak @bartwensley

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
sabbir-47 commented 7 months ago

/retest

MarSik commented 7 months ago

@sabbir-47 The rendering (e2e-no-cluster) tests are deterministic (offline), retest is not likely to help there.

sabbir-47 commented 7 months ago

@MarSik Actually, I fixed that test with my last commit, and it seems to pass now! I needed to rebase to pick up newer changes.

sabbir-47 commented 7 months ago

/test e2e-gcp-pao

joemario commented 7 months ago

If this change is important to OCP for newer Intel cpus, would it make sense to make the change in TuneD? Non-OCP environments would be able to benefit as well if it were in TuneD.

sabbir-47 commented 7 months ago

/test e2e-gcp-pao-workloadhints

sabbir-47 commented 7 months ago

/retest

sabbir-47 commented 7 months ago

/hold needs TuneD FDP 24.C release first

sabbir-47 commented 7 months ago

/unhold

sabbir-47 commented 7 months ago

@MarSik Should we run the tests once again just to be sure? wdyt?

sabbir-47 commented 7 months ago

/retest

MarSik commented 7 months ago

/lgtm

sabbir-47 commented 7 months ago

@MarSik @yanirq @jmencak @bartwensley can the PR get the approval since the tuned 24.C FDP release happened 2 days back?

jmencak commented 7 months ago

/approve

openshift-ci-robot commented 7 months ago

/retest-required

Remaining retests: 0 against base HEAD dd2698c43517191d17dc5c4ac842e6d4dc3aab32 and 2 for PR HEAD 00f59db17d0de8351198d2b1eb91f50b95848986 in total

openshift-ci[bot] commented 7 months ago

@sabbir-47: all tests passed!

sabbir-47 commented 7 months ago

@MarSik @yanirq @jmencak @bartwensley all tests passed. Kindly consider reviewing it.

openshift-ci[bot] commented 7 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak, MarSik, sabbir-47

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/cluster-node-tuning-operator/blob/master/OWNERS)~~ [MarSik,jmencak]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.