kubewharf / katalyst-core

Katalyst aims to provide a universal solution to help improve resource utilization and optimize overall costs in the cloud. This repository holds the core components of the Katalyst system, including multiple agents and centralized components.
Apache License 2.0

feat(mbm): memory bandwidth management #628

Open h-w-chen opened 2 weeks ago

h-w-chen commented 2 weeks ago

What type of PR is this?

Feature: full mbm (memory bandwidth management) functionality

What this PR does / why we need it:

This PR includes 2 major parts:

  1. Memory bandwidth metrics collection: periodically collects NUMA/package-level memory bandwidth and latency data and saves it to the metric store for other components to use.
  2. Memory bandwidth adjustment: adjusts the memory bandwidth quotas of the NUMA nodes in a physical package, based on a specified threshold value, so that workloads' bandwidth is not impacted by noisy neighbours (those that consume too much bandwidth).
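
As a rough illustration of the adjustment idea in part 2, here is a minimal sketch. It is not the code in this PR: `throttlePlan`, `planQuotas`, and the "cap at an equal share of the threshold" policy are all hypothetical, chosen only to make the noisy-neighbour idea concrete. The actual policy lives in the mbw lib and the qrm plugin.

```go
package main

import "fmt"

// throttlePlan is a hypothetical result type: NUMA node ID -> new quota in MB/s.
type throttlePlan map[int]float64

// planQuotas is an illustrative policy, not the PR's implementation.
// If the summed bandwidth of all NUMA nodes in one package exceeds the
// threshold, every node using more than an equal share of the threshold
// is capped at that share. Nodes at or below the share are left alone.
func planQuotas(usageMBps map[int]float64, thresholdMBps float64) throttlePlan {
	plan := throttlePlan{}
	if len(usageMBps) == 0 {
		return plan
	}

	total := 0.0
	for _, u := range usageMBps {
		total += u
	}
	if total <= thresholdMBps {
		return plan // package is under the threshold; nothing to adjust
	}

	fairShare := thresholdMBps / float64(len(usageMBps))
	for node, u := range usageMBps {
		if u > fairShare {
			plan[node] = fairShare // noisy neighbour: cap its quota
		}
	}
	return plan
}

func main() {
	// Made-up per-NUMA usage for one package, in MB/s.
	usage := map[int]float64{0: 9000, 1: 3000, 2: 2500, 3: 1500}
	fmt.Println(planQuotas(usage, 14000))
}
```

With the made-up numbers above, only node 0 exceeds the equal share (14000 / 4 = 3500 MB/s), so it is the only node that receives a new quota.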
Related Startup Args

By default this metric provisioner is disabled. To enable it, provide a startup arg like the one below:

--metric-provisioners="...,mbw,..."

The refresh interval for these metrics (e.g. 1 second) is set with the following startup arg (the default of 5 seconds is typically not the desired value for these metrics):

--metric-provisioner-intervals mbw=1s
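
For readers unfamiliar with the `name=duration` form, the sketch below shows one way such a value could be turned into per-provisioner intervals. `parseIntervals` is a hypothetical helper, and accepting a comma-separated list is an assumption; the flag handling in katalyst-core may differ.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// parseIntervals is a hypothetical helper (not katalyst-core code) that
// parses a spec such as "mbw=1s" into provisioner name -> refresh interval.
// Comma-separated pairs are accepted here as an assumption.
func parseIntervals(spec string) (map[string]time.Duration, error) {
	out := map[string]time.Duration{}
	for _, pair := range strings.Split(spec, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		name, raw, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("expected name=duration, got %q", pair)
		}
		d, err := time.ParseDuration(raw)
		if err != nil {
			return nil, fmt.Errorf("provisioner %q: %w", name, err)
		}
		out[name] = d
	}
	return out, nil
}

func main() {
	intervals, err := parseIntervals("mbw=1s")
	if err != nil {
		panic(err)
	}
	fmt.Println(intervals["mbw"])
}
```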

To enable memory bandwidth adjustment and specify a 1-second adjustment cycle interval and the memory bandwidth threshold value in MB per second, pass:

--enable-mbm --mbm-latency-threshold=14000 --mbm-control-interval=1s
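
To show how these three args map to typed settings, here is a small sketch using Go's standard `flag` package. The flag names come from the line above; everything else is an assumption: the `mbmOptions` struct, the `int` type for the threshold, and the zero defaults are hypothetical, and katalyst-core's own option wiring is not shown in this description.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// mbmOptions is a hypothetical container for the three settings.
type mbmOptions struct {
	Enable    bool
	Threshold int           // type assumed; unit as described in the PR text
	Interval  time.Duration // adjustment cycle interval
}

// parseMBMArgs parses the args with the standard library flag package.
// Defaults are zero values here because the real defaults are not stated.
func parseMBMArgs(args []string) (mbmOptions, error) {
	var o mbmOptions
	fs := flag.NewFlagSet("mbm-sketch", flag.ContinueOnError)
	fs.BoolVar(&o.Enable, "enable-mbm", false, "enable memory bandwidth adjustment")
	fs.IntVar(&o.Threshold, "mbm-latency-threshold", 0, "threshold value")
	fs.DurationVar(&o.Interval, "mbm-control-interval", 0, "adjustment cycle interval")
	err := fs.Parse(args)
	return o, err
}

func main() {
	opts, err := parseMBMArgs([]string{
		"--enable-mbm",
		"--mbm-latency-threshold=14000",
		"--mbm-control-interval=1s",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", opts)
}
```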

Which issue(s) this PR fixes:

NA - new feature

Special notes for your reviewer:

This PR is arranged roughly in the following blocks of commits; hopefully reviewers can pick the relevant block to provide feedback on:

  1. CI job additions
  2. mbw lib refactorings (mbw metrics collection part)
  3. metrics provisioner implementation
  4. mbw lib memory bandwidth adjustment part
  5. mbm adjustment control inside the qrm plugin (the actual adjustment is done via an external manager)
codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 52.61845% with 570 lines in your changes missing coverage. Please review.

Project coverage is 56.70%. Comparing base (c9f1aaf) to head (647058d). Report is 15 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| pkg/mbw/monitor/monitor.go | 39.54% | 96 Missing and 11 partials :warning: |
| pkg/mbw/monitor/controller.go | 21.97% | 68 Missing and 3 partials :warning: |
| pkg/mbw/monitor/umc.go | 0.00% | 55 Missing :warning: |
| ...agent/qrm-plugins/cpu/dynamicpolicy/mbm/control.go | 57.84% | 35 Missing and 8 partials :warning: |
| pkg/mbw/monitor/l3pmc.go | 59.74% | 28 Missing and 3 partials :warning: |
| pkg/mbw/utils/pci/pciutils.go | 50.00% | 29 Missing and 2 partials :warning: |
| pkg/mbw/utils/helper.go | 79.72% | 21 Missing and 8 partials :warning: |
| pkg/mbw/monitor/rdt.go | 28.20% | 26 Missing and 2 partials :warning: |
| pkg/mbw/monitor/monitor_util.go | 0.00% | 26 Missing :warning: |
| pkg/util/machine/extension.go | 59.32% | 21 Missing and 3 partials :warning: |
| ... and 18 more | | |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #628      +/-   ##
==========================================
+ Coverage   56.62%   56.70%   +0.07%
==========================================
  Files         544      571      +27
  Lines       51408    52761    +1353
==========================================
+ Hits        29108    29916     +808
- Misses      18603    19095     +492
- Partials     3697     3750      +53
```

| [Flag](https://app.codecov.io/gh/kubewharf/katalyst-core/pull/628/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kubewharf) | Coverage Δ | |
|---|---|---|
| [unittest](https://app.codecov.io/gh/kubewharf/katalyst-core/pull/628/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kubewharf) | `56.70% <52.61%> (+0.07%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kubewharf#carryforward-flags-in-the-pull-request-comment) to find out more.
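
As a quick sanity check on the coverage diff, the project coverage percentages can be reproduced as Hits divided by Lines. The sketch below assumes that definition; `coveragePercent` is a hypothetical helper and not part of Codecov or this PR.

```go
package main

import "fmt"

// coveragePercent returns hits/lines as a percentage.
// It assumes project coverage is defined as Hits / Lines.
func coveragePercent(hits, lines int) float64 {
	if lines == 0 {
		return 0
	}
	return 100 * float64(hits) / float64(lines)
}

func main() {
	// Hits and Lines taken from the coverage diff for main and for #628.
	fmt.Printf("main: %.2f%%\n", coveragePercent(29108, 51408))
	fmt.Printf("#628: %.2f%%\n", coveragePercent(29916, 52761))
}
```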
