Proposal: Intel RDT/MBA support for OCI/runc and Docker

xiaochenshen commented 7 years ago

The descriptions of Intel RDT/MBA features, use cases and Linux kernel interface are
heavily based on the Intel RDT documentation of the Linux kernel:

https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

Thanks to the authors of the kernel patches:
* Vikas Shivappa <vikas.shivappa@linux.intel.com>
* Fenghua Yu <fenghua.yu@intel.com>
* Tony Luck <tony.luck@intel.com>

Status: Intel RDT/MBA support for OCI and Docker software stack

Intel RDT/MBA support in OCI (merged PRs):

TODO list - Intel RDT/MBA support in Docker:

3. Intel RDT/MBA support in containerd

4. Intel RDT/MBA support in Docker Engine (moby/moby)

5. Intel RDT/MBA support in Docker CLI

What is Intel RDT/MBA:

Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature of Intel Resource Director Technology (RDT); Cache Allocation Technology (CAT) is another. For details of Intel RDT and Cache Allocation Technology (CAT) support for runc and Docker, please refer to #433.

MBA hardware details can be found in section 17.18 of the Intel Software Developer Manual and on the Intel RDT homepage.

MBA provides indirect and approximate throttling of memory bandwidth (b/w) for software. A user controls the resource by specifying either a percentage of the maximum memory bandwidth or, if the MBA Software Controller is enabled (https://github.com/opencontainers/runc/pull/1919), a memory bandwidth limit in MBps.

Linux kernel interface for Intel RDT/MBA:

In Linux kernel 4.12 and newer, Intel RDT/MBA is supported on some Intel Xeon platforms with the kernel config option CONFIG_INTEL_RDT. In Linux kernel 5.1 and newer, the option is CONFIG_X86_CPU_RESCTRL.

To check if MBA is enabled:

$ cat /proc/cpuinfo

Check whether the output contains the 'rdt_a' and 'mba' flags.
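The same check can be scripted. The following is a minimal sketch, assuming the flags appear on a `flags` line of /proc/cpuinfo; the function name `has_mba_flags` is hypothetical and not part of runc.

```python
def has_mba_flags(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line lists both 'rdt_a' and 'mba'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only inspect the CPU feature list, not e.g. the model name.
        if sep and key.strip() == "flags":
            if {"rdt_a", "mba"} <= set(value.split()):
                return True
    return False
```

On a Linux host it could be called as `has_mba_flags(open("/proc/cpuinfo").read())`.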

The Intel RDT kernel interface is documented at the following link; MBA and CAT use the same interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

Intel RDT "resource control" filesystem hierarchy:

mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks

For MBA support in runc, we will reuse the infrastructure and code base of Intel RDT/CAT, which was implemented in #1279. We can also use the tasks and schemata files to configure memory b/w resource constraints.

The file tasks has a list of the tasks that belong to this group (e.g., the <container_id> group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent.
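As an illustration of the write described above, here is a minimal sketch. It assumes a group directory that already exists; the helper name `add_task_to_group` is hypothetical. It mirrors `echo <pid> > tasks`: on a real resctrl mount the kernel treats the write as a request to move the task, while on an ordinary file it simply replaces the contents.

```python
import os


def add_task_to_group(group_dir: str, pid: int) -> None:
    """Write one task ID to the group's 'tasks' file.

    The move out of the previous group is done by the kernel on a
    resctrl mount; this function only performs the write.
    """
    with open(os.path.join(group_dir, "tasks"), "w") as f:
        f.write(f"{pid}\n")
```

For a container group this would be called with a path such as /sys/fs/resctrl/<container_id> and the ID of the container's init process.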

The file schemata has a list of all the resources available to this group. Each resource (L3 cache, memory b/w) has its own line and format.

Memory b/w is per L3 cache domain. The schema format:

    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
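To make the format concrete, the sketch below parses one MB schema line into a mapping from cache ID to bandwidth value. It is an illustration only; `parse_mb_schema` is a hypothetical name, and the parser makes no assumption about whether the values are percentages or MBps.

```python
from typing import Dict


def parse_mb_schema(line: str) -> Dict[int, int]:
    """Parse a line such as 'MB:0=20;1=70' into {0: 20, 1: 70}."""
    prefix, sep, body = line.strip().partition(":")
    if not sep or prefix != "MB":
        raise ValueError(f"not an MB schema line: {line!r}")
    result: Dict[int, int] = {}
    for entry in body.split(";"):
        cache_id, eq, bandwidth = entry.partition("=")
        if not eq:
            raise ValueError(f"malformed entry: {entry!r}")
        result[int(cache_id)] = int(bandwidth)
    return result
```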

The examples for runc:

For example, consider a two-socket machine with two L3 caches, where the minimum memory b/w is 10% and the memory b/w granularity is 10%. Tasks inside the container may use a maximum memory b/w of 20% on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}
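The percentages in such a schema are constrained by the hardware. As a rough sketch, assuming the rule from the kernel's resctrl documentation that the available control steps are min_bandwidth + N * bandwidth_gran, and assuming that intermediate values are rounded up to the next step, a requested value could be normalized as follows. The function name and the 10%/10% defaults are illustrative (taken from the example above) and are not part of runc.

```python
def round_to_control_step(percent: int, min_bw: int = 10, gran: int = 10) -> int:
    """Round a requested percentage up to the next control step.

    Assumption: valid steps are min_bw + N * gran, capped at 100.
    """
    if not min_bw <= percent <= 100:
        raise ValueError(f"bandwidth {percent}% outside [{min_bw}, 100]")
    # Ceiling division without floating point.
    steps = -(-(percent - min_bw) // gran)
    return min(min_bw + steps * gran, 100)
```

With the defaults, 20 and 70 are already valid steps and are returned unchanged, while a request for 25 would become 30.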

If MBA Software Controller is enabled through mount option "-o mba_MBps":

mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl

We could specify memory bandwidth in "MBps" (megabytes per second) instead of percentages. The kernel then uses a software feedback mechanism, the "Software Controller", which reads the actual bandwidth using MBM counters and adjusts the memory bandwidth percentages to ensure: "actual memory bandwidth < user specified memory bandwidth".

For example, on a two-socket machine, the schema line could be "MB:0=5000;1=7000" which means 5000 MBps memory bandwidth limit on socket 0 and 7000 MBps memory bandwidth limit on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=5000;1=7000"
    }
}

cyphar commented 7 years ago

First of all, this needs a PR against runtime-spec (as did RDT/CAT). Secondly, this schema:

Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

Doesn't make sense to me in the context of JSON. Why not make it an array (or map) and then generate this in our code rather than adding this weird type information in a string?

xiaochenshen commented 7 years ago

@cyphar

> First of all, this needs a PR against runtime-spec (as did RDT/CAT).

Yes, I will submit a PR in runtime-spec soon. Thank you.

> Doesn't make sense to me in the context of JSON. Why not make it an array (or map) and then generate this in our code rather than adding this weird type information in a string?

The string "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." is exactly the MBA schema line as it appears in the kernel's schemata file.

The number of cache domains in the MBA schema depends heavily on the Intel Xeon CPU hardware topology. For example, we could run the runc container on a 1-socket, 2-socket or 4-socket server, which means we have 1, 2 or 4 cache domains. A user who is interested in the Intel RDT feature needs to learn some CPU topology details first and then write an appropriate MBA JSON config accordingly.

I am open to either the single string or the array/map format for the MBA JSON config, and I'd also like to hear other maintainers' and reviewers' opinions. The array/map format JSON may look like:

"linux": {
    "intelRdt": {
        "l3CacheSchema": "L3:0=7f0;1=1f",
        "memBwSchema": [
            {
                "cacheId": 0,
                "bwPercentage": 20
            },
            {
                "cacheId": 1,
                "bwPercentage": 70
            }
        ]
    }
}
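If the array form were adopted, the runtime would have to generate the kernel string itself. The following is a minimal sketch of that conversion, assuming the field names `cacheId` and `bwPercentage` from the example above; the function name `mem_bw_schema_line` is hypothetical and this is not runc's implementation.

```python
import json
from typing import Dict, List


def mem_bw_schema_line(entries: List[Dict[str, int]]) -> str:
    """Build the kernel 'MB:...' line from a list of per-cache entries."""
    bandwidth_by_cache: Dict[int, int] = {}
    for entry in entries:
        cache_id = int(entry["cacheId"])
        if cache_id in bandwidth_by_cache:
            raise ValueError(f"duplicate cacheId: {cache_id}")
        bandwidth_by_cache[cache_id] = int(entry["bwPercentage"])
    # Emit entries in ascending cache ID order.
    pairs = (f"{cid}={bw}" for cid, bw in sorted(bandwidth_by_cache.items()))
    return "MB:" + ";".join(pairs)


config = json.loads("""
{"linux": {"intelRdt": {"memBwSchema": [
    {"cacheId": 0, "bwPercentage": 20},
    {"cacheId": 1, "bwPercentage": 70}
]}}}
""")
print(mem_bw_schema_line(config["linux"]["intelRdt"]["memBwSchema"]))
# prints: MB:0=20;1=70
```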

In my opinion, the single string format for the MBA JSON config has some advantages:

(+) It is more straightforward for anyone familiar with the Intel RDT kernel interface, because it is the same string as in the kernel interface file.
(+) The update command support for MBA is simpler (e.g., runc update --mem-bw-schema "MB:0=10;1=80").
(+) If we support this in Docker in the future, Docker will have a simpler docker run option (e.g., --mem-bw-schema) to support MBA.

And the drawbacks:

(-) The JSON config is less user-friendly than an array/map of MBA schema entries.

cyphar commented 7 years ago

@xiaochenshen There are several reasons why I don't like having an opaque string. Ultimately the runtime-spec maintainers are the ones that make a decision here, but I believe they'd agree with me:

* Validation of the spec using a JSON schema (which we publish in releases) is not really possible for opaque strings. So there's no real way for a tool to automatically verify whether the string is correct (without writing code explicitly for it).
* Users have to generate this string before calling down to runc (or any OCI configuration). While that might be fine for some users that are using runc (or whatever) interactively, scripts will have to generate the schema.
* If the format is extended in the future, it's much less transparent when upgrades occur (in a JSON object you can add extra fields).

xiaochenshen commented 7 years ago

@cyphar

> There are several reasons why I don't like having an opaque string. Ultimately the runtime-spec maintainers are the ones that make a decision here, but I believe they'd agree with me:

Makes sense to me. Thank you.

xiaochenshen commented 6 years ago

@cyphar Do you mind if I submit a runc pull request that keeps the "unstructured opaque string" format for memBwSchema throughout the 1.x spec lifetime, for the following trade-off reasons?

* Consistency and compatibility requirements throughout the 1.x spec lifetime for the existing l3CacheSchema in runtime-spec and runc.
* All RDT resources (memory bandwidth and L3 cache) should have unified formats (e.g., "l3CacheSchema": "L3:0=7f0;1=1f", "memBwSchema": "MB:0=20;1=70").

The background is below. Thank you for reviewing.

@wking and I had a discussion about the format of l3CacheSchema and memBwSchema in https://github.com/opencontainers/runtime-spec/pull/932#discussion_r145567826

> I don't think the spec is a good place to play with the config format, because now that we've cut 1.0.0 with the existing l3CacheSchema, we need to continue to support it until this spec hits v2.

> we'd need to continue to support the deprecated l3CacheSchema throughout the 1.x spec lifetime.

My plan for runtime-spec part: https://github.com/opencontainers/runtime-spec/pull/932#discussion_r145599051

First, I will address "L3 cache" and "memory bandwidth" with unified formats in a single runtime-spec PR.
To support the existing "l3CacheSchema" throughout the 1.x spec lifetime, and to avoid confusion over a deprecated property,
If there is a requirement to change all Intel RDT resources into "structured schemata" in spec 2.0, I could open a new PR to rework this at an appropriate time during the spec 2.0 phase.

xiaochenshen commented 6 years ago

ping @cyphar Could you comment on https://github.com/opencontainers/runc/issues/1596#issuecomment-339620354? Thank you.

caoruidong commented 3 years ago

@xiaochenshen Any progress on containerd or docker?

xiaochenshen commented 3 years ago

@caoruidong We plan to support containerd and Docker, but some dependencies in runc are still a work in progress.

caoruidong commented 3 years ago

@xiaochenshen Do you mean https://github.com/opencontainers/runtime-spec/pull/1076? I see most of the RDT feature PRs have been merged in runtime-spec.

xiaochenshen commented 3 years ago

@caoruidong This is one of the reasons. Generally, we need to make the framework and APIs stable enough in runtime-spec and runc.

caoruidong commented 3 years ago

@xiaochenshen Thanks for the information.
