Open xiaochenshen opened 7 years ago
First of all, this needs a PR against runtime-spec (as did RDT/CAT). Secondly, this schema:
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
Doesn't make sense to me in the context of JSON. Why not make it an array (or map) and then generate this in our code rather than adding this weird type information in a string?
@cyphar
First of all, this needs a PR against runtime-spec (as did RDT/CAT).
Yes, I will submit a PR in runtime-spec soon. Thank you.
Doesn't make sense to me in the context of JSON. Why not make it an array (or map) and then generate this in our code rather than adding this weird type information in a string?
The string "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." is just the MBA schema in the schemata file in the kernel.
The number of cache domains in MBA schema heavily depends on Intel Xeon CPU hardware topology. For example, we could run the runc container on a server with 1-socket, 2-socket or 4-socket which means we have 1, 2 or 4 cache domains for the
I am open to either a single-string or an array/map format for the MBA JSON config, and I'd also like to hear other maintainers' and reviewers' opinions. The array/map format JSON may look like:
"linux": {
"intelRdt": {
"l3CacheSchema": "L3:0=7f0;1=1f",
"memBwSchema": [
{
"cacheId": 0,
"bwPercentage": 20
},
{
"cacheId": 1,
"bwPercentage": 70
}
]
}
}
In my opinion, the single string format MBA JSON config has some advantages:
docker run option (e.g., --mem-bw-schema) to support MBA.
And the drawbacks:
@xiaochenshen There are several reasons why I don't like having an opaque string. Ultimately the runtime-spec maintainers are the ones that make a decision here, but I believe they'd agree with me:
Validation of the spec using a JSON schema (which we publish in releases) is not really possible for opaque strings. So there's no real way for a tool to automatically verify whether the string is correct (without writing code explicitly for it).
Users have to generate this string before calling down to runc (or any OCI configuration). While that might be fine for some users that are using runc (or whatever) interactively, scripts will have to generate the schema.
If the format is extended in the future, it's much less transparent when upgrades occur (in a JSON object you can add extra fields).
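To make the "array in JSON, string generated in code" idea concrete, here is a minimal Go sketch. The MemBwEntry type and BuildMBSchema function are hypothetical names for illustration only; they mirror the array format proposed earlier in this thread and are not part of runtime-spec or runc. The 1..100 range check is an assumption that only fits the percentage mode.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// MemBwEntry mirrors one element of the proposed "memBwSchema" array.
// Hypothetical type: not defined by runtime-spec or runc.
type MemBwEntry struct {
	CacheID      int `json:"cacheId"`
	BwPercentage int `json:"bwPercentage"`
}

// BuildMBSchema renders structured entries as a kernel-style
// "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." line.
func BuildMBSchema(entries []MemBwEntry) (string, error) {
	if len(entries) == 0 {
		return "", fmt.Errorf("no cache domains given")
	}
	// Work on a copy so the caller's slice order is left untouched.
	sorted := append([]MemBwEntry(nil), entries...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].CacheID < sorted[j].CacheID })

	parts := make([]string, 0, len(sorted))
	for i, e := range sorted {
		if e.CacheID < 0 {
			return "", fmt.Errorf("negative cacheId %d", e.CacheID)
		}
		if i > 0 && sorted[i-1].CacheID == e.CacheID {
			return "", fmt.Errorf("duplicate cacheId %d", e.CacheID)
		}
		// Assumption: values are percentages, so they must fall in 1..100.
		if e.BwPercentage < 1 || e.BwPercentage > 100 {
			return "", fmt.Errorf("bwPercentage %d out of range for cacheId %d", e.BwPercentage, e.CacheID)
		}
		parts = append(parts, fmt.Sprintf("%d=%d", e.CacheID, e.BwPercentage))
	}
	return "MB:" + strings.Join(parts, ";"), nil
}

func main() {
	schema, err := BuildMBSchema([]MemBwEntry{
		{CacheID: 0, BwPercentage: 20},
		{CacheID: 1, BwPercentage: 70},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(schema)
}
```

With structured input like this, a JSON schema can validate field types and ranges, and the runtime owns the string formatting.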
@cyphar
There are several reasons why I don't like having an opaque string. Ultimately the runtime-spec maintainers are the ones that make a decision here, but I believe they'd agree with me:
Makes sense to me. Thank you.
@cyphar
Do you mind if I submit a runc Pull Request with "unstructured opaque string" format for memBwSchema throughout 1.x spec lifetime for the "tradeoff" reasons?
l3CacheSchema in runtime-spec and runc.
The background is below. Thank you for reviewing.
@wking and I had a discussion about the format of l3CacheSchema and memBwSchema in https://github.com/opencontainers/runtime-spec/pull/932#discussion_r145567826
I don't think the spec is a good place to play with the config format, because now that we've cut 1.0.0 with the existing l3CacheSchema, we need to continue to support it until this spec hits v2.
we'd need to continue to support the deprecated l3CacheSchema throughout the 1.x spec lifetime.
My plan for runtime-spec part: https://github.com/opencontainers/runtime-spec/pull/932#discussion_r145599051
unified formats in single runtime-spec PR.
ping @cyphar Could you help comment on https://github.com/opencontainers/runc/issues/1596#issuecomment-339620354? Thank you.
@xiaochenshen Any progress on containerd or docker?
@caoruidong We plan to support this in containerd and Docker, but some dependencies in runc are still a work in progress.
@xiaochenshen Do you mean https://github.com/opencontainers/runtime-spec/pull/1076? I see most of the RDT feature PRs have been merged in runtime-spec.
@caoruidong This is one of the reasons. Generally, we need to make the framework and APIs stable enough in runtime-spec and runc.
@xiaochenshen Thanks for the information.
Status: Intel RDT/MBA support for OCI and Docker software stack
Intel RDT/MBA support in OCI (merged PRs):
1. Intel RDT/MBA support in OCI/runtime-spec
https://github.com/opencontainers/runtime-spec/pull/932
2. Intel RDT/MBA support in OCI/runc
https://github.com/opencontainers/runc/pull/1632 https://github.com/opencontainers/runc/pull/1913 https://github.com/opencontainers/runc/pull/1930 https://github.com/opencontainers/runc/pull/1955 https://github.com/opencontainers/runc/pull/2042
3. Intel RDT/MBA Software Controller support in OCI/runtime-spec
https://github.com/opencontainers/runtime-spec/pull/992
4. Intel RDT/MBA Software Controller support in OCI/runc
https://github.com/opencontainers/runc/pull/1919
TODO list - Intel RDT/MBA support in Docker:
3. Intel RDT/MBA support in containerd
4. Intel RDT/MBA support in Docker Engine (moby/moby)
5. Intel RDT/MBA support in Docker CLI
What is Intel RDT/MBA:
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature of Intel Resource Director Technology (RDT). Cache Allocation Technology (CAT) is another one. For details of Intel RDT and Cache Allocation Technology (CAT) support for runc and Docker, please refer to #433.
MBA hardware details can be found in section 17.18 of the Intel Software Developer Manual and on the Intel RDT Homepage.
MBA provides indirect and approximate throttling of memory bandwidth (b/w) for software. A user controls the resource by indicating the percentage of maximum memory bandwidth, or a memory bandwidth limit in MBps if the MBA Software Controller is enabled (https://github.com/opencontainers/runc/pull/1919).
Linux kernel interface for Intel RDT/MBA:
In Linux kernel 4.12 and newer, Intel RDT/MBA is supported on some Intel Xeon platforms with kernel config CONFIG_INTEL_RDT. In Linux kernel 5.1 and newer, the kernel config is CONFIG_X86_CPU_RESCTRL.
To check if MBA is enabled:
$ cat /proc/cpuinfo
Check if the output has the 'rdt_a' and 'mba' flags.
The Intel RDT kernel interface is documented at the link below. MBA and CAT make use of the same interface. https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
Intel RDT "resource control" filesystem hierarchy:
For MBA support in runc, we will reuse the infrastructure and code base of Intel RDT/CAT, which was implemented in #1279. We could also make use of the tasks and schemata configuration for memory b/w resource constraints.
The file tasks has a list of tasks that belong to this group (e.g., " group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent.
The file schemata has a list of all the resources available to this group. Each resource (L3 cache, memory b/w) has its own line and format.
Memory b/w is per L3 cache domain. The schema format:
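As a rough illustration of the per-domain layout, here is a minimal Go sketch that parses a memory b/w line of the form quoted earlier in this thread, "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...". ParseMBLine is a hypothetical helper, not a runc function, and it assumes that ids and values are decimal integers and that whitespace around them can be ignored.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseMBLine parses one schemata line such as "MB:0=20;1=70" into a map
// from cache id to bandwidth value. Hypothetical helper for illustration.
func ParseMBLine(line string) (map[int]int, error) {
	line = strings.TrimSpace(line)
	if !strings.HasPrefix(line, "MB:") {
		return nil, fmt.Errorf("not an MB line: %q", line)
	}
	out := make(map[int]int)
	for _, item := range strings.Split(strings.TrimPrefix(line, "MB:"), ";") {
		idStr, bwStr, ok := strings.Cut(item, "=")
		if !ok {
			return nil, fmt.Errorf("malformed entry %q", item)
		}
		id, err := strconv.Atoi(strings.TrimSpace(idStr))
		if err != nil {
			return nil, fmt.Errorf("bad cache id in %q: %v", item, err)
		}
		bw, err := strconv.Atoi(strings.TrimSpace(bwStr))
		if err != nil {
			return nil, fmt.Errorf("bad bandwidth in %q: %v", item, err)
		}
		if _, dup := out[id]; dup {
			return nil, fmt.Errorf("duplicate cache id %d", id)
		}
		out[id] = bw
	}
	return out, nil
}

func main() {
	domains, err := ParseMBLine("MB:0=20;1=70")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for id, bw := range domains {
		fmt.Printf("cache domain %d -> %d\n", id, bw)
	}
}
```

The parser deliberately does not interpret the value as a percentage or as MBps, since that depends on how the resctrl filesystem was mounted.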
The examples for runc:
For example, on a two-socket machine with two L3 caches, where the minimum memory b/w is 10% with a memory b/w granularity of 10%, tasks inside the container may use a maximum memory b/w of 20% on socket 0 and 70% on socket 1.
If MBA Software Controller is enabled through mount option "-o mba_MBps":
We could specify memory bandwidth in "MBps" (megabytes per second) instead of "percentages". The kernel underneath would use a software feedback mechanism, or "Software Controller", which reads the actual bandwidth using MBM counters and adjusts the memory bandwidth percentages to ensure: "actual memory bandwidth < user specified memory bandwidth".
For example, on a two-socket machine, the schema line could be "MB:0=5000;1=7000" which means 5000 MBps memory bandwidth limit on socket 0 and 7000 MBps memory bandwidth limit on socket 1.