microsoft / Windows-Dev-Performance

A repo for developers on Windows to file issues that impede their productivity, efficiency, and efficacy

Argon containers use only weak cores on systems with heterogeneous CPUs #113

Closed conioh closed 8 months ago

conioh commented 1 year ago

NOTE: Edited on 2023-07-11 to correct inaccurate data about ARM64 machines.

Windows Build Number

10.0.22621.0

Processor Architecture

AMD64 (also ARM64, details below)

Memory

64GB (also 32GB, 4GB, etc.)

Storage Type, free / capacity

Micron 3400 SSD, 929GB/1900GB + WDS400T3X0C, 1250GB/3700GB (not that it matters in any way; also various other devices with all kinds of SSD storage sizes)

Relevant apps installed

Containers feature enabled in Turn Windows features on or off.
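For reference, the same feature can be enabled from an elevated PowerShell with the standard DISM cmdlet:

```powershell
# Equivalent to ticking "Containers" in "Turn Windows features on or off".
Enable-WindowsOptionalFeature -Online -FeatureName Containers
```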

Traces collected via Feedback Hub

Which WPR profile would you like me to record? Not that it would give you anything you can't get by just running five PowerShell commands.

Issue description

Argon containers use only weak cores on systems with heterogeneous CPUs.

We use Docker for our build environment. This is required for many reasons. For example, it's quite common for our code to be incompatible with past _and_ future Visual C++ versions because of new C++ features going in and bugfixes (or new bugs) modifying the behavior of old code. Generally, people can always build the `master` branch on their host machine, but they can't build an older version (required when servicing an issue with an older but still supported version). But people can't refrain from upgrading Visual Studio, because then the current code won't build. It's impractical to keep all the versions of the Visual C++ build tools installed. So we have Docker images with the entire build environment, and a text file in the code repository pointing to the tag of the corresponding build tools Docker image. That way, in order to build any commit from the product source repository we only need to pull the referenced Docker image. This is also what our CI system does.
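For illustration, a minimal sketch of that scheme (the file name `build-tools-image.txt`, the registry, and the build command are hypothetical placeholders):

```powershell
# Pull the build-tools image pinned by the current commit and build inside it.
$image = (Get-Content .\build-tools-image.txt).Trim()   # e.g. "registry.example.com/vc-build-tools:v42"
docker pull $image
docker run --rm --isolation=process `
    --mount "type=bind,src=$(Get-Location),dst=C:\src" -w C:\src `
    $image powershell.exe -Command "build command"
```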

We make every effort to run Docker containers using process isolation (mostly by making sure our images start from base images compatible with the host), as our tests have shown that running under process isolation has no overhead compared to running directly on the host (sometimes it's even slightly faster¹), while running under Hyper-V isolation has a significant overhead, among other problems.

Unfortunately, we have recently discovered that when running on systems with heterogeneous CPUs (Intel Alder Lake and Raptor Lake CPUs, and ARM64 CPUs under certain conditions), the processes inside process-isolated containers utilize only the "weak" cores (E-cores on Intel, LITTLE cores on ARM). See: https://github.com/docker/for-win/issues/13562

Using sophisticated debugging techniques we have also discovered that the issue is not with Docker/Moby but rather with Windows. See the reproduction section below.

On all of our machines with Intel CPUs prior to Alder Lake, building inside process-isolated Docker containers runs as fast (or slightly faster¹), but on our Raptor Lake machines with Intel i9-13900H CPUs (6 P-cores = 12 logical cores, plus 8 E-cores) we get the following results:

|         | Host  | Process isolation |
| ------- | ----- | ----------------- |
| Debug   | 500s  | 1000s             |
| Release | 1000s | 2500s             |

(The numbers are rounded averages. For example, Debug on the host actually takes 490s-497s.)

This is pretty awful. We found out it is actually faster to build on an older model of the same computer, from two iterations back, with an Intel i9-11900H CPU, since it doesn't have two kinds of cores and actually uses all of them.

On the aforementioned Raptor Lake machine, the 12 logical P-cores do nothing while the 8 E-cores do all the work. We assume Release is more compute-intensive due to the optimizations, and there we see a x2.5 factor (which is close to the 40% CPU utilization claimed by some tools seeing 100% on 8 cores and practically nothing on the other 12), while Debug is not as compute-bound, so there the slowdown factor is "only" x2.
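Spelling out the arithmetic (a rough model that assumes the Release build is fully CPU-bound and, for simplicity, treats all cores as equal): 8 busy cores out of 20 logical processors is 8/20 = 40% overall utilization, and 1/0.4 = 2.5, matching the observed Release slowdown.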

A workaround we've considered is using Hyper-V isolation. It kind of works, but not really. The complete table, with Hyper-V isolation included, would be:

|         | Host  | Process isolation | Hyper-V isolation |
| ------- | ----- | ----------------- | ----------------- |
| Debug   | 500s  | 1000s             | 700s              |
| Release | 1000s | 2500s             | -ICE-             |

The Debug build has a significant overhead compared to running directly on the host and compared to what process isolation should have been, but it's still better than what process isolation does on this machine. The Release build, however, just crashed MSBuild.exe for lack of memory.

You see, Hyper-V isolation has the "nice" property of causing MSBuild.exe to crash with screams about not enough memory when launching the container with as much as `--memory 64GB`, and refusing to run at all with, let's say, `--memory 96GB`:

```
> docker run --rm -it --mount "type=bind,src=$(Get-Location),dst=C:\whatever" -w C:\whatever --isolation=hyperv --cpu-count 20 --memory 96GB some-builder:some-tag powershell.exe -Command "build command"
docker: Error response from daemon: hcs::CreateComputeSystem a3270d2ab11086bbcaac433a660d5cbde47fea3d2f0349b9d2ed22a9839a76e9: The paging file is too small for this operation to complete.
```
Screenshot:

![image](https://github.com/microsoft/Windows-Dev-Performance/assets/10606081/a3879d58-6476-4a09-8241-82c816e17fbe)

That's a separate issue, with Hyper-V isolation being less than useful, but we're not here to solve that. The point is that using Hyper-V isolation isn't a valid workaround, on the grounds that it doesn't work. Not that it would be a good workaround even if it did work, on the grounds that it's slow and there's no reason for process isolation not to work properly.


¹ One element we have discovered that makes processes inside process-isolated containers run slightly faster than directly on the host is less interference by certain security software. We assume there may be other causes. Generally we say that the performance of process-isolated Docker containers is approximately equal to that of running directly on the host, except in pathological cases. Like the one we have here, unfortunately.

Steps to reproduce

Execute the following PowerShell commands:

```
[E:\]
> $sieve_URI = "https://github.com/kimwalisch/primesieve/releases/download/v11.1/primesieve-11.1-win-x64.zip" #or arm64
[E:\]
> Invoke-WebRequest -Uri $sieve_URI -OutFile "sieve.zip"
[E:\]
> Expand-Archive -Path ".\sieve.zip" -DestinationPath ".\sieve\"
[E:\]
> CmDiag.exe CreateContainer -Type ServerSilo -Id 11111111-1111-1111-1111-111111111111 -FriendlyName Foo
The container was successfully created. Its ID is: 11111111-1111-1111-1111-111111111111
The container will continue running until it is terminated. A new instance of cmdiag has been spun
up in the background to keep the container alive.

[E:\]
> CmDiag.exe Map 11111111-1111-1111-1111-111111111111 -ReadOnly "$PWD\sieve" "C:\sieve"
[E:\]
> CmDiag.exe Console 11111111-1111-1111-1111-111111111111 powershell.exe
Executing: powershell.exe
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows

PS C:\Windows\system32> C:\sieve\primesieve.exe 1e14
Sieve size = 256 KiB
Threads = 20
0%
```

Take a look at your favorite CPU utilization tool. I used Sysinternals Process Explorer. You're welcome to use Task Manager, perfmon.msc, or whatever floats your boat.
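If you prefer a command-line view, here is a quick sketch using the standard `Get-Counter` cmdlet and the `Processor Information` counter set, sampled while primesieve runs; on the affected machines only the E-core instances should show high values:

```powershell
# Print per-logical-processor utilization; instance names are "<group>,<index>".
Get-Counter -Counter '\Processor Information(*)\% Processor Time' `
            -SampleInterval 2 -MaxSamples 3 |
    ForEach-Object {
        $_.CounterSamples | Sort-Object InstanceName |
            Format-Table InstanceName, @{ n = 'CPU %'; e = { [math]::Round($_.CookedValue, 1) } }
    }
```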

Dell XPS 9530 with i9 13900H, 64GB RAM, aforementioned Micron SSD:

(screenshot: per-core CPU utilization graphs; the 8 E-cores are fully loaded while the 12 logical P-cores sit idle)

If you're not sure which core is which, you can hover over the core. Process Explorer, unlike Task Manager, tells you which logical core belongs to which physical core. Or you can use Sysinternals Coreinfo:

coreinfo.exe output:

```
Logical Processor to Cache Map:
**------------------  Data Cache          0, Level 1,   48 KB, Assoc  12, LineSize  64
**------------------  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
**------------------  Unified Cache       0, Level 2,    1 MB, Assoc  10, LineSize  64
********************  Unified Cache       1, Level 3,   24 MB, Assoc  12, LineSize  64
--**----------------  Data Cache          1, Level 1,   48 KB, Assoc  12, LineSize  64
--**----------------  Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
--**----------------  Unified Cache       2, Level 2,    1 MB, Assoc  10, LineSize  64
----**--------------  Data Cache          2, Level 1,   48 KB, Assoc  12, LineSize  64
----**--------------  Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
----**--------------  Unified Cache       3, Level 2,    1 MB, Assoc  10, LineSize  64
------**------------  Data Cache          3, Level 1,   48 KB, Assoc  12, LineSize  64
------**------------  Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
------**------------  Unified Cache       4, Level 2,    1 MB, Assoc  10, LineSize  64
--------**----------  Data Cache          4, Level 1,   48 KB, Assoc  12, LineSize  64
--------**----------  Instruction Cache   4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**----------  Unified Cache       5, Level 2,    1 MB, Assoc  10, LineSize  64
----------**--------  Data Cache          5, Level 1,   48 KB, Assoc  12, LineSize  64
----------**--------  Instruction Cache   5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**--------  Unified Cache       6, Level 2,    1 MB, Assoc  10, LineSize  64
------------*-------  Data Cache          6, Level 1,   32 KB, Assoc   8, LineSize  64
------------*-------  Instruction Cache   6, Level 1,   64 KB, Assoc   8, LineSize  64
------------****----  Unified Cache       7, Level 2,    2 MB, Assoc  16, LineSize  64
-------------*------  Data Cache          7, Level 1,   32 KB, Assoc   8, LineSize  64
-------------*------  Instruction Cache   7, Level 1,   64 KB, Assoc   8, LineSize  64
--------------*-----  Data Cache          8, Level 1,   32 KB, Assoc   8, LineSize  64
--------------*-----  Instruction Cache   8, Level 1,   64 KB, Assoc   8, LineSize  64
---------------*----  Data Cache          9, Level 1,   32 KB, Assoc   8, LineSize  64
---------------*----  Instruction Cache   9, Level 1,   64 KB, Assoc   8, LineSize  64
----------------*---  Data Cache         10, Level 1,   32 KB, Assoc   8, LineSize  64
----------------*---  Instruction Cache  10, Level 1,   64 KB, Assoc   8, LineSize  64
----------------****  Unified Cache       8, Level 2,    2 MB, Assoc  16, LineSize  64
-----------------*--  Data Cache         11, Level 1,   32 KB, Assoc   8, LineSize  64
-----------------*--  Instruction Cache  11, Level 1,   64 KB, Assoc   8, LineSize  64
------------------*-  Data Cache         12, Level 1,   32 KB, Assoc   8, LineSize  64
------------------*-  Instruction Cache  12, Level 1,   64 KB, Assoc   8, LineSize  64
-------------------*  Data Cache         13, Level 1,   32 KB, Assoc   8, LineSize  64
-------------------*  Instruction Cache  13, Level 1,   64 KB, Assoc   8, LineSize  64
```

You can see that the first 6 pairs of cores share caches and have larger caches than the other 8 cores.
This also happens on ARM64 devices such as the Samsung Galaxy Book Go (Snapdragon 7c Gen 2, 4GB RAM) and the Surface Pro X (SQ1, 8GB RAM; SQ2, 16GB RAM):

![image](https://github.com/microsoft/Windows-Dev-Performance/assets/10606081/ebd3104f-36b4-49c7-8c45-786450a803be)

On both of these devices the problem manifests itself only when using Argon containers _and_ running on battery power. That is:

|           | AC power           | Battery            |
| --------- | ------------------ | ------------------ |
| Host      | All cores utilized | All cores utilized |
| Container | All cores utilized | Only LITTLE cores  |

Only when running inside a container on battery power, the big cores spike for a moment, like on Alder/Raptor Lake, and then aren't utilized by the container anymore. If the device is connected to AC power they are utilized again, and if it is disconnected they go unused once more. (The small fluctuations after the spike are due to the 7c being extremely weak; even running Task Manager and a browser requires non-negligible CPU power. On the Surface Pro X the graph is smoother after the spike.) The first six cores are the LITTLE ones and the other two are the big ones:
coreinfo.exe output:

```
Logical Processor to Cache Map:
*-------  Instruction Cache  0, Level 1,   32 KB, Assoc   4, LineSize  64
*-------  Data Cache         0, Level 1,   32 KB, Assoc   4, LineSize  64
*-------  Unified Cache      0, Level 2,   64 KB, Assoc   4, LineSize  64
********  Unified Cache      1, Level 3,    1 MB, Assoc  16, LineSize  64
-*------  Instruction Cache  1, Level 1,   32 KB, Assoc   4, LineSize  64
-*------  Data Cache         1, Level 1,   32 KB, Assoc   4, LineSize  64
-*------  Unified Cache      2, Level 2,   64 KB, Assoc   4, LineSize  64
--*-----  Instruction Cache  2, Level 1,   32 KB, Assoc   4, LineSize  64
--*-----  Data Cache         2, Level 1,   32 KB, Assoc   4, LineSize  64
--*-----  Unified Cache      3, Level 2,   64 KB, Assoc   4, LineSize  64
---*----  Instruction Cache  3, Level 1,   32 KB, Assoc   4, LineSize  64
---*----  Data Cache         3, Level 1,   32 KB, Assoc   4, LineSize  64
---*----  Unified Cache      4, Level 2,   64 KB, Assoc   4, LineSize  64
----*---  Instruction Cache  4, Level 1,   32 KB, Assoc   4, LineSize  64
----*---  Data Cache         4, Level 1,   32 KB, Assoc   4, LineSize  64
----*---  Unified Cache      5, Level 2,   64 KB, Assoc   4, LineSize  64
-----*--  Instruction Cache  5, Level 1,   32 KB, Assoc   4, LineSize  64
-----*--  Data Cache         5, Level 1,   32 KB, Assoc   4, LineSize  64
-----*--  Unified Cache      6, Level 2,   64 KB, Assoc   4, LineSize  64
------*-  Instruction Cache  6, Level 1,   64 KB, Assoc   4, LineSize  64
------*-  Data Cache         6, Level 1,   64 KB, Assoc   4, LineSize  64
------*-  Unified Cache      7, Level 2,  256 KB, Assoc   8, LineSize  64
-------*  Instruction Cache  7, Level 1,   64 KB, Assoc   4, LineSize  64
-------*  Data Cache         7, Level 1,   64 KB, Assoc   4, LineSize  64
-------*  Unified Cache      8, Level 2,  256 KB, Assoc   8, LineSize  64
```

Expected Behavior

All cores utilized.

Actual Behavior

Only "weak" cores are utilized.

This is the same behavior described in https://github.com/docker/for-win/issues/13562, but without Docker. Inbox CmDiag.exe is enough.

Additional information

It certainly doesn't seem to be an issue with the specific container engine, runtime or base image.

Due to the behavior on the ARM64 devices, it might be related to power management. Perhaps the container is somehow "confused" about the power state or configuration?
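A simple way to probe that theory (hedged: `Win32_Battery` is a standard CIM class, but whether the CIM infrastructure is fully functional inside a bare ServerSilo is an assumption on my part) is to run the same query on the host and in the container console while on battery, and compare:

```powershell
# BatteryStatus 1 = discharging (on battery), 2 = on AC power.
Get-CimInstance -ClassName Win32_Battery |
    Select-Object BatteryStatus, EstimatedChargeRemaining
```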

SauliusZ87 commented 1 year ago

A temporary workaround would be to disable the E-cores in the BIOS. Then I think process isolation on the P-cores would be faster than Hyper-V isolation with all cores.

conioh commented 1 year ago

@SauliusZ87: I've tried that, and indeed when all the E-cores are disabled in the firmware settings the container is scheduled to run on the P-cores, the only ones available. I didn't mention it in the issue because we don't consider it a viable alternative. For one, disabling the E-cores presumably disables TXT.

Even worse, it significantly degrades the performance on the machine and its battery life, and running these build tasks inside the container isn't the only thing we do on them.

Until the issue is properly resolved, we prefer the trade-off in which we suffer the overhead of Hyper-V isolation during said builds but keep all 14 cores the rest of the time, rather than speed up these builds but give up the 8 E-cores completely. Perhaps on a build server under constant load we would have gone the other way.

conioh commented 11 months ago

Since opening the issue we have made the following findings, both obviously related to scheduling:

  1. If we restrict the process affinity to the P-cores only (and no E-cores), the process is scheduled to the P-cores, but only to half of them, i.e. only one of every two hyper-threaded logical cores. It sometimes even switches between the logical cores of the same physical core, but never runs on both logical cores of a single physical core.

  2. If we set the process priority to above normal, the process is scheduled to all cores. (A quick way to reproduce both findings is sketched right after this list.)
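A minimal sketch reproducing both findings, run in the container console from the repro above (the primesieve path is from that transcript; the affinity mask `0xFFF` assumes the first 12 logical processors are the P-cores, as on the i9-13900H):

```powershell
# Finding 2: bump the priority class and watch all cores light up.
$p = Start-Process C:\sieve\primesieve.exe -ArgumentList '1e14' -PassThru
$p.PriorityClass = 'AboveNormal'   # System.Diagnostics.ProcessPriorityClass

# Finding 1: restrict affinity to the P-cores only; the process then runs on
# just one logical core per physical P-core.
# $p.ProcessorAffinity = 0xFFF     # first 12 logical processors
```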

I hope that helps.

@SauliusZ87: That's far from a perfect workaround, but if you combine setting the processes to above-normal priority with setting a CPU limit (so they won't hog all the CPU power and prevent other processes from running), it might be better than disabling the E-cores completely.

Process priority can be set via IFEO (Image File Execution Options), and it works with the registry inside the container, not only on the host.
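For example, a minimal sketch of that IFEO approach, run inside the container (the executable names are placeholders for whatever build tools you want to boost; `CpuPriorityClass` 6 corresponds to Above Normal):

```powershell
# Set a per-image default CPU priority via Image File Execution Options.
# 0x1 = Idle, 0x2 = Normal, 0x3 = High, 0x5 = Below Normal, 0x6 = Above Normal.
$ifeo = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options'
foreach ($exe in 'cl.exe', 'link.exe') {   # placeholder image names
    $key = Join-Path $ifeo "$exe\PerfOptions"
    New-Item -Path $key -Force | Out-Null
    Set-ItemProperty -Path $key -Name CpuPriorityClass -Value 6 -Type DWord
}
```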

AdamBraden commented 8 months ago

This issue has been redirected to the appropriate team:

https://github.com/microsoft/Windows-Containers/issues/397