microsoft / Windows-Containers


`--isolation=process` fails sporadically on machines using SplunkForwarder #374

Open augusew1 opened 1 year ago

augusew1 commented 1 year ago

UPDATE

Please see this comment for the latest on this bug

Original Report:

Describe the bug: `--isolation=process` does not work on Windows Server 2022 with an equivalent-build windows-servercore image

To Reproduce

  1. Install the latest docker engine using the script from this repository
  2. Do docker pull mcr.microsoft.com/windows/servercore:ltsc2022
  3. Verify that the docker image has the same build number as the container host (one way to check is sketched after these steps)
  4. Do docker run -it mcr.microsoft.com/windows/servercore:ltsc2022
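
A minimal way to do the build-number check in step 3, assuming cmd's `ver` output on the host and the OsVersion field that docker inspect reports for Windows images:

# On the host, print the OS build number (e.g. 10.0.20348.1668)
cmd /c ver

# On the pulled image, read the build it was created for
docker inspect --format "{{.OsVersion}}" mcr.microsoft.com/windows/servercore:ltsc2022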

The container will fail with the following error:

docker: Error response from daemon: container 08a82aa4af2577260662b370a64b7b1584430103cf17c007971f73bdbac58cdc encountered an error during hcs::System::Start: context deadline exceeded. 

Expected behavior: The container should run

Configuration:

Output of docker info:

Client:
 Version:    24.0.2
 Context:    default
 Debug Mode: false

Server:
 Containers: 13
  Running: 0
  Paused: 0
  Stopped: 13
 Images: 21
 Server Version: 24.0.2
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 20348 (20348.1.amd64fre.fe_release.210507-1500)
 Operating System: Microsoft Windows Server Version 21H2 (OS Build 20348.1668)
 OSType: windows
 Architecture: x86_64
 CPUs: 6
 Total Memory: 12GiB
 Name: aplops-win-02
 ID: e324222e-b334-4a15-ba5a-734bec1f5d5a
 Docker Root Dir: C:\ProgramData\Docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Additional context

Currently, our best guess is that some combination of the following is creating the issue:

We are working on testing this on a bare metal machine to see if ESXi issues can be ruled out.

ntrappe-msft commented 1 year ago

Hi! Thanks for posting your issue.

When I usually see errors with hcs::System::Start, they're often caused by resource constraints (e.g. too little memory), networking, or incompatible host/container types. I did a very simple repro with a Windows Server Core LTSC 2022 container using the flag --isolation=process on Windows Server 2022 Datacenter (10.0.20348). That was successful.

However, it's interesting that you mentioned everything worked fine before the May 2023 (KB5026370) update. I'm going to create an internal ticket (#44926683) to see whether the update is causing any issues or if we can suggest any mitigation strategies.

We'll keep you posted when we have an update.

augusew1 commented 1 year ago

Resources on this system should not be an issue: it has 12 GB of RAM and 6 vCPU cores. Disk space should not be an issue either; it is ~100 GB. We have still not had a chance to reproduce the issue on bare metal, i.e. without ESXi.

ntrappe-msft commented 1 year ago

I agree that it doesn't seem like a resource issue. I used fewer cores and less memory and it started up for me. If you feel up to the challenge, there are 2 things you could try doing. I'll keep trying to repro your error in the meantime.

Task 1: Get a fresh Windows Server Core 2022 image.

Task 2: See what's timing out and blocking the container start up process.

Side note: I know the second one can be confusing. We don't expect our customers to troubleshoot their own problems, but if you CAN find the processes related to your container session and share them, we may get some insight into what's going on.
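
For example, here's a rough sketch of one way to see which processes appear in the container's new session; the snapshot-and-diff approach and the variable names below are illustrative, not a prescribed procedure:

# Snapshot processes before attempting the container start
$before = Get-Process | Select-Object Id, ProcessName, SessionId

# ... in another window, run: docker run -it mcr.microsoft.com/windows/servercore:ltsc2022

# Snapshot again while the start is hanging, then show what appeared, grouped by session
$after = Get-Process | Select-Object Id, ProcessName, SessionId
Compare-Object $before $after -Property Id, ProcessName, SessionId |
    Where-Object SideIndicator -eq '=>' |
    Sort-Object SessionId |
    Format-Table Id, ProcessName, SessionId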

augusew1 commented 1 year ago

Task 1 we have attempted (including spinning up a brand new, fresh VM and installing docker from scratch). As for task 2, I am more than willing to troubleshoot, my issue is that I don’t know where to start. This is good information that I did not know and I will attempt it this week. We are also working in parallel to stand up our WS2022 image on some loaner hardware to see if it’s something about the VM that is causing this. Unfortunately, due to internal policy requirements, this will take some time.

One question I did have: hcsshim seems to have some specific troubleshooting guidelines; are any of these relevant here? I know https://github.com/microsoft/hcsshim exists as well.

augusew1 commented 1 year ago

I did get a hit on the Session ID. It looks like a process called smss.exe starts every time I run docker, in a brand new session.


augusew1 commented 1 year ago

@ntrappe-msft We were able to test this on some loaner hardware, just a laptop. It was not connected to our domain so some security software (in monitor-only mode) was not running, but it should be a 1:1 test otherwise. The test of docker run --isolation=process with this fresh laptop worked.

With the release of KB5027225 and KB5027544, our test VM was updated to include these. Furthermore, a new Windows Server Core ltsc2022 image was released (a0cb8066ecaf ). We removed all docker images on the VM and re-downloaded them to match what was on the laptop. Still, process isolation did not work. The docker version was still 24.0.2 on both.

Presently, we have the following situation (assuming ltsc2022@a0cb8066ecaf on both):

So, we currently believe this to be some bug with WS2022 guests on ESXi. While we do have some evidence that this was working before KB5026370, we have not been able to reproduce a working setup by just reverting the KB patch. Any advice for further debugging would be greatly appreciated, as I think we are stuck on our end. The timeout is still happening, with smss.exe being the only process started in that session.

Another thing to add: I am more than willing to do some deep, technical troubleshooting and provide logs or additional information about our infrastructure. But as of right now, I don't really know what is useful or relevant.

ntrappe-msft commented 1 year ago

Hmm ok this is interesting. I'll do a bit more digging with the new information you've given. Thanks for being so open to help debug, I may take you up on that if I can't repro the error you're getting.

ntrappe-msft commented 1 year ago

We're currently waiting on a VM with the VMWare hypervisor to repro on our end. Thanks for your patience.

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been open for 30 days with no updates. @ntrappe-msft, please provide an update or close this issue.

ntrappe-msft commented 1 year ago

Still working on this. I was OOF for 2 weeks so I apologize for the delay.

augusew1 commented 1 year ago

I appreciate the update. We've also been working on our end to open a ticket with VMWare; it has taken a lot of back-and-forth to convince our virtualization team that this is a bug on VMWare's end.

ntrappe-msft commented 1 year ago

Hi, I'm currently waiting on an Azure VM running VMware. Once I have that, we'll try to repro your bug!

microsoft-github-policy-service[bot] commented 12 months ago

This issue has been open for 30 days with no updates. @ntrappe-msft, please provide an update or close this issue.

augusew1 commented 11 months ago

Hi @ntrappe-msft ,

Just wanted to follow up with what we've found on our end. We have reached out to VMWare, but have been told that we must get support via Docker, as VMWare doesn't support individual applications, even with their production support. They said Docker would have to troubleshoot with them directly, after we confirm with their support that this is a VMWare issue. In truth, while evidence points to it, we don't exactly know if VMWare is the real issue. We can't debug any further on that end.

They also pointed us to this documentation from Docker claiming that nested virtualization is "not supported". (I know that page is for Docker Desktop, but IIRC it's still running the same container runtime underneath, perhaps with some proprietary changes?)

The above is problematic for us for a couple reasons:

Given the above, we have concluded that Docker on Windows with nested virtualization is too difficult for our company to support at this time, and have instead procured physical hardware. Feel free to leave this issue open or close it; we currently have no reason to believe this is a Microsoft issue. (See the follow-up below.)

augusew1 commented 11 months ago

I apologize for backtracking, but now that we have the physical hardware set up, it seems we were wrong. Process isolation is still not working. Since this has nothing to do with VMWare, we're going back to investigating our security/compliance software.

Unfortunately there's not much I can share about our infrastructure without permission, but please see the below:

System Info

Output of docker info:

Client:
 Version:    24.0.6
 Context:    default
 Debug Mode: false

Server:
 Containers: 7
  Running: 0
  Paused: 0
  Stopped: 7
 Images: 31
 Server Version: 24.0.6
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 20348 (20348.1.amd64fre.fe_release.210507-1500)
 Operating System: Microsoft Windows Server Version 21H2 (OS Build 20348.1970)
 OSType: windows
 Architecture: x86_64
 CPUs: 40
 Total Memory: 255.9GiB
 Name: aplops-win-1
 ID: 6bb5c255-fbff-434d-8de7-08b1e8b8e59e
 Docker Root Dir: F:\docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

Servercore image:

mcr.microsoft.com/windows/servercore   ltsc2022   2b76bb46ced7   2 weeks ago   4.11GB

System specs

Processor information:

Caption           : Intel64 Family 6 Model 63 Stepping 2
DeviceID          : CPU0
Manufacturer      : GenuineIntel
MaxClockSpeed     : 2600
Name              : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
SocketDesignation : CPU1

Caption           : Intel64 Family 6 Model 63 Stepping 2
DeviceID          : CPU1
Manufacturer      : GenuineIntel
MaxClockSpeed     : 2600
Name              : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
SocketDesignation : CPU2

Memory: 256 GB

Storage: 2 TB HDD

OS Version: Microsoft Windows [Version 10.0.20348.1970]

I would greatly appreciate some more information on HCS, specifically what syscalls or Windows features it's calling into that could be affected by:

ntrappe-msft commented 11 months ago

Thank you for doing such a thorough investigation and attempting to resolve this on your end. Our intuition is that this probably isn't an issue with the VMware hypervisor but something else. I'll try to repro this using all the specs you provided and hopefully we can see your error. Otherwise, I'll work on getting you the information we're allowed to share on HCS.

augusew1 commented 11 months ago

Hi @ntrappe-msft

Curious update that may either a) make everything clearer on Microsoft's side, or b) make everything more confusing on Microsoft's side. Hopefully the former! I have been able to reproduce both getting --isolation=process to work and getting it to fail. Once our machine is rebooted, --isolation=process will fail. However, if we install and then remove Hyper-V as a Windows Feature, then reboot, it will work.

Reproduction steps

  1. Reboot machine
  2. Confirm --isolation=process times out with hcs::System::Start: context deadline exceeded.
  3. Confirm Hyper-V is OFF (one way to check is sketched after these steps)
  4. Run: Install-WindowsFeature -Name Hyper-V -Restart
  5. Once rebooted, run Remove-WindowsFeature -Name Hyper-V -Restart
  6. Once the machine has rebooted again, try docker, and --isolation=process will work.
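
A quick way to confirm the Hyper-V state in step 3, assuming a Windows Server host where the ServerManager cmdlets are available:

# Shows whether the Hyper-V role is currently installed (InstallState: Installed / Available / Removed)
Get-WindowsFeature -Name Hyper-V | Select-Object Name, InstallState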

I have successfully tested this 3 times in a row. I'm pretty baffled, but it does work. Note that process isolation still feels slow; it takes about 30 seconds to a minute to actually start the container. I am guessing this is security-software related.

EDIT: Correction: rebooting the machine between installing and removing Hyper-V works consistently; installing and removing in the same boot only sometimes works.

ntrappe-msft commented 9 months ago

Sorry for the delay, I'm working on using a separate VMware hypervisor to try to repro this. I'll update the Issue next week with my results.

augusew1 commented 9 months ago

Hi @ntrappe-msft,

We have been doing some renewed debugging at our company, looking into exactly when/how docker is failing, and we've made some progress. To start: it's confirmed not a VMWare issue, or at least not only a VMWare issue.

Second, while toggling Hyper-V on/off does seem to work, it's not as consistent as I first implied. In fact, what really seems to matter is when docker --isolation=process is run. From this, and issue #426, we were able to narrow it down to the SplunkForwarder service affecting docker. It appears to be a race condition: if you run docker before this service starts, it will work, and if you run docker after it starts, it will fail. The workaround in #426 also works (a scripted version is sketched after the steps below), that is:

  1. Stop SplunkForwarder service
  2. Run docker run --isolation=process .... (confirm that it works and the container loads)
  3. Start SplunkForwarder service
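
For reference, a minimal sketch of that per-boot workaround, assuming the forwarder is registered under the service name SplunkForwarder (adjust if Get-Service shows a different name on your host):

# Stop the forwarder, prime docker with one process-isolated container, then restart the forwarder
Stop-Service -Name SplunkForwarder

# Any process-isolated container will do; this just confirms hcs::System::Start succeeds
docker run --rm --isolation=process mcr.microsoft.com/windows/servercore:ltsc2022 cmd /c echo ok

Start-Service -Name SplunkForwarder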

Docker now works for the remainder of the boot, i.e. this process needs to be repeated once per boot. For now, we are switching debugging efforts to investigating what these services might have in common with one another and where they overlap. As far as we can tell, SplunkForwarder only needs a port or two open, and does not use any virtualization/containerization features. So this is perhaps a firewalling issue, but it is not obvious why this would cause docker to completely fail. Any insight on Microsoft's side on how this could happen is appreciated. If @brwilkinson has any insights on his end, that would also be appreciated. Hopefully we can all share notes and figure this out.

ntrappe-msft commented 9 months ago

Hi, thank you for all the work you're doing debugging! I'm trying to repro the race condition now. Out of curiosity, can I ask what you're using the SplunkForwarder service for?

ArtyomBaranovskiy commented 8 months ago

Just accidentally spotted this. I can confirm we have had the same issue for years with our Windows container workloads; we narrowed it down to the SplunkForwarder workaround a long time ago. We're still using Windows Server 1809 and ltsc1809 container images, got nothing from engaging Splunk support, and seeing this discussion, it seems we're doomed to keep using the stop-start workaround on the newer Windows versions as well. @ntrappe-msft the company I work for uses the Splunk forwarder service to ingest logs from hosts into Splunk.

ntrappe-msft commented 8 months ago

Thanks for confirming the same problem and that there's a definite issue with the SplunkForwarder service and running containers. Did you happen to see the same behavior in that a container fails to start up if docker is run after the SplunkForwarder service starts up?

ntrappe-msft commented 8 months ago

We have significant evidence that Splunk prevents docker from starting containers. Here are some mitigation tips while we try to investigate what's going wrong:

Tip 1:

Tip 2:

ArtyomBaranovskiy commented 8 months ago

> Thanks for confirming the same problem and that there's a definite issue with the SplunkForwarder service and running containers. Did you happen to see the same behavior in that a container fails to start up if docker is run after the SplunkForwarder service starts up?

@ntrappe-msft yes, the containers always fail to start if the Splunk forwarder is running; however, I am only speaking from Windows Server 2019 experience. The reliable way to start any container in my case is to stop the Splunk forwarder, then start the containers, then start the Splunk forwarder again. What you're suggesting in "Tip 1" above certainly doesn't work.

In Tip 2 do you actually mean Windows event logs? (We do ingest all event logs into Splunk too; there are cases where it's useful, like searching for a certain application crash error across the fleet of hosts.)

brwilkinson commented 8 months ago

Can we please update the title of this issue to map to the current findings?

ntrappe-msft commented 7 months ago

@ArtyomBaranovskiy I'm sorry to hear that even Tip #1 doesn't work for you (some people have had success with it). And yes, #2 refers to the Windows event logs.

ntrappe-msft commented 7 months ago

Thanks for everyone's patience. We need to understand more about how the SplunkForwarder could be interfering with Docker so we've reached out to our contacts at Splunk. I'll update this when I learn more.

augusew1 commented 7 months ago

I didn't want to update this issue before actually testing it myself, but we've found another clue, and since @ntrappe-msft is reaching out to Splunk, I think it's relevant to bring up now: Splunk may be interfering with named pipes, see this page.

Once we have the chance, I am going to try running docker over a TCP socket instead of the named pipe and see if that works without any workarounds. I encourage anyone who is able to try this to do it. Our main machine is under heavy use during the week, but I was going to attempt it this weekend.
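
For anyone wanting to try the same thing, a rough sketch of the change, assuming the daemon config lives at the default C:\ProgramData\docker\config\daemon.json and that binding dockerd to a local, unencrypted TCP port is acceptable in your environment:

# Point the daemon at a local TCP socket in addition to the named pipe
Set-Content -Path 'C:\ProgramData\docker\config\daemon.json' -Value '{"hosts": ["tcp://127.0.0.1:2375", "npipe://"], "group": "docker"}'
Restart-Service docker

# Tell the client to talk TCP for this session, then test a process-isolated container
$env:DOCKER_HOST = 'tcp://127.0.0.1:2375'
docker run --rm --isolation=process mcr.microsoft.com/windows/servercore:ltsc2022 cmd /c echo ok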

augusew1 commented 7 months ago

Listening on a TCP socket did not work. I still got the "context deadline exceeded" error. So, Splunk is somehow interfering with docker at a lower level than the host socket.

fady-azmy-msft commented 7 months ago

If this is an issue with the Splunk forwarder, it would be more appropriate to raise it through the proper channels with them.

ntrappe-msft commented 7 months ago

@augusew1 Sorry this has been such a lengthy process. I just spoke to a Splunk engineer who said we have to follow the formal channels to get support. Would you mind raising a case with Splunk about the problem you're experiencing? You can add me to the case so Splunk has a Microsoft contact if they need our support.

augusew1 commented 7 months ago

I appreciate the update. We are going through formal channels with Splunk and we'll forward this information to them.

nickcva commented 7 months ago

@augusew1 I had a case open with Microsoft for 8 months on cascading container failures and crashing containers that could never recover, with zero progress made. The only thing that worked for me was Hyper-V isolation. We had servers with 128 cores and 2 TB of RAM that could only run 20 containers, with a max of 3 SHIR jobs per instance, under process isolation; it was a huge waste of machines. But with Hyper-V isolation the issues magically disappear.

https://github.com/Azure/Azure-Data-Factory-Integration-Runtime-in-Windows-Container/issues/7

ntrappe-msft commented 7 months ago

@nickcva I’m sorry to hear that you got no support. I don't think that repo is very active. Could you raise an Issue on this GitHub repo with information about your host, repro steps, behavior, etc.? I’ll make sure we give it some attention.

augusew1 commented 6 months ago

We got an update from Splunk support: they have investigated this issue before and concluded that it must be a Docker issue, not a Splunk one. Specifically, that "the conflict causing Docker to become unstable does not stem from a defect or bug within [Splunk]". They therefore advised us to reach out to Docker support about this. Our company does not have a contract with Docker as far as I know, so that might be a dead end for us if I can't convince someone to get support.

I looked more into Docker and did discover that they have a logging plugin for Splunk. However, we are not using it, as seen in docker info:

 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog

Still, I wonder if the presence of Splunk along with the logging plugin could somehow be causing this.
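
One quick, illustrative way to rule out a per-container override of the log driver (the daemon default above is json-file) is to check the driver each container was actually created with:

# List every container with its log driver; anything other than json-file would point at an override
docker ps -aq | ForEach-Object {
    docker inspect --format '{{.Name}} {{.HostConfig.LogConfig.Type}}' $_
}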

microsoft-github-policy-service[bot] commented 5 months ago

This issue has been open for 30 days with no updates. @ntrappe-msft, please provide an update or close this issue.

ntrappe-msft commented 4 months ago

Hi, thank you again for being so persistent and trying to tackle this Issue. I also could not make forward progress with Splunk but I'm so glad you at least found that Docker may be the main culprit. I'm going to ping some of the people I know at Docker to see if they're aware of this Issue and if there's a mitigation strategy they'd recommend.

ntrappe-msft commented 4 months ago

@augusew1 Alrighty! Some very nice people at Docker are taking a look at this.

akerouanton commented 4 months ago

Hi @augusew1,

I've tried to reproduce this issue on a Windows 11 Pro 23H2 (build 22631.3296) machine, with SplunkForwarder started before dockerd.exe and no specific config. Unfortunately, I can't reproduce it. I think your Splunk Forwarder config might be at play here. First, what version of Splunk Forwarder are you running? Could you tell us what you set in $SPLUNK_HOME/etc/system/local/ (or any other config file), and what splunk.exe commands you executed to set up the forwarder? Did you configure anything specific to Docker service logs, etc.?

You could try to disable all your inputs and then re-enable them one by one until you find the culprit.
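
A hedged sketch of one way to see which inputs are active and where each is configured; the install path and service name below are assumptions based on a default Universal Forwarder setup:

# Dump the effective inputs configuration, annotated with the file each setting comes from
& 'C:\Program Files\SplunkUniversalForwarder\bin\splunk.exe' btool inputs list --debug

# Restart the forwarder after disabling or re-enabling inputs so the change takes effect
Restart-Service -Name SplunkForwarder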

The error message you reported initially indicates a timeout during a call to vmcompute.dll. It seems this call returns a response asynchronously, but the error message gives no clue about whether the timeout happens during the request or while waiting for a response. I'll follow up with @ntrappe-msft and see if we can find some debugging command that could help pinpoint what's deadlocked within vmcompute.dll.
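
In the meantime, one low-effort place to look is the Host Compute Service's own event channels around a failed start; the channel names below are assumptions based on a typical container-enabled host, so adjust or skip if they aren't present:

# Recent entries from the HCS (vmcompute) admin/operational channels
Get-WinEvent -LogName 'Microsoft-Windows-Hyper-V-Compute-Admin' -MaxEvents 30 |
    Format-List TimeCreated, Id, LevelDisplayName, Message

Get-WinEvent -LogName 'Microsoft-Windows-Hyper-V-Compute-Operational' -MaxEvents 30 |
    Format-List TimeCreated, Id, LevelDisplayName, Message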

ArtyomBaranovskiy commented 4 months ago

Sorry @akerouanton, maybe I'm missing something, but the original issue is about Windows Server running Windows containers in process isolation. AFAIK, Docker on Windows 11 only supports Hyper-V isolation, so I'm not surprised you cannot reproduce the issue: https://learn.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/hyperv-container. Based on this other page, maybe I'm wrong, so that might mean the newer versions (23H2?) of Windows have already fixed the issue? I'm following this topic as the issue still exists for me on Windows Server 2019, and it seems also for the original author on Windows Server 2022. https://learn.microsoft.com/en-us/virtualization/windowscontainers/about/faq#how-are-containers-licensed--is-there-a-limit-to-the-number-of-containers-i-can-run-

akerouanton commented 4 months ago

@ArtyomBaranovskiy The last section of the 1st link you posted shows that containerized processes running with process isolation appear on the host, unlike when run with Hyper-V isolation.

If I launch ping in a container started with --isolation=process and then do Get-Process on the host, here's what I get:

PS C:\Users\akerouanton> Get-Process -Name ping | Format-Table Id, ProcessName, SessionID

  Id ProcessName SessionId
  -- ----------- ---------
6972 PING                2

When ping is run in a container started without --isolation=process:

PS C:\Users\akerouanton> Get-Process -Name ping | Format-Table Id, ProcessName, SessionID
Get-Process : Cannot find a process with the name "ping". Verify the process name and call the cmdlet again.
At line:1 char:1
+ Get-Process -Name ping | Format-Table Id, ProcessName, SessionID
+ ~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (ping:String) [Get-Process], ProcessCommandException
    + FullyQualifiedErrorId : NoProcessFoundForGivenName,Microsoft.PowerShell.Commands.GetProcessCommand

And when ping is run on the host:

PS C:\Users\akerouanton> Get-Process -Name ping | Format-Table Id, ProcessName, SessionID

   Id ProcessName SessionId
   -- ----------- ---------
10224 PING                0

Notice that the Session ID isn't the same between the first and the last example. This confirms I'm running with process isolation 😉

Unfortunately I don't have access to Windows Server right now but I'm not entirely sure this really matters.

augusew1 commented 4 months ago

@akerouanton

Docker is not configured with anything extra; the full contents of the daemon's config are:

{"hosts":  ["npipe://"], "group": "docker"}

We do not do any custom logging of individual containers, they are run as-is:

docker run -it mcr.microsoft.com/windows/servercore:ltsc2022

In particular, we don't have any long-running containers running services; they are used for GitLab's CI/CD. But to be clear, this issue is reproducible outside of GitLab by just running a shell.

Splunk is run as a system service, with the command line "C:\Program Files\SplunkUniversalForwarder\bin\splunkd.exe" service. All of the configuration is in C:\Program Files\SplunkUniversalForwarder. Our server.conf is rather minimal; this is the entire contents with some sensitive fields redacted:

[general]
serverName = REDACTED
pass4SymmKey = REDACTED

[sslConfig]
sslPassword = REDACTED

[lmpool:auto_generated_pool_forwarder]
description = auto_generated_pool_forwarder
peers = *
quota = MAX
stack_id = forwarder

[lmpool:auto_generated_pool_free]
description = auto_generated_pool_free
peers = *
quota = MAX
stack_id = free

However, in talking with our IT department I was given the following advice:

All of the contents of /etc/apps will also provide changes over Splunk's default config. So if you want a full list of active settings in server.conf, you can run this from the command line:

C:\Program Files\SplunkUniversalForwarder\bin\splunk.exe btool server list --debug

This will output both default settings as well as changes we have made, and where these changes were made. We use "apps" to manage configs, so there won't be much in the /etc/system/local/server.conf file.

The output of that command is exceptionally large, and I can't tell what's a default config and what's a modified value, so I'm not sure how useful it is. The next thing I will try is working with our IT department to disable configs in Splunk to see if one particular configuration is causing issues.

microsoft-github-policy-service[bot] commented 3 months ago

This issue has been open for 30 days with no updates. @ntrappe-msft, please provide an update or close this issue.

Masklyne commented 3 months ago

I am seeing this issue as well. We disabled all of the apps being used by the Forwarder and the issue still remained. This was seen with isolation set to process for the Windows container being run. We are using Docker as well.

Masklyne commented 3 months ago

We are encountering this issue in our environment. Did this thread die because a solution or workaround was found, or is this still being looked at? Thanks!

microsoft-github-policy-service[bot] commented 1 month ago

This issue has been open for 30 days with no updates. @ntrappe-msft, please provide an update or close this issue.

microsoft-github-policy-service[bot] commented 2 weeks ago

This issue has been open for 30 days with no updates. @ntrappe-msft, please provide an update or close this issue.