avin3sh commented 1 year ago

Describe the bug When running multiple containers simultaneously using the same gMSA on either the same host or different hosts, it causes one or multiple containers to lose their domain trust relationship leading to various issues including LsaLookUp and negotiate auth failures. This especially happens when the count of containers is equal to or more than count of domain controllers in the environment. However, it is also possible to run into this issue when the count of containers is less than count of domain controllers in the environment, provided two or more containers attempt to talk to the same domain controller.

To Reproduce

Build an image from the following Dockerfile

FROM mcr.microsoft.com/dotnet/aspnet:6.0-windowsservercore-ltsc2019 AS base

USER ContainerAdministrator
RUN reg.exe add "HKLM\SYSTEM\CurrentControlSet\Control\Lsa" /v LsaLookupCacheMaxSize /t REG_DWORD /d 0 /f

USER ContainerUser
ENTRYPOINT ["powershell.exe", "1..500 | %{ [void][System.Security.Principal.NTAccount]::new('contoso\\someobj').Translate([System.Security.Principal.SecurityIdentifier]).Value; Start-Sleep -Milliseconds (Get-Random -Minimum 100 -Maximum 1000); }"]

Replace contoso\someobj above with sam name of an actual object.

Run the container image simultaneously on multiple hosts using the following command. To increase the chances of running into the issue, if there are N domain controllers in the environment, run the container image simultaneously on at least N+1 hosts
```
docker run --security-opt "credentialspec=file://gmsa-credspec.json" --hostname <gMSAName>  -it <image>
```
Replace <gMSAName> with actual gMSA and file://gmsa-credspec.json with actual gMSA Credential Spec file and <image> with the container image
Monitor the output of all the containers, eventually one or more containers will start throwing the following error message. This usually happens within first few seconds of the container starting, assuming the docker run ... in (2) above was run simultaneously on different hosts. If it does not happen, repeat (2) until it does.

Exception calling "Translate" with "1" argument(s): "The trust relationship between this workstation and the primary domain failed.

While a running container is throwing the above error message in its output, exec into it and try performing some domain operation - that will fail as well.

Expected behavior gMSAs on multiple Windows Containers is officially supported since at least Windows Server 2019. Running gMSA on multiple containers simultaneously should not result in trust relationship to fail.

Configuration:

Edition: Windows Server 2022
- Base Image being used: Windows Server Core
Container engine: docker

Additional context

While the reproducer uses a PowerShell base image to demonstrate the bug, we had originally run into this issue in an ASP.NET Core web application while performing negotiate authentication.
The container image in the reproducer purposefully disables LSA LookUp Cache by setting LsaLookupCacheMaxSize to 0 to simplify the example.
If you were to observe traffic of a container that has run into this issue, the packet capture will indicate a lot of DSERPC/RPC_NETLOGON failure messages. You may also observe packets reporting nca_s_fault_sec_pkg_error
Sometimes the container may "autorecover". It is purely a chance event. Whenever this happens, you can see RPC_NETLOGON packets in the network capture. Typically this results in the container recovering its domain trust relationship only when the NETLOGON happens through a different domain controller than what container had earlier communicated to.
It is also possible to re-establish domain trust relationship of a failing container by running the following command in a failing container (the runtime user should be a ContainerAdministrator or should have administrators privileges)
```
nltest.exe /sc_reset:contoso.com
```
If the above command does not succeed, you may have to run it more than once. When the command succeeds, more often than not, all the affected containers and not just the current container "recover".
As mentioned in the bug description, it is very easy to run into this issue when the count of containers is more than the number of domain controllers in the environment but that is not the only scenario.
docker run ... is not the only way to run into this issue. It can be also be reproduced on an orchestration platform like Kubernetes, by setting replicas count of the Deployment to N+1; or by using scaling feature.

ntrappe-msft commented 1 year ago

Hi, thanks for bringing this issue to our attention. First, I've have to give credit where credit is due. This is so well written up! Thank you for providing a very clear description of the current and expected behavior.

Second, this is a quick question: Is there a reason why all the containers in this cluster all have the same gMSA?

avin3sh commented 1 year ago

Is there a reason why all the containers in this cluster all have the same gMSA?

We actually don't use the same gMSA for all the containers in the cluster. Different type of application containers run with different gMSAs.

The problem arises when there are multiple instances (replicas) of the same application, such as an application that requires to be highly available. During my testing I also found that it does not have to be replicas of same container image/deployment, different containers still running as the same gMSA will also run into this issue.

Multiple containers running as same gMSA can't be avoided for these purposes - without them we can't distribute our workload or promise high availability.

avin3sh commented 1 year ago

@ntrappe-msft has there been an internal confirmation of this bug and any discussions on a fix ? This issue severely limits ability to scale Windows containers and use AD authentication because of direct relation between number of containers and domain controllers.

ntrappe-msft commented 1 year ago

Hi, thank you for your patience! We know this is blocking you right now and we're working hard to make sure it's resolved as soon as possible. We've reached out to the gMSA team to get more context on the problem and some troubleshooting suggestions.

ntrappe-msft commented 1 year ago

The gMSA team is still doing their investigation but they can confirm that this is unexpected and unusual behavior. We may ask for some logs in the future if it would help them diagnose the root cause.

ntrappe-msft commented 1 year ago

Hi, could you give us a few follow-up details?

Are you using process-isolated or hyper-v isolated containers?
Are you using the same container hostname and gMSA name?
What is the host OS version?

avin3sh commented 1 year ago

Hi Nicole @ntrappe-msft

Are you using process-isolated or hyper-v isolated containers?

Process Isolation

Are you using the same container hostname and gMSA name?

Correct

What is the host OS version?

Microsoft Windows Server 2022 Standard (Core), with October CU applied

Sharing some more data from our experiments, in case it help the team to troubleshoot the issue:

When all the containers of a gMSA are given a different, unique, value for the hostname, at least the Domain Trust Relationship error goes away - although that may have broken something else, we did not look in that direction. However;
If the value of hostname for each container is >15 characters in length, and the value is unique BUT first 15 characters are not-unique, we again start seeing the issue related to Domain Trust Relationship. This interestingly coincides with 15 character length limit for computer name / NETBIOS limitation.

This means if you have a very long value of hostname and first few characters are not unique, gMSA issues start occurring in multi-container scenario.

If you were to use some container orchestration solution, like Kubernetes, the value of pod name, which is what gets supplied as hostname value to the container runtime, is in all the realistic scenarios >15 characters and the first few characters are common for each pod (deployment name + replicaset ID) -- this would cause problem with gMSAs in that case as well
Just out of curiosity, instead of docker runtime, I directly used containerd and I could reproduce the problem there as well
Not specifying hostname when launching containers with same gMSA does not give this error, I believe the container runtime internally gives some random ID as the value for hostname in that case (scenario (1) above) -- that seem to imply the problem here is multiple container having same name ?

In context of containers with gMSA, having same name as gMSA name has been the norm for a while. Not specifying hostname isn't always possible, explicitly specifying hostname shouldn't break the status quo, and when using orchestration solutions, like the example I listed above, the user has no direct control on the value of hostname.

This issue has been severely restricting usage of Windows Containers at scale :(

ntrappe-msft commented 1 year ago

🔖 ADO 47828389

avin3sh commented 11 months ago

While we appreciate that the Containers team is still looking into this issue, I wanted to share some insights into just how seemingly difficult this problem is to work around.

In order to prevent requests landing on "bad" containers, I was trying to write custom aspnet core health check that could inquire the status of Trust Relationship of the container and mark the service as unhealthy when Domain Trust fails. What seemed to be a very straightforward tempory fix/compromise for our problems turned out to be a complex anomaly:

Firstly, netapi32 DLL is not available in nanoserver, and won't be until next major release of Windows Server - https://github.com/microsoft/Windows-Containers/issues/72#issuecomment-1569257600
If we have the Server Core image as the base image and have the DLL moved to the nanoserver container, we could work around this but only to run into more problems
Within the gMSA container - the Win32 call will not automatically pick the Netlogon Policy Server
And if you do hardcode a domain controller for this purpose, the netlogon query response would still indicate that the trust relationship exists (NERR_Success as opposed to something like RPC_S_SERVER_UNAVAILABLE) - and this is while the container is actively reporting trust errors while performing AD operations
And even if we had managed to get all of this to work, to "repair" the Secure Channel we would have to run our container as ContainerAdministrator which introduces bunch of other security concerns
PowerShell commands such as Test-ComputerSecureChannel simply fail, because the interpretation of "hostname" is different within a gMSA Container vs. outside of it - where the command is typically used
In essence, any of the means to [programmatically] catch gMSA and Domain Trust issues for Containers, like ones documented at https://kubernetes.io/docs/tasks/configure-pod-container/configure-gmsa/#troubleshooting, turned out to be unhelpful

My guesses for why the usual means to troubleshoot gMSA/Trust problems are not working for us is probably an attempted to fix a VERY SIMILAR problem for Containers in Server 2019:

We changed the behavior in Windows Server 2019 to separate the container identity from the machine name, allowing multiple containers to use the same gMSA simultaneously.

Since we do not understand how this was achieved, we have again reached a dead end and are desperately hoping the Containers team is able to solve our gMSA-Containers-At-Scale problem

ntrappe-msft commented 10 months ago

Thanks for the additional details. We've had a number of comments from internal and external teams struggling with the same issue. Our support team is still working to find a workaround that they can publish.

ntrappe-msft commented 9 months ago

Support team is still working on this. We'll make sure we also update our "troubleshoot gMSAs" documentation when we can address the problem.

israelvaldez commented 9 months ago

We're also running into this issue, we're using Windows Server 2019 container images, however there are no multiple container instances running with the same gMSA however we still get the same error about trust. Our case is that we try to login with an AD user it doesn't work, but the gMSA does work, should I raise a ticket with support for assistance.

Update:

All of our containers have the same host name even if they run using different gMSAs
Using a different name for the containers does not solve the issue

avin3sh commented 8 months ago

Hello @ntrappe-msft - is Containers team in touch with the gMSA/CCG group. Our support engineers informed us that we are the only ones who have reported this issue, but based on your confirmation in https://github.com/microsoft/Windows-Containers/issues/405#issuecomment-1911045014, and assuming from reactions on this issue, it is clear there are many users who have run into this exact problem.

Our case is that we try to login with an AD user it doesn't work, but the gMSA does work, should I raise a ticket with support for assistance.

@israelvaldez, see my above comment. I would think it is worth highlighting this problem to Microsoft Support from your end as well, so that that it is obvious, without any doubt, that multiple customers face this and it could be appropriately prioritized (if not already)

WillsonAtJHG commented 8 months ago

Hi @ntrappe-msft we are also experiencing the same issue with our gMSA containers intermittently losing trusts with our domain and needs to be restarted. Wondering if Microsoft has any update on this issue.

We have multiple container instances running the same app and using the gMSA. Interestingly even though each of them have their own unique hostname defined, the log shows it's connecting to the DC using the gMSA name as MachineName. Host/domain/dc names replaced with **.

EventID : 5720 MachineName : gmsa Data : {138, 1, 0, 192} Index : 1309 Category : (0) CategoryNumber : 0 EntryType : Error Message : The session setup to the Windows Domain Controller \ for the domain ** failed because the computer gmsa does not have a local security database account. Source : NETLOGON ReplacementStrings : {\, , } InstanceId : 5720 TimeGenerated : 13/03/2024 10:23:24 AM TimeWritten : 13/03/2024 10:23:24 AM UserName : Site : Container :

ntrappe-msft commented 8 months ago

@avin3sh you are definitely not the only one experiencing this Issue. There are a number of internal teams who would like to increase the severity of this Issue and attention towards it. I'm crossing my fingers that we'll have a positive update soon. But it does help us if more people comment on this threat highlighting that they too are encountering this problem.

macsux commented 8 months ago

This is a huge issue for us at Broadcom with multiple fortunate 100 customers wanting this feature in one of our products and thousands of workloads being blocked from being migrated off VMs to containers

israelvaldez commented 8 months ago

In my scenario I created a new gMSA othern than the one I was using (which was not being used in multiple pods) and I was able to workaround this problem. i.e. my pod had gmsa1, I created gmsa2 and suddenly the trust betweent he pod and the domain was fine.

julesroussel3 commented 8 months ago

The workaround is appreciated, but we would like to see Microsoft fix this issue directly so that customers do not need to significantly redesign their environments.

avin3sh commented 7 months ago

This issue has been fortunate enough to not get attention of auto-reminder bots so far, but I am afraid they will be here anytime soon. I see this has been finally assigned, does it mean a fix is in the works ?

julesroussel3 commented 5 months ago

Please do not close this issue until the underlying technical problem has been resolved.On Jun 3, 2024, at 3:01 PM, microsoft-github-policy-service[bot] @.***> wrote: This issue has been open for 30 days with no updates. @riyapatel-ms, please provide an update or close this issue.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: @.***>

avin3sh commented 4 months ago

We have started seeing a new issue with nanoserver images released April onwards (build 20348.2402+), the HTTP service running inside the container has started throwing 'System.Net.InternalException' was thrown. -1073741428, which, as per someone in the .NET platforms, translates to The trust relationship between the primary domain and the trusted domain failed. (see: https://github.com/dotnet/runtime/discussions/105567#discussioncomment-10161657)

As a result, all our new containers are failing to serve ANY incoming kerberized requests!! This is no longer intermittent. This is no longer about number of containers running simultaneously with a gMSA. This is straight up fatal error rendering the container pretty much unusable.

Now one would think "downgrading" to an older nanoserver image released prior to April would fix this ? Wrong. That would make the problem even more worse because of the another unresolved Windows-Containers issue - https://github.com/microsoft/Windows-Containers/issues/502 -- downgrading will potentially cause all the container infrastructure to BSOD!!!

To summarize,

the original issue remains unresolved!
April onwards, you can't use latest or even older nanoserver images
apps built off newer images are pretty much incapable of Kerberos
going back to images built off March or prior CUs has potential to cause your container host to go in BSOD

This issue desperately needs a fix. It's almost as if you can't use Windows Containers for any of your gMSA and Active Directory use cases anymore!

KristofKlein commented 3 months ago

We are facing also similar issue on the usage of gMSA within scaling windows containers. We also provide hostname into the container creation, but in fact due to gMSA containers are identify themself as the gMSA name. This leads to mismatches on our backend that tries to keep track on incoming traffic. It gets heavily confused while all request are coming from the same "machine". Of course, as long as I only have one container started making use of the one gMSA I am all good. the moment I scale it crashes. (fun fact: the product that gets confused is also from Microsoft :P)

So also curious what will happen to this :)

Ultimately, this is what kills me (from here )

Can't it put the container/hostname as suffix or so ? :D

NickVanRaaijT commented 2 months ago

We appear to maybe be facing a similair issue "The trust relationship between the primary domain and the trusted domain failed" on our AKS cluster. Is this being worked on?

vrapolinario commented 2 months ago

Quick question on the environment you folks have on which you are seeing this issue: Is NETBIOS enabled in your environment? NETBIOS uses port 137,138, and 139, with 139 being Netlogon. I have tested this with a customer (who was kind enough to validate their environment) on which a deployment with multiple pods worked normally. This customer has NETBIOS disabled and port 139 between pods/AKS cluster is blocked to the Domain Controllers.

I'm not saying this is a fix, but wanted to check if others see this error even with NETBIOS disabled or the port blocked.

avin3sh commented 2 months ago

From what I have found (I can do a more thorough test later), NETBIOS is disabled on the container host's primary interface and on the HNS management vNIC (we use Calico in VXLAN mode). However, the vNICs for individual pods show NETBIOS as enabled. We haven't done anything to block traffic on Port 139.

Do you suggest we perform a test after disabling NETBIOS on Pod vNICs as well; AND blocking Port 139 ? I am not sure how to configure this within CNI but perhaps I can write some script to disable netbios by making registry change after the container is network has come up, unless you have some script handy that you could share.

BTW just to reiterate the severity from my earlier comment https://github.com/microsoft/Windows-Containers/issues/405#issuecomment-2253067799 - nanoserver images after March 2024 have made this problem worse. Earlier the issue was intermittent and dependent on some environmental factors but March 2024+ nanoserver images are causing 100% failures.

vrapolinario commented 2 months ago

Thanks @avin3sh for the note. No need for a fancy script or worrying from the cluster/pod side - if you block port 139 at the network/NSG level, this should help validate. Again, I'm asking here as a validation, we haven't been able to narrow it down yet, but we have customers running multiple containers simultaneously with no errors and I noticed they have NETBIOS disabled AND port 139 blocked.

As for the Nano Server issue, can you please clarify: The issue happens even if you launch just one container? You're saying gMSA is not working on Nano Server at all?

avin3sh commented 2 months ago

Thank you so much for clarifying. I will share my observation after blocking traffic on port 139.

As for the Nano Server issue, can you please clarify: The issue happens even if you launch just one container? You're saying gMSA is not working on Nano Server at all?

We have a bunch of ASP.NET services. We use Negotiate/Kerberos authentication middleware. If I use an ASP.NET nanoserver image that is using Windows build from ~March~ April 2024 or later, the Kerberos token exchange is straight up failing and no request is able to get authenticated. You can see SSPI blob exchange functions in the error call stack - see here for the full call stack -> https://github.com/dotnet/runtime/discussions/105567#discussion-6980650

So essentially our web services are not able authenticate using negotiate when using any image from April or later. This does not happen if I launch just one container, but it happens 100% if there are multiple containers. I think I haven't seen this behavior in beefier windowserver image but can't say for sure as we don't generally use them due to their large size.

I have also seen varying behavior depending on whether the container user is ContainerUser or NT AUTHORITY\NetworkService - the issue exists in both the scenarios but manifests differently.

macsux commented 2 months ago

@avin3sh a little off topic, but you may want to look at my project that can seamlessly translate tokens from jwt to kerberos and vice versa. It's often used as sidecar and it doesn't require container to be domain joined - it uses kerberos.net library under the covers which is a managed implementation instead of relying on sspi.

https://github.com/NMica/NMica.Security

avin3sh commented 2 months ago

@vrapolinario I tried this with Port 139 blocked like so (for TCP, UDP, Inbound and Outbound):

New-NetFirewallRule -DisplayName "Block Port 139" -Direction Inbound -LocalPort 139 -Protocol TCP -Action Block

But the problem persisted.

Any chance the customer who tried this had large number of domain controllers in their environment ? We have seen that as long as your deployment replicas is less than or equal to number of domain controllers in the environment, you typically don't run into this issue.

avin3sh commented 2 months ago

We are happy to collaborate with you to test out various scenarios/experimental patches/etc. We already have a Microsoft Support case ongoing (@ntrappe-msft may be familiar) but it hasn't moved in several months - if you want to take a look at our case, more than willing to validate any suggestions that you may have for this problem.

vrapolinario commented 2 months ago

I believe I'm aware of the internal support case and I reached out to the team with this note as well. They are now running some internal tests as well, but I haven't heard back from them. The main thing I wanted you all here to please evaluate is if your environment is for some reason using NETBIOS. The fact that some of you reported the DCs getting the same hostname from the pod requests with a character limit of 15 tells me there's some NETBIOS communication happening.

https://learn.microsoft.com/en-us/windows-server/identity/ad-ds/manage/dc-locator-changes

By default, DNS should be in use, so if you only see 15 characters in the hostnames going to the DCs, tells me something is off. By disabling NETBIOS or blocking port 139, you can quickly check if this helps solve the issue you are seeing.

avin3sh commented 2 months ago

Blocking TCP/139 hasn't helped. I also tried blocking UDP/137 and UDP/138 out of curiosity but that does not seem to have made any difference either.

I started a packet capture before even pods came up and reproduced the scenario, I don't see any packets going on TCP/139

There is bunch of chatter on TCP/135 - RPC - but of course I can't block it without disrupting other things on the host.

There are indeed RPC_NETLOGON (per wireshark) packets originating from the containers during this time, but that's over random high numbered ports, taking us back to my very first update. I believe this is just netlogon RPC happening over a port picked from 49152 - 65535

Let me know if you want me to try something else.

vrapolinario commented 2 months ago

Thank you for the validation. We actually ran the same test last night, but I didn't have time to reply here. I can confirm that blocking TCP 139 won't solve the problem. Microsoft still recommends moving away from NETBIOS unless you need it for compatibility, but this is not the issue here.

We're still investigating this internally and will report back.

As for the Nano Server image issue, can I ask you to please open a separate issue here so we can investigate? These seem like Teo separate problems that are unrelated. The fact that you can't make the nano server ima work at all indicates a different root cause.

avin3sh commented 2 months ago

As for the Nano Server image issue, can I ask you to please open a separate issue here so we can investigate? These seem like Teo separate problems that are unrelated. The fact that you can't make the nano server ima work at all indicates a different root cause.

@vrapolinario I have create a new issue https://github.com/microsoft/Windows-Containers/issues/537 with the exact steps to reproduce the bug. It's a simple aspnetcore web app with minimal api with Kerberos enabled. Given the error message is related to the domain trust failure and it does not happen when using NTLM but only Kerberos, I strongly feel it may be related to the larger gMSA issue being discussed here but I will wait for your analysis.

NickVanRaaijT commented 2 months ago

I've followed this guide on a new cluster https://learn.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/gmsa-aks-ps-module

It results in the same error 1786 0x6fa ERROR_NO_TRUST_LSA_SECRET

This is with a AD server running Windows Server 2016 and a AKS cluster with windows server 2019 nodes with GMSA enabled.

microsoft-github-policy-service[bot] commented 1 month ago

This issue has been open for 30 days with no updates. @riyapatel-ms, please provide an update or close this issue.

ntrappe-msft commented 3 weeks ago

Sharing an update from the internal team.

We've released a new version of the Windows gMSA webhook. One of the new changes is creating a random hostname for gMSA which should help to get around the random domain trust failures. To use this:

Check that the pod spec hostname isn't set
Set randomHostname to be true

Thanks to @jsturtevant, @zylxjtu, and @AbelHu for this release.

avin3sh commented 1 week ago

Thank you @ntrappe-msft and everyone who worked on releasing the workaround. While I am still evaluating it and there are some error scenario that no longer occur after deploying the newer gmsa-webhook, I still see few inconsistencies.

537 is still an issue, unless I use nanoserver image released in March 2024 or prior. All the recent Windows 2022 nanoserver images don't work with NTAuth + aspnetcore unless I change my container user to `Network Service` instead of `ContainerUser`. Could you please provide clarification on what is the recommendation for ContainerUser vs NetworkService when using gMSAs ?

microsoft / Windows-Containers

Using gMSA on multiple containers simultaneously causes Domain Trust Relationship to fail #405