moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

.NET ASP.NET web app in container loses primary domain trust randomly after some days of runtime #37459

Open daBONDi opened 6 years ago

daBONDi commented 6 years ago

.NET Framework ASP.NET and ASP.NET Core web applications running in containers randomly lose primary domain trust after some days of runtime.

All containers use the same gMSA account (credential spec file), and the applications authenticate with Kerberos.

After about 7 to 14 days of runtime, or after restarting a large batch of containers (~15) at once, some applications randomly lose primary domain trust.

Inside the container itself, the domain trust still seems to be in place.

Steps to get the application working again:

Restart the container instance (sometimes a second restart is needed).

Or restart the Netlogon service inside the container; the application then starts working again.
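The Netlogon workaround could be scripted from the Docker host roughly as follows. This is a minimal sketch, assuming the container image ships PowerShell; the container name `app1-1` and the helper names are hypothetical, and the command is only built, not executed, unless `dry_run` is turned off.

```python
import subprocess


def netlogon_restart_command(container: str) -> list[str]:
    """Build the docker exec call that restarts Netlogon inside a container."""
    return [
        "docker", "exec", container,
        "powershell", "-Command", "Restart-Service -Name Netlogon",
    ]


def restart_netlogon(container: str, dry_run: bool = True) -> list[str]:
    """Return the command; execute it only when dry_run is False."""
    cmd = netlogon_restart_command(container)
    if not dry_run:
        # Raises CalledProcessError if docker exec reports a failure.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "app1-1" is a hypothetical container name.
    print(" ".join(restart_netlogon("app1-1")))
```

This only automates the workaround; it does not address whatever causes the trust to break in the first place.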

Steps to reproduce the issue:

Hard to reproduce. When we restart all containers, some of them randomly get the domain trust error.

Describe the results you received:

The web application throws an error on the .NET call .IsUserInRole(xxx). It does not matter whether the code runs on .NET Core or the full framework, and it does not depend on the .NET version.

We got the following trace:

An unhandled exception has occurred: The trust relationship between this workstation and the primary domain failed
System.ComponentModel.Win32Exception (0x80004005): The trust relationship between this workstation and the primary domain failed
   at System.Security.Principal.NTAccount.Translate(IdentityReferenceCollection sourceAccounts, Type targetType, Boolean& someFailed)
   at System.Security.Principal.NTAccount.Translate(IdentityReferenceCollection sourceAccounts, Type targetType, Boolean forceSuccess)
   at System.Security.Principal.WindowsPrincipal.IsInRole(String role)
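Since the failure is random, it may help to watch application logs for the error text from the trace above. A minimal sketch of such a check; the function name is illustrative and how the log text is collected is left open.

```python
# Error text taken from the trace in this issue.
TRUST_ERROR = (
    "The trust relationship between this workstation "
    "and the primary domain failed"
)


def has_trust_failure(log_text: str) -> bool:
    """Return True if the log text contains the domain trust error message."""
    return TRUST_ERROR.lower() in log_text.lower()
```

A monitoring job could call this on recent log output of each container and flag the ones that need a restart.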

Describe the results you expected:

Domain trust stays stable with gMSA accounts.

Output of docker version:

Client:
 Version:      18.05.0-ce
 API version:  1.30 (downgraded from 1.37)
 Go version:   go1.9.5
 Git commit:   f150324
 Built:        Wed May  9 22:12:05 2018
 OS/Arch:      windows/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      17.06.2-ee-14
  API version:  1.30 (minimum version 1.24)
  Go version:   go1.8.7
  Git commit:   6345dd7
  Built:        Thu Jun 21 18:28:51 2018
  OS/Arch:      windows/amd64
  Experimental: false

Output of docker info:

Containers: 21
 Running: 21
 Paused: 0
 Stopped: 0
Images: 23
Server Version: 17.06.2-ee-14
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd json-file logentries splunk syslog
Swarm: inactive
Default Isolation: process
Kernel Version: 10.0 14393 (14393.2339.amd64fre.rs1_release_inmarket.180611-1502)
Operating System: Windows Server 2016 Standard
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 8GiB
Name: fj-v-docker1
ID: P3GE:EPMA:BCI4:XQTE:XHYN:AG2V:XZ37:627A:HTI3:3O5Y:GKKG:A2WX
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Updated Windows Server 2016 in a VMware cluster, running multiple .NET Framework and .NET Core containers.

The containers are provisioned with Ansible (docker_container module).

Each container has a dedicated IP address on a transparent Docker network.

Reference: https://social.msdn.microsoft.com/Forums/en-US/542286aa-1095-4af0-91f2-43ff0c0f9469/docker-container-with-gsma-loose-trust-relationship-after-a-few-days?forum=windowscontainers

daBONDi commented 5 years ago

I rebuilt all my images on Server 2019 LTSC container images, running on new Server 2019 Docker host VMs set up as described above.

The problem is better by a factor of 1000 :-) but I still sometimes lose trust when containers start.

jorisscheppers commented 5 years ago

Do you use a time synchronization tool to have both the container host and the DC sync their clocks? A difference in clocks can account for the behavior that you describe.
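For reference, Kerberos in Active Directory tolerates a clock difference of 5 minutes by default. A minimal sketch of the comparison, assuming you have already obtained the two timestamps by some other means; the function name is illustrative.

```python
from datetime import datetime, timedelta

# Default Kerberos clock-skew tolerance in Active Directory.
MAX_SKEW = timedelta(minutes=5)


def within_kerberos_skew(host_time: datetime, dc_time: datetime,
                         max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the two clocks differ by no more than max_skew."""
    return abs(host_time - dc_time) <= max_skew
```

Both timestamps should be timezone-aware, or both naive in the same timezone, otherwise the subtraction is not meaningful.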

daBONDi commented 5 years ago

I have an internal NTP atomic clock server. The Docker hosts and AD all sync with the PDC, and the PDC syncs with the NTP atomic clock server.

I don't think it is a Kerberos time skew issue, because other .NET containers on the same host do not have the issue, or have it only sometimes, depending on whether they crash. It is that random.

When a container is not working, nltest /parentdomain still reports OK, so the container should have the trust.

With Server 2019 Core hosts and the 2019 Core Docker image it is more stable, but the error is still there.

Maybe the problem is that I have 20+ containers running under the same gMSA account?

Does a Docker container need its own NTP source when the Docker host is domain-joined and uses the DC as its NTP source?
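The nltest check mentioned above could be run across containers from the host along these lines. A minimal sketch, assuming nltest is available in the image and exits with 0 on success (an assumption, not verified here); the helper names are hypothetical and the runner is passed in so the logic can be exercised without Docker.

```python
import subprocess
from typing import Callable


def parentdomain_command(container: str) -> list[str]:
    """Build the docker exec call that runs nltest /parentdomain."""
    return ["docker", "exec", container, "nltest", "/parentdomain"]


def trust_looks_ok(container: str, run: Callable[[list[str]], int]) -> bool:
    """Assumes exit code 0 from nltest means the query succeeded."""
    return run(parentdomain_command(container)) == 0


def run_with_subprocess(cmd: list[str]) -> int:
    """Run the command for real and return its exit code."""
    return subprocess.run(cmd).returncode
```

As reported in this thread, this check can pass while the application is still failing, so it is not sufficient on its own to detect the problem.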

faheem556 commented 4 years ago

We are also experiencing this issue. @daBONDi did you get it to work as expected?

daBONDi commented 4 years ago

No, we still have this error, but it occurs less often on Windows Server 2019 😟

Have you had any success with this?

We think it might be related to something like what is described here, but we have not had time to verify it:

https://boyan.io/kerberos-load-balancers/

daBONDi commented 4 years ago

@Faheemitian do you also have HAProxy in front as a reverse proxy?

faheem556 commented 4 years ago

@daBONDi Nope. We have Interlock in front, and according to its documentation it fully supports Kerberos.

daBONDi commented 4 years ago

@memonfaheem Have you found anything after nearly a year?

bsosnader commented 3 years ago

@daBONDi have you found any more information about this?

daBONDi commented 3 years ago

No, sorry, still investigating. Do you have the same problem? Also no information from Microsoft or Docker.

bsosnader commented 3 years ago

Hey @daBONDi, do your containers all use the gMSA account name they run under as their hostname? This was previously a requirement, but as of Windows Server 2019 it is no longer needed. In fact, giving all containers the same hostname may be what causes this issue. If you can, try letting Docker set each hostname automatically to the container ID instead of setting it to the gMSA account name.
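To make the two variants concrete, here is a minimal sketch that builds the `docker run` arguments with and without an explicit hostname. The image name, container name, and credential spec file name are hypothetical; `--security-opt "credentialspec=file://..."` is the documented way to pass a credential spec to a Windows container.

```python
from typing import Optional


def docker_run_args(image: str, name: str, credspec_file: str,
                    hostname: Optional[str] = None) -> list[str]:
    """Build docker run arguments for a gMSA-enabled Windows container."""
    args = [
        "docker", "run", "-d",
        "--name", name,
        "--security-opt", f"credentialspec=file://{credspec_file}",
    ]
    if hostname is not None:
        # Older setups pinned the hostname to the gMSA account name.
        args += ["--hostname", hostname]
    args.append(image)
    return args


# Suggested variant: no --hostname, so Docker uses the container ID.
auto_hostname = docker_run_args("myapp:latest", "app1-1", "webapp01.json")
# Previous requirement: hostname equal to the gMSA account name.
fixed_hostname = docker_run_args("myapp:latest", "app1-1", "webapp01.json",
                                 hostname="webapp01")
```

With the first variant every container gets a distinct hostname even though all of them share one credential spec.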

daBONDi commented 3 years ago

Thanks for your input, @bsosnader.

We upgraded to 2019 because of the issue you mention.

All containers get a different container name defined by an Ansible playbook; I need that so the playbook can identify the instances. All containers share the same gMSA account credentials and run on 2 VMs. Maybe that is the problem?

We run each application container on both Docker hosts (VMs). Both Docker hosts have permission for the gMSA, the gMSA account has the required SPNs for the service, and we use a dedicated IP address for each container.

If the IPs were the problem, because Kerberos authentication could not validate the SPN hostname or something similar, the containers would not work right after start, right?

But they do work, and after an unknown time they lose the trust.

I will try random container names in a test environment.

Any other ideas?

daBONDi commented 3 years ago

Each container also has a different name on both hosts, such as App1-1 and App1-2, so the container names are unique across both Docker hosts.