microsoft / nav-docker

Official Microsoft repository for Dynamics NAV in Docker resources. It has not yet been decided to what extent Microsoft will ship Docker images with NAV, so everything in this repo is work in progress and might be subject to deletion.
MIT License

Containers Unhealthy #315

Closed PleachiM closed 5 years ago

PleachiM commented 5 years ago

Can someone point me in the right direction? I have issues when I create multiple containers on our Azure container host.

Container host: Windows Server 2019 (Server Core)
Machine: D8s_v3 (8 vCPU, 32 GB RAM) with Premium Storage

At the moment I'm trying different port configurations to figure out why this unhealthy status appears. There are about 20 containers (with integrated SQLEXPRESS at the moment).

Can you please tell me how this Unhealthy status is determined?

Forgot to mention that there is nothing in the logs, and I can even connect to my container through a shell. The NAV service is up and running.

freddydk commented 5 years ago

HealthCheck in the dockerfile (https://github.com/Microsoft/nav-docker/blob/master/generic/DOCKERFILE) is defined as

HEALTHCHECK --interval=30s --timeout=10s CMD [ "powershell", ".\Run\HealthCheck.ps1" ]

so HealthCheck.ps1 in c:\run in the container must be failing. HealthCheck.ps1 calls CheckHealth.ps1 (which you can override in c:\run\my).

CheckHealth.ps1 will connect to the healthcheck endpoint of the web client (if installed) using the publicwebbaseurl. If Web client isn't installed it checks whether the service tier is running. https://github.com/Microsoft/nav-docker/blob/master/generic/Run/CheckHealth.ps1
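
As a rough illustration of the logic described above (not the actual CheckHealth.ps1, which is linked), a minimal sketch might look like this. The $webClientInstalled switch and the service name are assumptions made up for the example, and the URL is the one used later in this thread; the real script derives its values from the container's configuration.

```powershell
# Sketch only: mirrors the two-branch check described above, not the shipped script.
$webClientInstalled = $true                               # assumption for the example
$healthUrl          = "http://x3/NAV/System"              # URL as used later in this thread
$serviceName        = 'MicrosoftDynamicsNavServer$NAV'    # assumed service name

$healthy = $false
try {
    if ($webClientInstalled) {
        # Ask the Web client health endpoint; a healthy reply is {"result":true}.
        $response = Invoke-WebRequest -Uri $healthUrl -UseBasicParsing -TimeoutSec 10
        $body     = $response.Content | ConvertFrom-Json
        $healthy  = ($response.StatusCode -eq 200) -and ($body.result -eq $true)
    }
    else {
        # No Web client: fall back to checking that the service tier is running.
        $service = Get-Service -Name $serviceName -ErrorAction Stop
        $healthy = $service.Status -eq 'Running'
    }
}
catch {
    # Timeouts, connection errors and a missing service all count as unhealthy.
    $healthy = $false
}

# Docker treats exit code 0 as healthy and 1 as unhealthy.
if ($healthy) { exit 0 } else { exit 1 }
```

Because the HEALTHCHECK instruction starts a new powershell process for every check, the cost of that process start is paid every 30 seconds per container, on top of whatever the script itself does.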

PleachiM commented 5 years ago

I would call this a perfect explanation. Thank you very much. I will go deeper into this.

PleachiM commented 5 years ago

When I start my containers (now there are 16, just for testing) and connect them to our on-prem SQL Server, then, after container 12, all containers become unhealthy. It seems to be a resource problem. That's why I took SQL Express out of the picture.

Now I have also added --cpus=3. I will do more testing to figure out what exactly is causing this issue. It could even be a problem with one of my scripts, because all containers are created with different port settings in this test case.

@freddydk What do you think should be possible on a D8s_v3 Azure machine? Just your gut feeling.

Windows Server 2019 (Server Core), machine: D8s_v3 (8 vCPU, 32 GB RAM) with Premium Storage. There is nothing else except Portainer running on this machine.

PleachiM commented 5 years ago

When I connect to one of those 20 containers (x1-x20) and try:

$result = Invoke-WebRequest -Uri "http://x3/NAV/System" -UseBasicParsing -TimeoutSec 10

I get:

StatusCode        : 200
StatusDescription : OK
Content           : {"result":true}
RawContent        : HTTP/1.1 200 OK
                    Pragma: no-cache
                    Transfer-Encoding: chunked
                    X-Frame-Options: SAMEORIGIN
                    Content-Security-Policy: frame-ancestors 'self'
                    X-Content-Type-Options: nosniff
                    Cache-Control: no-store,no...
Forms             :
Headers           : {[Pragma, no-cache], [Transfer-Encoding, chunked], [X-Frame-Options, SAMEORIGIN], [Content-Security-Policy, frame-ancestors 'self']...}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        :
RawContentLength  : 15

I have no idea what else I can try.

PleachiM commented 5 years ago

The container health check script seems to be fine.

But it seems that the Docker service on the host is not able to run the health check for 20 containers within this interval.

HEALTHCHECK --interval=30s --timeout=10s CMD [ "powershell", ".\Run\HealthCheck.ps1" ]

Edit:

Health check exceeded timeout (10s)
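
For anyone who wants to look for the same message: one way to read the recorded health check results is docker inspect on the host. This is only a sketch; x3 is simply one of my container names, and it assumes the engine exposes the usual State.Health structure for the container.

```powershell
# Read the health state Docker has recorded for container "x3" (name assumed).
$health = docker inspect --format "{{json .State.Health}}" x3 | ConvertFrom-Json

$health.Status          # e.g. starting, healthy or unhealthy
$health.FailingStreak   # number of consecutive failed checks

# Each log entry carries the exit code and captured output of one check run.
$health.Log | Select-Object -Last 3 | Format-List Start, End, ExitCode, Output
```

The Output field of a failed entry is where a text like the timeout message above would be expected to show up.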

freddydk commented 5 years ago

Ouch… Didn't think that that would be a problem. I will have to think about that one.

PleachiM commented 5 years ago

Seems to be a known issue.

https://github.com/moby/moby/issues/33933

freddydk commented 5 years ago

It looks like you can override the healthcheck settings on docker run:

docker run --name=test -d \
    --health-cmd='stat /etc/passwd || exit 1' \
    --health-interval=2s \
    busybox sleep 1d

found here: https://docs.docker.com/engine/reference/run/#healthcheck

Maybe set the interval higher - or the timeout higher…

You would have to add this to AdditionalParameters
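
A sketch of what that could look like with New-NavContainer from navcontainerhelper. The container name, the $imageName variable and the 120s/60s values are placeholders I made up, not recommendations; only the timing flags are set, with the intent of keeping the image's own health command.

```powershell
# Placeholder values: tune interval and timeout for your own host.
$additionalParameters = @(
    "--health-interval=120s",   # check less often than the 30s defined in the Dockerfile
    "--health-timeout=60s"      # allow more than the 10s defined in the Dockerfile
)

# $imageName is assumed to hold the NAV image you normally use.
New-NavContainer -accept_eula `
                 -containerName "x3" `
                 -imageName $imageName `
                 -additionalParameters $additionalParameters
```

The same two flags can be passed directly to docker run if you are not using the helper.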

PleachiM commented 5 years ago

I also found this here. Don't get me wrong. Workarounds are fine but this machine should really be able to handle 20 Containers without SQLExpress on it. That's just my opinion.

I will try to override the HEALTHCHECK, but at the moment I don't know how to set the script and the shell through parameters.

But I found the issue below. Please have a look at the latest post there; it seems to be quite recent.

https://github.com/moby/moby/issues/33096

tfenster commented 5 years ago

@PleachiM Am I reading this correctly that you are trying to run 20 containers on 8 cores and 32 GB RAM? That would be 0.4 cores and 1.6 GB RAM per container, not even including what Windows Server needs as the host. I don't think that is a setup you can expect to work reliably and fast. In my experience, with "load business application" enabled, an NST takes that amount of RAM after startup without any usage.

marknitek commented 5 years ago

Well, for development, when not every container is used at the same time... That's common for traditional installations, at least in our company. To be fair, you can stop unused containers, but the issue still remains even with adequate hardware for 20 containers, I guess? Our company will move all possible dev/test NSTs to Docker soon, so this could be a real problem. And disabling the health check seems like a bad workaround...

tfenster commented 5 years ago

We have up to 25 containers running on a bigger machine and the health checks work there, so I don't think this is a general problem.

PleachiM commented 5 years ago

The problems started at around 11 containers. Then I began to play around with different numbers of containers and different settings. All my 20 containers are idle. From time to time I connect to them just to make sure that they are still responding. But this machine is purely for testing. I will do some measurements tomorrow.

At the moment we have 25 GB RAM in use. The CPU is going crazy. But there is nothing (except the health check) that should cause high CPU load.

Is there a way to deactivate the health check, so that no PowerShell is executed at all?

PleachiM commented 5 years ago

Recreated all my containers with --no-healthcheck=true, and now I have 85% memory usage and around 50-60% CPU usage. Before, I regularly had 100% CPU peaks. So is https://github.com/moby/moby/issues/33096 an issue for us?

PleachiM commented 5 years ago

Note: When working with the Web client there is no real CPU increase at all. But when I connect to the container and execute

powershell -command (Measure-command { powershell -command exit}).totalSeconds

CPU usage on my host goes to 100% again.

freddydk commented 5 years ago

I can definitely repro the same behavior with our images, and after reading the thread, I will do some tests. Our 2019 images are based on servercore and all other images are based on dotnetframework. All descriptions are 1+ years old, but they state that servercore has the issue and dotnetframework does not.

freddydk commented 5 years ago

There doesn't seem to be any difference between dotnet core and servercore images with ltsc2019 - but spawning powershell on these images does seem to be heavy on the CPU. I will email some of the people who were involved in the thread higher up.

freddydk commented 5 years ago

Just tried this with process isolation: spawning powershell takes 0.1 seconds and doesn't cause any CPU spikes. Using hyperv isolation, it takes almost a second and causes a CPU spike. I will close this as unrelated, as I don't think there is anything I can do about this.
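
A back-of-the-envelope estimate, using only the numbers in this thread (20 containers, a 30s interval, roughly 0.1s vs. 1s to start powershell), of how much time goes into process startup alone. It assumes the checks are evenly spread out and ignores the work the health check script itself does, so treat it as a rough lower bound, not a measurement.

```powershell
$containers  = 20    # containers on the host
$intervalSec = 30    # HEALTHCHECK --interval

# Approximate powershell startup time per isolation mode, as reported above.
$spawnSeconds = [ordered]@{ process = 0.1; hyperv = 1.0 }

foreach ($mode in $spawnSeconds.Keys) {
    # Seconds of powershell startup per second of wall-clock time, across all containers.
    $load = $containers * $spawnSeconds[$mode] / $intervalSec
    "{0}: {1:N2}" -f $mode, $load
}
```

That works out to roughly 0.07 for process isolation and 0.67 for hyperv isolation, which would fit the observation that the health checks only become a problem at higher container counts. Where process isolation is an option on the host, it can be requested with --isolation=process on docker run.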