PleachiM closed 5 years ago
HealthCheck in the dockerfile (https://github.com/Microsoft/nav-docker/blob/master/generic/DOCKERFILE) is defined as
HEALTHCHECK --interval=30s --timeout=10s CMD [ "powershell", ".\Run\HealthCheck.ps1" ]
so the healthcheck.ps1 in c:\run in the container must fail. HealthCheck calls checkhealth.ps1 (which you can override in c:\run\my)
CheckHealth.ps1 will connect to the healthcheck endpoint of the web client (if installed) using the publicwebbaseurl. If Web client isn't installed it checks whether the service tier is running. https://github.com/Microsoft/nav-docker/blob/master/generic/Run/CheckHealth.ps1
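As a rough illustration of the logic described above, a check of this kind could look something like the following. This is a hypothetical sketch, not the contents of CheckHealth.ps1: the function name, its parameters and the example URL are assumptions made for illustration. The only fixed part is Docker's contract that exit code 0 means healthy and exit code 1 means unhealthy.

```powershell
# Hypothetical sketch of a health check; NOT the actual CheckHealth.ps1.
function Test-ContainerHealth {
    param(
        [string] $HealthUrl,    # web client health endpoint; leave empty if no web client is installed
        [string] $ServiceName   # assumed name of the service tier's Windows service
    )
    try {
        if ($HealthUrl) {
            # Web client installed: ask its health endpoint and expect {"result":true}
            $response = Invoke-WebRequest -Uri $HealthUrl -UseBasicParsing -TimeoutSec 10
            return ($response.StatusCode -eq 200) -and
                   ((ConvertFrom-Json $response.Content).result -eq $true)
        }
        # No web client: fall back to checking that the service tier is running
        return (Get-Service -Name $ServiceName -ErrorAction Stop).Status -eq 'Running'
    }
    catch {
        # Timeouts, connection errors and missing services all count as unhealthy
        return $false
    }
}

# Docker only looks at the exit code of the health command
if (Test-ContainerHealth -HealthUrl 'http://x3/NAV/System') { exit 0 } else { exit 1 }
```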
I would call this a perfect explanation. Thank you very much. I will go deeper into this.
When I start my containers (there are now 16, just for testing) and connect them to our on-prem SQL Server, all containers become unhealthy after container 12. It seems to be a resource problem. That's why I took SQL Express out of the picture.
Now I have also added --cpus=3. I will do more testing to figure out what exactly is causing this issue. It could even be a problem with one of my scripts, because all containers are created with different port settings in this test case.
@freddydk What do you think should be possible on a D8s_v3 Azure machine? Just a gut feeling.
Windows Server 2019 (Server Core) Machine: D8s_v3 (8 vCPU, 32 GB RAM) with Premium Storage. There is nothing else except Portainer running on this machine.
When I connect to one of those 20 containers (x1-x20) and try
$result = Invoke-WebRequest -Uri "http://x3/NAV/System" -UseBasicParsing -TimeoutSec 10
I get:
StatusCode : 200
StatusDescription : OK
Content : {"result":true}
RawContent : HTTP/1.1 200 OK
Pragma: no-cache
Transfer-Encoding: chunked
X-Frame-Options: SAMEORIGIN
Content-Security-Policy: frame-ancestors 'self'
X-Content-Type-Options: nosniff
Cache-Control: no-store,no...
Forms :
Headers : {[Pragma, no-cache], [Transfer-Encoding, chunked], [X-Frame-Options, SAMEORIGIN], [Content-Security-Policy, frame-ancestors 'self']...}
Images : {}
InputFields : {}
Links : {}
ParsedHtml :
RawContentLength : 15
I have no idea what else I can try.
The container's healthcheck script seems to be fine.
But it seems that the Docker service on the host is not able to run the health check for 20 containers within this interval:
HEALTHCHECK --interval=30s --timeout=10s CMD [ "powershell", ".\Run\HealthCheck.ps1" ]
Edit:
Ouch… Didn't think that that would be a problem. I will have to think about that one.
Seems to be a known issue.
It looks like you can override the healthcheck settings on docker run:
docker run --name=test -d \
    --health-cmd='stat /etc/passwd || exit 1' \
    --health-interval=2s \
    busybox sleep 1d
found here: https://docs.docker.com/engine/reference/run/#healthcheck
Maybe increase the interval or the timeout.
You would have to add this to AdditionalParameters.
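A minimal sketch of what that could look like, assuming the navcontainerhelper module and its New-NavContainer cmdlet. The container name, the image placeholder and the 120s/30s values are illustrative only, not recommendations. As far as I understand, health flags that are not passed on docker run are inherited from the image's HEALTHCHECK, so the command itself should not need to be repeated.

```powershell
# Assumption: navcontainerhelper is installed; names and values below are placeholders.
$imageName = '<your NAV/BC image>'

# Health settings passed straight through to docker run
$additionalParameters = @(
    '--health-interval=120s',   # run the check less often than the image default of 30s
    '--health-timeout=30s'      # give a slow PowerShell start more time than the default 10s
)

New-NavContainer -accept_eula `
                 -containerName 'x1' `
                 -imageName $imageName `
                 -additionalParameters $additionalParameters
```

Replacing the two flags with '--no-healthcheck' would switch the check off entirely instead of slowing it down.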
I also found this here. Don't get me wrong: workarounds are fine, but this machine should really be able to handle 20 containers without SQL Express on it. That's just my opinion.
I will try to override the HEALTHCHECK, but at the moment I don't know how to set the script and shell from a parameter.
But I found this one, and if you look at the latest post, it seems to be quite up to date.
@PleachiM Am I reading this correctly that you are trying to run 20 containers on 8 cores and 32 GB RAM? That would be 0.4 cores and 1.6 GB RAM per container, not even including what Windows Server needs as the host. I don't think that is a setup you can expect to work reliably and fast. In my experience, with "load business application" enabled, an NST takes that amount of RAM after startup without any usage.
Well, for development, not every container is used at the same time. That's common for traditional installations, at least in our company. To be fair, you can stop unused containers, but I guess the issue would still remain with adequate hardware for 20 containers? Our company will move all possible dev/test NSTs to Docker soon, so this could be a real problem. And disabling the healthcheck seems like a bad workaround.
We have up to 25 containers running on a bigger machine and the healthchecks work there, so I don't think this is a general problem
The problems started at around 11 containers. Then I began to play around with different numbers of containers and different settings. All my 20 containers are idle. From time to time I connect to them just to make sure that they are still responding. But this machine is purely for testing. I will do some measurements tomorrow.
At the moment we have 25 GB RAM in use. The CPU is going crazy. But there is nothing (except the healthcheck) which should cause high CPU load.
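To see which containers account for that load, a one-off snapshot from the host can help. This is only a sketch using the standard docker stats command; the column selection is my own choice.

```powershell
# One snapshot (no live updating) of CPU and memory per container
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"
```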
Is there a way to deactivate the healthcheck, so that no PowerShell is executed at all?
Recreated all my containers with --no-healthcheck=true
and now I have 85% Memory Usage and around 50-60% CPU usage.
Before I regularly had 100% CPU peaks.
So is this https://github.com/moby/moby/issues/33096 an issue for us?
Note: When working with the Web client there is no real CPU increase at all.
But when I connect to the container and execute
powershell -command (Measure-command { powershell -command exit}).totalSeconds
CPU gets again 100% usage on my host.
I can definitely repro the same behavior with our images, and after reading the thread, I will do some tests. Our 2019 images are based off servercore and all other images are based off dotnetframework. All descriptions are 1+ years old but state that servercore has the issue and dotnetframework not.
There doesn't seem to be any difference between dotnet core and servercore images with ltsc2019, but spawning PowerShell on these images does seem to be heavy on the CPU. I will email some of the people who were involved in the thread higher up.
Just tried this with process isolation: spawning PowerShell takes 0.1 seconds and doesn't cause any CPU spikes. Using hyperv isolation, it takes almost a second and causes a CPU spike. I will close this as unrelated, as I don't think there is anything I can do about this.
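For anyone who wants to repeat that comparison, here is a rough sketch. It assumes a Windows Server 2019 host with Hyper-V available (process isolation needs an image that matches the host version), and the plain servercore image is my choice for the test, not necessarily the image used above.

```powershell
# Assumption: ltsc2019 host; image chosen for illustration only
$image   = 'mcr.microsoft.com/windows/servercore:ltsc2019'
$measure = '(Measure-Command { powershell -command exit }).TotalSeconds'

foreach ($isolation in 'process', 'hyperv') {
    # Time how long spawning a nested PowerShell takes inside a fresh container
    $seconds = docker run --rm "--isolation=$isolation" $image powershell -command $measure
    "{0,-8} {1} s" -f $isolation, $seconds
}
```

The first run per mode includes container start-up noise on the host, so it is worth running the loop a few times before comparing the numbers.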
Can someone point me in the right direction? I have issues when I create multiple containers on our Azure Container Host. Container Host: Windows Server 2019 (Server Core) Machine: D8s_v3 (8 vCPU, 32GBRam) with Premium Storage. At the moment I'm working on trying different Port Configurations to figure out why this unhealthy status appears. There are about 20 containers (with integrated SQLEXPRESS at the moment)
Can you please tell me how this Unhealthy status is determined?
Forgot to mention that there is nothing in the logs and I even can connect through shell to my container. NAV Service is up and running.
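One way to see why Docker marks a container as unhealthy is to read the health state it records for each container. Docker runs the health command at every interval, counts a non-zero exit code or a run longer than the timeout as a failure, and switches the status to unhealthy after a number of consecutive failures (3 by default). A small sketch, with x1 as an example container name:

```powershell
# Read the recorded health state of container x1 (name is an example)
$health = docker inspect --format '{{json .State.Health}}' x1 | ConvertFrom-Json

$health.Status          # starting, healthy or unhealthy
$health.FailingStreak   # number of consecutive failed checks

# The last few check runs, including exit code and captured output
$health.Log | Select-Object -Last 3 -Property Start, End, ExitCode, Output
```

This only works for containers that have a healthcheck; for containers created with --no-healthcheck the Health section is empty.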