artem-zinnatullin opened this issue 5 years ago
nginx will log something to the error log if it terminates; raising http://nginx.org/r/error_log severity might be needed.
I'm not sure how the "memory consumption" is calculated on your graphs - and it looks very wrong, since htop output shows nginx uses substantially less. free(1) is useless there since the kernel does not virtualize this.
When the health check fails, what's with the network connectivity for the container? anything in tcpdump?
anything in dmesg output on the nodes where this happens?
nginx will log something to the error log if it terminates; raising http://nginx.org/r/error_log severity might be needed.
Got it, increased to the warn level, rolling out now, will update once we get a container restarted.
The weird thing is that it exits with code 0; I've never seen Nginx do that on its own.
I'm not sure how the "memory consumption" is calculated on your graphs - and it looks very wrong, since htop output shows nginx uses substantially less.
We collect it with Prometheus, which should be getting it from the container runtime (we use cri-o, not Docker).
Yeah, I agree it's weird that the htop output doesn't add up.
free(1) is useless there since the kernel does not virtualize this.
Yep, just wanted to see the output of it; at least it showed a shared memory region that matched the 10m we set for keys_zone in Nginx.
When the health check fails, what's with the network connectivity for the container? anything in tcpdump?
That I'll have to collect more info on, prob not tcpdump.
anything in dmesg output on the nodes where this happens?
Will comment once we get new restarts.
Btw, forgot to mention that our k8s Ingress is also Nginx, but without caching, and it's able to handle the same traffic with no problems; I'm wondering if that's due to our caching setup.
It also doesn't happen in our staging environment where we run containers with the exact same config, they just don't get prod traffic.
Thanks for jumping on this, @thresheek, really appreciate it 👍 If you have other suggestions we should try, please share, I'll get back with requested logs and data once we collect it.
Reading https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt it seems the kernel counts page cache as "used" for a cgroup, which might explain the "used" size on your graphs. I would also think some k8s watchdog daemon (if not OOM?) sends the kill signal if it reaches a bad enough state.
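For what it's worth, one way to check how much of the cgroup's "used" memory is actually page cache is to read the cgroup's own accounting from inside the container. This is a sketch assuming cgroup v1 paths (the default for runtimes of that era); adjust the paths for your runtime:

```sh
# "cache" is page cache charged to the cgroup, "rss" is anonymous memory actually allocated by nginx
grep -E '^(cache|rss|mapped_file|total_inactive_file) ' /sys/fs/cgroup/memory/memory.stat

# the raw figure most "used" graphs report
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
```

If `cache` dominates and `rss` stays small, the graphs are mostly showing reclaimable page cache rather than nginx's own allocations; the same distinction shows up in the cAdvisor metrics Prometheus scrapes (container_memory_usage_bytes vs container_memory_working_set_bytes).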
Update on logs:
We got 3 + 4 + 0 restarts (3 pods) so far, one of them has this in the log right before termination:
2019/01/31 19:08:57 [alert] 1#1: worker process 8 exited on signal 9
No other warnings, alerts or errors. Nothing interesting in dmesg log.
Also kubectl describe doesn't indicate OOM; k8s is pretty informative about it when it terminates containers, we just see Terminated (Completed) and exit code 0.
Probably not OOM then.
Reading kernel.org/doc/Documentation/cgroup-v1/memory.txt it seems the kernel counts page cache as "used" for a cgroup, which might explain the "used" size on your graphs. I would also think some k8s watchdog daemon (if not OOM?) sends the kill signal if it reaches a bad enough state.
Interesting. Page caching is a feature of the kernel, right? Probably it's aware of the memory limit we set with the cgroup and just floats around it. I guess we just don't see that on other deployments because most of them don't work with the disk.
Not sure if Nginx can do much about it if it just reads from disk.
Do you think it could be this https://trac.nginx.org/nginx/ticket/1163, when Nginx exceeds the max_size of the disk cache? In our case we set it to 10g and we mount a volume with 11g.
I wonder if it gets too close to 11g and the worker process hangs; that could explain the timeouts we observe on health checks right before termination.
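For reference, a cache definition along the lines described above would typically look like this. This is a sketch using standard nginx directives; only the 10m and 10g values come from this thread, the path, zone name, and other parameters are illustrative:

```nginx
# keys_zone: 10m of shared memory for cache keys (matches the shm region mentioned earlier)
# max_size: the cache manager trims the on-disk cache back toward 10g,
#           so the 11g volume leaves roughly 1g of headroom for temp files and overshoot
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=proxy_cache:10m
                 max_size=10g inactive=60m use_temp_path=off;
```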
When the health check fails, what's with the network connectivity for the container? anything in tcpdump?
Network activity is low, just a couple of requests within the 60 seconds before termination. Unfortunately I can't easily get tcpdump on prod machines atm.
@artem-zinnatullin It seems that we're facing a similar issue. Did you manage to find a solution?
Not really, it still restarts once in a while with exit code 0 and no useful log messages unfortunately, but I think increasing the memory requirements for the container helped reduce the frequency of these restarts.
it still restarts once in a while with exit code 0
I feel like the answer to why the nginx container is restarting is in the initial description and Kubernetes config.
Nginx times out on a static return 200 health check sometimes
livenessProbe:
  httpGet:
    path: "/nginx-health"
    port: 80
  initialDelaySeconds: 1
  periodSeconds: 3
  failureThreshold: 1
  timeoutSeconds: 1
When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in the case of a liveness probe means restarting the container.

worker process 8 exited on signal 9

nginx is not terminating itself, but it does look like one of the worker processes was sent SIGKILL (finding the source of that might tell why). The config tells Kubernetes to stop and start the container if it fails liveness twice in a row (i.e. one retry). This could easily happen if the nginx worker process that was handling that liveness request was somehow killed (and perhaps the other workers are still recovering seconds later). Is the container running out of open fds or something else that would cause a process to be killed by the kernel?
Possibly similar to https://github.com/docker-library/php/issues/1048
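One low-risk experiment suggested by the analysis above (an assumption on my part, not something confirmed in this thread) is to make the probe less trigger-happy, so a single slow or dropped response doesn't restart the whole container:

```yaml
livenessProbe:
  httpGet:
    path: "/nginx-health"
    port: 80
  initialDelaySeconds: 1
  periodSeconds: 3
  timeoutSeconds: 3      # was 1; give a busy or recovering worker time to answer
  failureThreshold: 3    # was 1; require several consecutive failures before a restart
```

With these values the container is only restarted after roughly 9 seconds of consecutive failures instead of after a single 1-second timeout.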
FWIW we are observing inexplicable restarts with exit code 0 without k8s (we are running in docker-compose). In our deployment nginx is used as a very lightweight API proxy and its memory requirements are minimal. Absolutely no clue as to what could be causing this as it happens randomly, under otherwise completely normal conditions.
The same problem
FWIW, I've recently seen reports of some containerized software like mysql using excessive amounts of memory, where reproducing it across systems was inconsistent because it depended on what was configured for LimitNOFILE in docker.service (the systemd unit file, systemctl cat docker.service), which is sometimes assigned the value infinity, which in turn depends on how the OS has been configured to resolve that value. On some systems this is over 1 billion, while typically the host is half a million or lower.
If you are experiencing this memory problem, you can try to see if it persists when you run the container with the ulimit settings of your host: --ulimit "nofile=$(ulimit -Sn):$(ulimit -Hn)" (if you use docker-compose, there is a similar config option in the docs). The soft limit (-Sn) should usually be 1024, and the hard limit (-Hn) sets the maximum number of files permitted AFAIK, which software can opt in to raising itself to from the soft limit default.
Other software can stall and appear to hang while under heavy CPU activity, because it iterates through that entire range to close any open file descriptors, a common practice for daemon services. I've seen this delay the start-up of software that usually takes less than a second to 8-70 minutes. I've read reports of a build pipeline taking 10 hours instead of 15 minutes, and of situations where memory usage skyrockets. So it could potentially be the same problem you're experiencing?
I am seeing this also. I actually have a memory limit set but it doesn't seem to work (Unraid): --memory=1G --no-healthcheck --log-opt max-size=50m
Setup
We run docker.io/nginx:1.15.8 containers with a simple proxy + disk cache configuration. It runs as Kubernetes pods.
Observations
We observe 3 problems with this setup.
Problem 1: Nginx container terminates itself sometimes
We see the Nginx container terminating itself with code 0 with no warnings or errors in the logs. The logs just have the regular requests it serves all the time.
Problem 2: Nginx times out on a static return 200 health check sometimes
We see the Nginx container timing out (1 second) on a very, very simple health check.
We used to log it, but nothing stood out when we saw the timeouts.
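For readers without the original config in front of them: a "static return 200" health check endpoint like the /nginx-health path probed above is typically just a location block along these lines inside the server block. A generic sketch, not the author's actual config:

```nginx
location = /nginx-health {
    access_log off;             # optional; the thread mentions the check used to be logged
    default_type text/plain;
    return 200 "healthy\n";     # answered directly by the worker, no upstream or disk involved
}
```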
Problem 3: Nginx containers consume ~96% of given memory all the time
We're giving the container 1GB of memory as a limit. For some reason Nginx consumes pretty much all of it under production load, which is not anything special (60 RPS or so usually), and keeps consuming it after hours when we don't get any requests.
Moreover, we can't figure out what consumes memory in the container:
I've blamed shared memory first, but it seems to be fine too:
Exactly the 10 mb I set for the cache keys_zone…

Expectations
0
We're happy to provide additional information.
And we're really curious to find out where this memory consumption is coming from; the only thing left on my mind is the Linux page cache…
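One quick way to test the page-cache hypothesis is to compare the raw cgroup usage with a kubelet/cAdvisor-style "working set" (roughly usage minus inactive file cache). A sketch assuming cgroup v1 paths inside the container:

```sh
# Subtract inactive page cache from the raw usage figure the graphs show.
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
inactive_file=$(awk '/^total_inactive_file / {print $2}' /sys/fs/cgroup/memory/memory.stat)
echo "working set: $(( (usage - inactive_file) / 1024 / 1024 )) MiB (raw usage: $(( usage / 1024 / 1024 )) MiB)"
```

If the working-set number stays small while the raw usage sits near the 1GB limit, the ~96% figure is mostly reclaimable page cache from the disk cache reads and writes, not memory nginx itself has allocated.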
I've spent hours reading everything I could find, these links seem to be the only ones close to what we observe:
But still I can't figure how to fix it.
All problems seem to be caused by Problem №3: the memory consumption of the container.