kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

0.19: Logs full of "omiting metrics for the request, current batch is full while logging request" #3023

Closed: JordanP closed this issue 6 years ago

JordanP commented 6 years ago

Hi, I wanted to try the recent v0.19 on my production setup, but quickly after the rollout the logs filled up with messages like [lua] monitor.lua:71: call(): omiting metrics for the request, current batch is full while logging request, client: ..... This drove Fluentd to 100% CPU usage (of course...), so I had to roll back to v0.18 with --enable-dynamic-configuration=false.

Those boxes should handle only 1k RPS max, and each has 2 Nginx workers, so the code comment -- "if an Nginx worker processes more than (MAX_BATCH_SIZE/FLUSH_INTERVAL) RPS then it will start dropping metrics" -- shouldn't apply in theory.
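
For reference, a quick sanity check of that comment (a standalone Lua sketch, not controller code, using the MAX_BATCH_SIZE and FLUSH_INTERVAL values visible in the monitor.lua diff further down this thread):

  -- Constants as shown in the monitor.lua diff below (v0.19 values).
  local MAX_BATCH_SIZE = 10000 -- metrics buffered per worker between flushes
  local FLUSH_INTERVAL = 1     -- seconds between flushes

  -- Per-worker RPS above which metrics start being dropped.
  local drop_threshold = MAX_BATCH_SIZE / FLUSH_INTERVAL -- 10000 RPS

  -- Reported setup: ~1k RPS per box spread across 2 Nginx workers.
  local per_worker_rps = 1000 / 2 -- 500 RPS

  -- Well below the threshold, so in theory no metrics should be dropped.
  print(per_worker_rps < drop_threshold) -- true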

Any idea what might be wrong? Thanks!

Running in Kubernetes cluster version v1.9+ (v1.9.7-gke.5) - git (clean) commit 9b635efce81582e1da13b35a7aa539c0ccb32987 - platform linux/amd64

aledbf commented 6 years ago

@JordanP can you post 5 seconds of the logs? What you are describing is strange: the stats (Lua) are flushed every second, so you would need more than 1000 RPS to fill the queue.

JordanP commented 6 years ago

See attached file (4 MB uncompressed logs.tar.gz). You will need some grep'ing and wc'ing :)

fntlnz commented 6 years ago

@JordanP I'm not experiencing this behavior on 0.19. I deployed it as soon as it was released, and I have never seen any of the lines that appear in your log.

I'm curious what's in your nginx-configuration config map, to see whether I can trigger this behavior.

ElvinEfendi commented 6 years ago

@JordanP do you see "error when setting up timer.every" in the logs when Nginx starts? (Since you stripped the logs I cannot tell.) I want to make sure the timer that periodically flushes the queue is actually configured.

When you see these logs, can you check the Nginx worker processes and see whether any of them are "shutting down"? Since we don't flush the queue when the Lua timer expires while Nginx is shutting down, it could be that long-running connections keep an Nginx worker in the "shutting down" state for a long time. Now that I think about it more, we should probably always flush the queue regardless of the worker's state and let the socket communication fail if it will. UPDATE: thinking about this more, I don't see how it can be the issue, since a worker that is shutting down won't be receiving any new requests.
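
For context, a minimal sketch of the flush scheduling being discussed (not the exact controller code; it assumes the OpenResty ngx.timer API): init_worker registers a recurring timer, and the handler receives a premature flag when the worker is shutting down, which is the case described above.

  local FLUSH_INTERVAL = 1 -- second

  local function flush(premature)
    if premature then
      -- The timer fired because the worker is shutting down; the current
      -- code returns here without flushing, which is the behavior in question.
      return
    end
    -- serialize the batched metrics and send them over the socket here
  end

  local _M = {}

  function _M.init_worker()
    local ok, err = ngx.timer.every(FLUSH_INTERVAL, flush)
    if not ok then
      ngx.log(ngx.ERR, "error when setting up timer.every: ", tostring(err))
    end
  end

  return _M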

JordanP commented 6 years ago

I did not really strip the logs (besides the HTTP referer); I just let the container run for only a couple of seconds, because at more than 1 MB/s of logs it's not stable. The logs were captured right after the container started, so I guess no worker processes were in the "shutting down" state. I'll double-check tomorrow with a new log capture.

Something that comes to mind is that I make pretty heavy use of websockets (around 10k WS messages per second), but I don't see how that could have an impact.

My config map is pretty standard I guess:

  generate-request-id: "false"
  disable-access-log: "true"
  max-worker-connections: "120000" 
  use-gzip: "false"
  enable-brotli: "true"
  use-geoip: "false"
  worker-processes: "2"
  worker-shutdown-timeout: "11m"
  keep-alive: "650"
  keep-alive-requests: "10000"
  server-tokens: "false"

ElvinEfendi commented 6 years ago

@JordanP can you test index.docker.io/shopify/nginx-ingress-controller:debug_3023? It has the following patch:

ingress-nginx (master)$ git diff
diff --git a/rootfs/etc/nginx/lua/monitor.lua b/rootfs/etc/nginx/lua/monitor.lua
index 512ef46e..6fe03d39 100644
--- a/rootfs/etc/nginx/lua/monitor.lua
+++ b/rootfs/etc/nginx/lua/monitor.lua
@@ -4,7 +4,7 @@ local assert = assert

 local metrics_batch = {}
 -- if an Nginx worker processes more than (MAX_BATCH_SIZE/FLUSH_INTERVAL) RPS then it will start dropping metrics
-local MAX_BATCH_SIZE = 10000
+local MAX_BATCH_SIZE = 1000000
 local FLUSH_INTERVAL = 1 -- second

 local _M = {}
@@ -40,6 +40,7 @@ end

 local function flush(premature)
   if premature then
+    ngx.log(ngx.INFO, "skipping flushing, premature timer expiry")
     return
   end

@@ -57,6 +58,7 @@ local function flush(premature)
   end

   send(payload)
+  ngx.log(ngx.INFO, "successfully flushed")
 end

 function _M.init_worker()
@@ -68,7 +70,7 @@ end

 function _M.call()
   if #metrics_batch >= MAX_BATCH_SIZE then
-    ngx.log(ngx.WARN, "omiting metrics for the request, current batch is full")
+    ngx.log(ngx.WARN, "omiting metrics for the request, current batch is full: " .. tostring(#metrics_batch))
     return
   end

Also, please set error-log-level: "info".

JordanP commented 6 years ago

Great, thanks! I'll test this as early as Monday morning!

JordanP commented 6 years ago

@ElvinEfendi so I tested this image. Attached are the logs. What's very weird is that the logs contain neither the new lines added in the patch nor the previous "omiting metrics for the request" error... so I am not sure what's happening.

(Note that there was an unexpected shutdown of one of the upstream servers, hence a ton of errors near the end of the log file.) logs.tar.gz

JordanP commented 6 years ago

Hmm, my bad, now I see the "omiting metrics for the request, current batch is full: 1000000" errors. I just needed to wait... more :) Let me re-upload a new full log capture.

JordanP commented 6 years ago

Here's the log. The "omiting metrics for the request, current batch is full: 1000000" errors start after 1h 13min, and then it's a constant stream of these messages. logs.tar.gz

It"s as if the "flush" function was never called.

Edit: I am very sorry; indeed, flush was not called because init_worker was not called, because I am using a custom nginx/template/nginx.tmpl and did not update it with https://github.com/kubernetes/ingress-nginx/commit/2207d7694d980fcde77b15d9fdac5e0faa98258a#diff-980db9e4b88704f12338bd074839f94e. Sorry I wasted your time! :(
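
For anyone else running a custom template: the missing piece is the init_worker hook added by the linked commit. A rough sketch of what it looks like in nginx.tmpl (see the commit for the exact change; the require path is an assumption):

  init_worker_by_lua_block {
    -- Without this hook, monitor.init_worker() never runs, so the flush
    -- timer is never scheduled and the metrics batch only ever fills up.
    local monitor = require("monitor")
    monitor.init_worker()
  }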

JordanP commented 6 years ago

Yeah, I confirm everything is fine and works as expected. There is a very nice, clearly measurable CPU usage decrease on my side thanks to the metrics batching. Good job.