Closed by JordanP 6 years ago.
@JordanP can you post 5s of the logs? What you are describing is strange because the stats (lua) are flushed every second, so you would need more than 10,000 RPS per worker to fill the queue.
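For reference, a rough sketch of that threshold using the v0.19 constants from monitor.lua (just the arithmetic, not the real code, which batches per worker and flushes on a timer):

-- sketch of the per-worker drop threshold implied by monitor.lua (v0.19 values)
local MAX_BATCH_SIZE = 10000  -- metric entries a worker buffers before dropping new ones
local FLUSH_INTERVAL = 1      -- seconds between flushes

-- a worker only starts dropping metrics once it handles more requests per
-- flush interval than the batch can hold:
local drop_threshold_rps = MAX_BATCH_SIZE / FLUSH_INTERVAL
print(drop_threshold_rps)  -- 10000 requests/second per worker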
See attached file (4 MB uncompressed logs.tar.gz). You will need some grep'ing and wc'ing :)
@JordanP I'm not experiencing this behavior in 0.19. I deployed it as soon as it was released and I never got any of the lines I saw in your log.
I'm curious to see what's in your nginx-configuration config map, to see if I can trigger this behavior.
@JordanP do you see "error when setting up timer.every" in the logs when Nginx starts? (Since you stripped the logs I can't tell.) I want to make sure the timer that periodically flushes the queue is actually configured.
When you see these logs, can you check the Nginx worker processes and see whether they are "shutting down"? Since we don't flush the queue when the Lua timer expires while Nginx is shutting down, it could be that you have long-running connections that make an Nginx worker stay in the "shutting down" state for a long time. Now that I think more about it, I think we should just always flush the queue regardless of the state of the Nginx worker and let the socket communication fail if it will. UPDATE: thinking more about this, I don't see how it can be an issue, since a worker that is shutting down won't be getting any new requests.
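For context, roughly how the periodic flush is wired up (a minimal sketch with the lua-nginx-module timer API and the names from the patch further down, not the exact file contents; the premature flag is set when the timer fires because the worker is exiting):

local _M = {}
local FLUSH_INTERVAL = 1 -- second

local function flush(premature)
  -- premature is true when the timer fires because the worker process is exiting
  if premature then
    return
  end
  -- ... encode the batched metrics and send() them (omitted here) ...
end

function _M.init_worker()
  -- recurring timer; each expiry calls flush(premature)
  local ok, err = ngx.timer.every(FLUSH_INTERVAL, flush)
  if not ok then
    ngx.log(ngx.ERR, "error when setting up timer.every: " .. tostring(err))
  end
end

return _M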
I did not really strip the logs (besides the HTTP referer), I just let the container run for only a couple of seconds, because with more than 1 MB/s of logs it's not stable. The logs were captured right after the container started, so I guess no worker processes could have been in the "shutting down" state. I'll double-check tomorrow with a new log capture.
Something that comes to mind is that I make pretty heavy use of websockets (~10k WS messages per second), but I don't see how that could have an impact.
My config map is pretty standard I guess:
generate-request-id: "false"
disable-access-log: "true"
max-worker-connections: "120000"
use-gzip: "false"
enable-brotli: "true"
use-geoip: "false"
worker-processes: "2"
worker-shutdown-timeout: "11m"
keep-alive: "650"
keep-alive-requests: "10000"
server-tokens: "false"
@JordanP can you test index.docker.io/shopify/nginx-ingress-controller:debug_3023?
It has the following patch:
> ingress-nginx (master)$ git diff
diff --git a/rootfs/etc/nginx/lua/monitor.lua b/rootfs/etc/nginx/lua/monitor.lua
index 512ef46e..6fe03d39 100644
--- a/rootfs/etc/nginx/lua/monitor.lua
+++ b/rootfs/etc/nginx/lua/monitor.lua
@@ -4,7 +4,7 @@ local assert = assert
 local metrics_batch = {}
 -- if an Nginx worker processes more than (MAX_BATCH_SIZE/FLUSH_INTERVAL) RPS then it will start dropping metrics
-local MAX_BATCH_SIZE = 10000
+local MAX_BATCH_SIZE = 1000000
 local FLUSH_INTERVAL = 1 -- second
 local _M = {}
@@ -40,6 +40,7 @@ end
 local function flush(premature)
   if premature then
+    ngx.log(ngx.INFO, "skipping flushing, premature timer expiry")
     return
   end
@@ -57,6 +58,7 @@ local function flush(premature)
   end
   send(payload)
+  ngx.log(ngx.INFO, "successfully flushed")
 end
 function _M.init_worker()
@@ -68,7 +70,7 @@ end
 function _M.call()
   if #metrics_batch >= MAX_BATCH_SIZE then
-    ngx.log(ngx.WARN, "omiting metrics for the request, current batch is full")
+    ngx.log(ngx.WARN, "omiting metrics for the request, current batch is full: " .. tostring(#metrics_batch))
     return
   end
Also, please set error-log-level: "info".
Great, thanks! I'll test this as early as Monday morning!
@ElvinEfendi so I tested this image. Attached are the logs. What's very weird is that the logs contain neither the new lines added in the patch nor the previous "omiting metrics for the request" error... so I am not sure what's happening.
(Note that there was an unexpected shutdown of one of the upstream servers, hence a ton of errors near the end of the log file.) logs.tar.gz
Hmm, my bad, now I see the "omiting metrics for the request, current batch is full: 1000000" errors. I just needed to wait... more :) Let me re-upload a new full log capture.
Here's the log. The "omiting metrics for the request, current batch is full: 1000000" errors start after 1h13min, and then it's a constant stream of these messages. logs.tar.gz
It"s as if the "flush" function was never called.
Edit: I am very sorry, indeed flush was not called because init_worker was not called: I am using a custom nginx/template/nginx.tmpl and did not update it with https://github.com/kubernetes/ingress-nginx/commit/2207d7694d980fcde77b15d9fdac5e0faa98258a#diff-980db9e4b88704f12338bd074839f94e
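For anyone else with a custom template: the missing piece, roughly, is that nginx.tmpl's init_worker_by_lua_block has to call monitor.init_worker(); the exact change is in the commit linked above. A sketch, assuming monitor has already been require()d in init_by_lua_block:

init_worker_by_lua_block {
    -- without this call the flush timer is never created, so the
    -- per-worker batch just fills up and metrics start being dropped
    monitor.init_worker()
}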
Sorry I wasted your time! :(
Yeah, I confirm everything is fine and works as expected. Very nice CPU usage decrease thanks to the metrics batching, very measurable on my side. Good job.
Hi, I wanted to try the recent v0.19 on my production setup, but quickly after the rollout the logs became full of messages like
[lua] monitor.lua:71: call(): omiting metrics for the request, current batch is full while logging request, client: ....
This triggered 100% CPU usage of Fluentd (of course....). I had to roll back to v0.18 with --enable-dynamic-configuration=false.
Those boxes should only do 1k RPS max, and each has 2 Nginx workers, so this:
-- if an Nginx worker processes more than (MAX_BATCH_SIZE/FLUSH_INTERVAL) RPS then it will start dropping metrics
shouldn't apply in theory. Any idea what might be wrong? Thanks!
Running in Kubernetes cluster version v1.9+ (v1.9.7-gke.5) - git (clean) commit 9b635efce81582e1da13b35a7aa539c0ccb32987 - platform linux/amd64