Leonardo-Ferreira opened this issue 5 months ago
@JorTurFer / @wozniakjan do you have bandwidth to check this one?
Thank you for reporting this; from the graphs this indeed looks like a memory leak. I can investigate later today.
After a quick investigation, this might actually be related to HTTP connection overhead, GC, and request caching rather than an actual memory leak. I am able to reproduce rapid memory growth with concurrent requests. When idle on a tiny cluster with a single HTTPScaledObject, the interceptor uses around 21Mi:
$ kubectl top pods -nkeda --selector app.kubernetes.io/component=interceptor
NAME CPU(cores) MEMORY(bytes)
keda-add-ons-http-interceptor-6f6d78bd9-rkn2r 4m 21Mi
With requests sent sequentially, the memory usage barely increases and then drops back to idle levels. With a simple benchmark using wrk, with 100 concurrent threads and 100 connections,
$ wrk -t100 -c100 -d30s -H "Host: myhost2.com" http://172.19.0.2
the memory usage does increase, peaking at 54Mi:
$ kubectl top pods -nkeda --selector app.kubernetes.io/component=interceptor
NAME CPU(cores) MEMORY(bytes)
keda-add-ons-http-interceptor-6f6d78bd9-rkn2r 494m 54Mi
But after the wrk benchmark terminates, the CPU usage drops quickly, and while the memory usage decreases rather slowly, both eventually return to roughly the initial idle levels. With 100 threads and 500 connections, the interceptor gets OOMKilled pretty fast with the default limit of 64Mi; after bumping the memory limit to 1Gi, the memory again seems quite elastic, growing as well as shrinking across benchmark runs.
Also attached is trace output from net/http/pprof captured during the wrk benchmark. The memory growth is apparent, but so is the memory decrease, and there doesn't appear to be any place in the code where the memory only grows. I will run some longer benchmarks with profiling traces to try to spot something sneakier.
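For anyone who wants to reproduce the profiling step: Go's net/http/pprof package, imported for its side effects, exposes the /debug/pprof/* endpoints. The snippet below is a generic sketch of that wiring, not the interceptor's actual setup, and the listen address is an arbitrary assumption.

// Generic sketch of exposing Go's pprof endpoints for heap/trace collection;
// the interceptor may wire profiling differently (or not expose it by default).
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// localhost:6060 is an arbitrary choice for this example.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}

Heap profiles and execution traces can then be pulled with go tool pprof and go tool trace pointed at /debug/pprof/heap and /debug/pprof/trace.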
I may have noticed something: with a higher error rate in the benchmark, the memory consumption doesn't go down. I just bumped the memory limit for the interceptor to 1Gi and ran a benchmark that may have been too much, 5000 connections across 200 threads for 10 minutes:
$ wrk -t200 -c5000 -d10m -H "Host: myhost2.com" http://172.19.0.2
Running 10m test @ http://172.19.0.2
200 threads and 5000 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 688.19ms 506.07ms 2.00s 69.41%
Req/Sec 6.19 5.99 101.00 83.53%
351661 requests in 10.00m, 277.64MB read
Socket errors: connect 0, read 0, write 0, timeout 323052
Non-2xx or 3xx responses: 16769
Requests/sec: 586.00
Transfer/sec: 473.75KB
Out of ~350000 requests, 91% were timeouts and 4.7% errors. The memory usage is still at 531Mi, which is close to its ~600Mi peak during the benchmark, even though the interceptor has been idle for 10 minutes, and it doesn't seem to go down.
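One way to tell whether memory that lingers like this is freed-but-not-yet-returned Go heap or genuinely retained allocations would be to force a GC plus scavenge inside the interceptor process and watch whether cgroup usage drops. The following is a diagnostic sketch under the assumption that such a call could be triggered, e.g. behind a debug endpoint; it is not an existing feature of the add-on.

// Diagnostic sketch only: debug.FreeOSMemory forces a garbage collection and
// returns as much memory to the OS as possible. If kubectl top drops sharply
// afterwards, the lingering memory was heap the runtime had not yet given
// back; if it barely moves, something is still holding references.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	debug.FreeOSMemory()
	fmt.Println("forced GC + scavenge; re-check kubectl top / cgroup memory now")
}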
@wozniakjan my current number is around 15k requests per minute. I currently have 3 interceptor pods at 640 MB each... in about an hour and a half that number is expected to peak at 25k... during the late night, between 2am and 5am, the number drops to 5k...
@Leonardo-Ferreira I played around with benchmarks a bit more, and afaict the interceptor seems to be very CPU intensive: for wrk-simulated load matching a request rate of 25k/min, I am able to saturate 12 CPU cores easily. With that amount of CPU it is able to maintain a relatively low error rate of 0.42%.
With a low error rate, the memory usage appears to eventually return to idle levels, but when the error rate is high, the memory usage doesn't seem to go down even after the interceptor is taken off traffic.
Go's pprof heap analysis doesn't point to any code that would accumulate memory and not release it, but the heap as counted by pprof also appears to be just a fraction of the memory reported by cgroup accounting: pprof recorded 44MB while the cgroup shows 318MB used.
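Part of that gap is expected: a pprof heap profile only covers live heap objects, while the cgroup also charges goroutine stacks, runtime metadata, and freed pages the runtime has not yet returned to the OS. Below is a rough, generic sketch of comparing the two views from inside a Go process; the cgroup v2 path is an assumption about the node setup, and none of this is interceptor code.

// Rough comparison of Go runtime accounting vs cgroup accounting.
// HeapInuse is roughly what a pprof heap profile reports as in-use space,
// while Sys minus HeapReleased is closer to what the kernel charges the pod.
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

func mib(b uint64) uint64 { return b / (1 << 20) }

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapInuse:    %d MiB\n", mib(m.HeapInuse))
	fmt.Printf("HeapIdle:     %d MiB\n", mib(m.HeapIdle))
	fmt.Printf("HeapReleased: %d MiB\n", mib(m.HeapReleased))
	fmt.Printf("StackInuse:   %d MiB\n", mib(m.StackInuse))
	fmt.Printf("Sys:          %d MiB\n", mib(m.Sys))

	// cgroup v2; on cgroup v1 the file would be
	// /sys/fs/cgroup/memory/memory.usage_in_bytes instead.
	if b, err := os.ReadFile("/sys/fs/cgroup/memory.current"); err == nil {
		fmt.Printf("cgroup memory.current: %s bytes\n", strings.TrimSpace(string(b)))
	}
}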
@wozniakjan after looking deeply at my data, I could not correlate the memory usage or its rate of increase with the number of errors. Luckily (or not), we had an issue this morning where a dependency was flaky, so there was a significant spike in timeout-related errors; on the memory consumption graph there was no significant change during the hour-long event...
> simulated load that would match the request rate of 25k/min I am able to saturate 12 CPU cores easily
well, that's weird... our pods have a limit of 0.7
this is the last 24 hrs looking at the 95th percentile of CPU usage
> well, that's weird... our pods have a limit of 0.7
good to hear that, perhaps my artificial benchmark setup just ends up being CPU-intensive.
The current hypothesis I am trying to either prove or disprove is that when the interceptor itself fails to process requests and the error rate is higher, it may leave some resources dangling, e.g. open connections or entries in routing table caches.
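To illustrate the class of leak being hypothesized (a generic sketch, not the add-on's implementation; the upstream address is borrowed from the error logs later in this thread): a proxy whose transport has no idle-connection limits, or whose error paths skip draining and closing response bodies, keeps connections and their buffers pinned in memory in proportion to the error rate.

// Illustrative sketch only, not the interceptor's code. Two things that
// commonly keep memory pinned in a proxy under a high error rate: a transport
// with no idle-connection limits, and error paths that skip draining/closing
// response bodies, which blocks connection reuse and release.
package main

import (
	"io"
	"net/http"
	"time"
)

// newTransport sets explicit limits; without IdleConnTimeout and
// MaxIdleConnsPerHost, idle connections to a flaky upstream can accumulate.
func newTransport() *http.Transport {
	return &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	}
}

// fetch shows the cleanup that must also run on failure paths: the body is
// drained and closed so the underlying connection can be reused or freed.
func fetch(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err // no body to close when the request itself failed
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	client := &http.Client{Transport: newTransport(), Timeout: 5 * time.Second}
	_ = fetch(client, "http://demo.keda:8080/") // upstream address taken from the logs below
}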
would that output a specific log message that I could query for?
I can see plenty of these access logs in the interceptor:
2024-07-01T14:26:38Z INFO LoggingMiddleware 10.42.0.8:35634 - - [01/Jul/2024:14:26:36 +0000] "GET / HTTP/1.1" 502 11 "" ""
despite the fact that the scaled app always returns 200 OK.
Also, I can observe these errors in the interceptor logs:
2024-07-01T14:27:13Z ERROR LoggingMiddleware.RoutingMiddleware.CountingMiddleware.UpstreamHandler.StaticHandler Bad Gateway {"routingKey": "//myhost2.com/", "namespacedName": {"name":"demo","namespace":"keda"}, "stream": "http://demo.keda:8080/", "error": "context canceled"}
github.com/kedacore/http-add-on/interceptor/handler.(*Static).ServeHTTP
github.com/kedacore/http-add-on/interceptor/handler/static.go:36
github.com/kedacore/http-add-on/interceptor/handler.(*Upstream).ServeHTTP.func2
github.com/kedacore/http-add-on/interceptor/handler/upstream.go:54
net/http/httputil.(*ReverseProxy).ServeHTTP
net/http/httputil/reverseproxy.go:472
github.com/kedacore/http-add-on/interceptor/handler.(*Upstream).ServeHTTP
github.com/kedacore/http-add-on/interceptor/handler/upstream.go:57
main.runProxyServer.newForwardingHandler.func2
github.com/kedacore/http-add-on/interceptor/proxy_handlers.go:87
net/http.HandlerFunc.ServeHTTP
net/http/server.go:2166
github.com/kedacore/http-add-on/interceptor/middleware.(*Counting).ServeHTTP
github.com/kedacore/http-add-on/interceptor/middleware/counting.go:39
github.com/kedacore/http-add-on/interceptor/middleware.(*Routing).ServeHTTP
github.com/kedacore/http-add-on/interceptor/middleware/routing.go:93
github.com/kedacore/http-add-on/interceptor/middleware.(*Logging).ServeHTTP
github.com/kedacore/http-add-on/interceptor/middleware/logging.go:42
github.com/kedacore/http-add-on/interceptor/middleware.(*Metrics).ServeHTTP
github.com/kedacore/http-add-on/interceptor/middleware/metrics.go:24
net/http.serverHandler.ServeHTTP
net/http/server.go:3137
net/http.(*conn).serve
net/http/server.go:2039
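For context on why the access log shows 502 while the scaled app itself returns 200: the stack trace above ends inside httputil.ReverseProxy, whose error path runs when the client's request context is canceled (e.g. wrk or the caller giving up before the upstream answers). Below is a minimal, generic sketch of that mechanism with a custom ErrorHandler similar in spirit to the add-on's static Bad Gateway handler; it is not the actual interceptor code.

// Minimal sketch of how a reverse proxy turns a canceled client context into
// a 502: when the client disconnects or times out, the request context is
// canceled, the upstream round trip is aborted, and ErrorHandler is invoked
// with "context canceled" even though the upstream would have answered 200.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	upstream, err := url.Parse("http://demo.keda:8080/") // upstream from the logs above
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, proxyErr error) {
		// proxyErr is "context canceled" when the client gave up first.
		log.Printf("proxy error for %s: %v", r.URL.Path, proxyErr)
		w.WriteHeader(http.StatusBadGateway)
	}
	log.Fatal(http.ListenAndServe(":8080", proxy))
}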
Should we log some extra info and release a 0.8.1 version? Unfortunately my security department does not allow me to use "unofficial" images in production.
I'm still in the process of trying to figure out the root cause, but as soon as I have a valid hypothesis to test, I can distribute a build for your testing.
Any news, @wozniakjan? Can I help somehow?
I didn't get very far and now I had to put it on hold temporarily, but I will get back to this soon
Hey @wozniakjan, I'd like to contribute here. Would you be willing to connect for like 30-45 min so you can "boost" me? That way I could make significant contributions faster.
Sure, you can ping me on KEDA Slack: https://keda.sh/community/ https://kubernetes.slack.com/team/UF5R8HHB7
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity.
@zroubalik reopen?
Report
I think we have a memory leak. Here is the memory usage over 2 days:
The memory drops are the pod getting OOM killed. At first we thought 64Mi was too little, so we decided to experiment with higher values over the last 24 hrs, but the symptom persisted:
Expected Behavior
The pods should release memory as traffic fluctuates down and not get OOM killed.
Actual Behavior
Memory usage never decreases
Steps to Reproduce the Problem
Logs from KEDA HTTP operator
HTTP Add-on Version
0.8.0
Kubernetes Version
< 1.28
Platform
Microsoft Azure
Anything else?
AKS v1.27.9