jupyterhub / configurable-http-proxy

node-http-proxy plus a REST API
BSD 3-Clause "New" or "Revised" License

Memory leak in proxy? #388

Open snickell opened 2 years ago

snickell commented 2 years ago

Aloha, we've been seeing a pattern of growing daily memory usage (followed by increasing sluggishness, then non-responsiveness above around 1-2GB of RAM) in the 'proxy' pod:

image

The different colors are fresh proxy reboots, which have been required to keep the cluster running.

Screen Shot 2022-04-07 at 5 01 00 AM

-Seth

snickell commented 2 years ago

Sorry, clipped the units:

image

The pattern is nearly identical on the other cluster.

snickell commented 2 years ago

We're running z2jh chart version 1.1.3-n354.h751bc313 (I believe it was the latest ~3 weeks ago), but as you can see, this pattern predates this chart version by quite a bit.

snickell commented 2 years ago

We start seeing serious performance problems at about 1.5GB, which is suspiciously close to the heap limit for node 🤔 So maybe it's a memory leak that then cascade-fails at the heap limit into some sort of... garbage collection nightmare? Or?

manics commented 2 years ago

Do you happen to know if the memory increases are correlated with particular events, e.g. a user starting a new server, or connecting to a particular service?

snickell commented 2 years ago

No, but I'm looking into it. My vague suspicion: websockets? We push them pretty hard, e.g. many users are streaming VNC over websockets. Is there a log mode that has useful stats about e.g. the routing table?
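
(For anyone who lands here later: not a stats mode exactly, but two things that can help. CHP exposes its routing table over its REST API, and it accepts a --log-level flag for more verbose logging. The sketch below assumes a fairly standard z2jh install; the namespace, deployment name, and chart keys are assumptions and may differ in your setup.)

# dump CHP's current routing table via its REST API (default API port 8001;
# the token is available as CONFIGPROXY_AUTH_TOKEN inside the proxy pod;
# use curl instead of wget if that's what the image ships)
kubectl -n jhub exec deploy/proxy -- sh -c \
  'wget -qO- --header "Authorization: token $CONFIGPROXY_AUTH_TOKEN" http://localhost:8001/api/routes'

# run the proxy with more verbose logging by passing --log-level=debug to
# configurable-http-proxy (in z2jh, e.g. via proxy.chp.extraCommandLineFlags,
# if your chart version exposes it)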

snickell commented 2 years ago

OK, so a further development: since high RAM usage correlated with performance problems, I added a k8s memory limit to the pod, thinking it would get killed when it passed 1.4GB of RAM and reboot fresh; a decent-ish workaround for now.

Here's what happened instead:

image

Note that there's one other unusual thing here: I kubectl exec'ed several 200MB "RAM balloon" processes to try to push it over the edge faster for testing. They clearly didn't work, haha, and I doubt that's why this is not growing at the normal leakage rate, but it's worth mentioning.

Did something else change, or did adding a k8s memory limit suddenly change the behavior?
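
(For anyone wanting to do the same: in z2jh this kind of limit is typically set via the chart's proxy resources. A rough sketch, assuming the usual values path proxy.chp.resources and a made-up release name:)

helm upgrade jhub jupyterhub/jupyterhub --reuse-values \
  --set proxy.chp.resources.limits.memory=1400Mi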

snickell commented 2 years ago

(Note that this otherwise consistent memory growth pattern goes back to January, across a number of z2jh chart version upgrades since then... this is... weird.)

consideRatio commented 2 years ago

Hmmm, so when the pod restarts, is it because it has been evicted from a node, or because its process restarted within the container, etc.?

Being evicted from a node can happen based on external logic, while memory management within the container happens based on more internal logic, which can be influenced by limits clarifying that it must not surpass a certain amount.

I need to learn more about how OOM killing works within the container vs. by the kubelet etc., but perhaps you ended up helping it avoid getting evicted for surpassing its memory limit. Hmmm..

rcthomas commented 2 years ago

@snickell was what you observed related to load at all? Like, on weekend days do you observe this behavior? We're currently experiencing relatively-speaking high load on our deployment, and I observe something similar. Memory consumption in the proxy will just suddenly shoot up and it becomes non-responsive. Are you still using CHP for your proxy? I am considering swapping it for Traefik in the coming days here.

consideRatio commented 2 years ago

@snickell have you experienced this with older versions of z2jh -> chp as well?

marcelofernandez commented 1 year ago

Still happening on the latest version (v4.5.6).

shaneknapp commented 3 months ago

see also https://github.com/jupyterhub/configurable-http-proxy/issues/434

i believe the socket leak is the root cause of the memory leak. on our larger, more active hubs we've seen constant spiking of the chp ram under "load", and chp running out of heap space: https://github.com/jupyterhub/configurable-http-proxy/issues/434#issuecomment-1959901854

"load" is ~300+ users logging in around the "same time".

"same time" is anywhere from 15m to a couple of hours.

i don't believe that increasing the chp heap size is the correct fix, as the memory/socket leak still needs to be addressed. that said, increasing it may help, but that would need some experimentation.
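
(For anyone who does want to experiment with a larger heap: Node's old-space limit can be raised via NODE_OPTIONS. A rough sketch of what that looks like when running CHP by hand is below; the hub address and ports are assumptions, and how you inject the env var into the proxy container depends on your deployment, e.g. an extraEnv-style chart setting if your chart version has one.)

NODE_OPTIONS=--max-old-space-size=3072 configurable-http-proxy \
  --ip=0.0.0.0 --port=8000 --api-port=8001 \
  --default-target=http://hub:8081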

marcelofernandez commented 3 months ago

We finally replaced chp with traefik in our z2jh deployment, and this problem obviously went away. 😬

Check out that alternative just in case you are experiencing this.

shaneknapp commented 3 months ago

thanks, good to know. we've been considering this as well.

consideRatio commented 3 months ago

@marcelofernandez are you able to share config for your setup?

shaneknapp commented 3 months ago

We finally replaced chp with traefik in our z2jh deployment, and this problem obviously went away. 😬

Check out that alternative just in case you are experiencing this.

echoing @consideRatio -- do you have any relevant traefik config bits you could share? this would be super useful! :)

thanks in advance...

marcelofernandez commented 3 months ago

Hey guys, sure!

First and foremost, I'm sorry I can't give you all the details of my company's internal PR because:

That said, I can give you an overview of what I did.

The complicated part was that it seemed like nobody had done this before, so I based my work on this previous (and far more ambitious) rejected PR, which was originally aimed at replacing both proxies:

The only thing I did, based on that PR (because I only wanted stability), was to:

Based on Z2JH's architecture diagram, here are the changes.

Before: image

After: image

Once I had defined what I wanted, I just had to drop the unneeded code from the PR above and configure the hub to call the proxy in the same pod (http://localhost:8081), and that's it.

I implemented this about a year and a half ago; if you have more questions, just let me know...

Regards

manics commented 1 month ago

4.6.2 was released 2 months ago with a fix for the leaking sockets. Is there still a memory leak or can we close this issue?

shaneknapp commented 1 month ago

@manics i don't think we should close this yet... we still saw chp run out of nodejs heap on hubs w/lots of traffic and usage even after we deployed 4.6.2, but since summer is slow it hasn't bitten us yet.

i'm sure that within a few weeks we'll see OOMs/socket leaks once the fall term ramps up.

minrk commented 1 month ago

If anyone can make a stress test to provoke this, ideally with just CHP (or the JupyterHub Proxy API, like the traefik proxy benchmarks) I can test if the migration to http2-proxy will help. I tried a simple local test with a simple backend and apache-bench, but many millions of requests and hundreds of gigabytes later, I see no significant increase in memory or socket consumption (still sub-100MB). So there must be something relevant in typical use (websockets, connections dropped in a particular way, adding/removing routes, etc.) that a naïve benchmark doesn't trigger.
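
(Something along these lines, i.e. a sketch of the kind of naive local benchmark meant above, not the exact commands used:)

# trivial backend
python3 -m http.server 9000 &
# CHP in front of it, proxying everything to the backend
configurable-http-proxy --port 8000 --api-port 8001 \
  --default-target=http://127.0.0.1:9000 &
# hammer it through the proxy with apache-bench, then watch RSS / socket counts
ab -n 1000000 -c 100 http://127.0.0.1:8000/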

shaneknapp commented 5 days ago

If anyone can make a stress test to provoke this, ideally with just CHP (or the JupyterHub Proxy API, like the traefik proxy benchmarks) I can test if the migration to http2-proxy will help. I tried a simple local test with a simple backend and apache-bench, but many millions of requests and hundreds of gigabytes later, I see no significant increase in memory or socket consumption (still sub-100MB). So there must be something relevant in typical use (websockets, connections dropped in a particular way, adding/removing routes, etc.) that a naïve benchmark doesn't trigger.

re benchmarking... we really don't have the cycles, available staff, or deep enough understanding of how the proxy works to do this.

re the something: we're seeing mildly improved performance w/4.6.2 but are still experiencing pretty regular, albeit much shorter (and self-recovering), outages at "peak"[1] usage.

[1] peak can be anywhere from ~200 up to ~800 users on a hub.

for example, last night between 8:45p and 9p, we had ~188 students logged on to datahub (the lowest end of 'peak') and saw the proxy peg at 100% CPU and 1.16G RAM.

image

image

hub cpu hovered around ~40% until the outage, and during that 15m dropped to nearly 0%. hub memory usage was steady at ~477M.

only during the 15m of the outage (~8:45p - 9p) were our chp logs full of entries like this:

chp-error

not surprisingly, the readiness probes couldn't find either the hub or the proxy during this outage (and also the hub just after things recovered?):

chp-hub-down

i'll dig more through the logs and see what needles i can winnow out of the haystacks.

shaneknapp commented 5 days ago

running dmesg -T on the core node where this proxy was running indeed shows that chp is still getting OOMKilled:

[Wed Sep 11 21:48:22 2024] nginx invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=995
[Wed Sep 11 21:48:22 2024] CPU: 11 PID: 3160400 Comm: nginx Not tainted 6.1.90+ #1
[Wed Sep 11 21:48:22 2024] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024
[Wed Sep 11 21:48:22 2024] Call Trace:
[Wed Sep 11 21:48:22 2024]  <TASK>
[Wed Sep 11 21:48:22 2024]  dump_stack_lvl+0x4a/0x70
[Wed Sep 11 21:48:22 2024]  dump_header+0x52/0x250
[Wed Sep 11 21:48:22 2024]  oom_kill_process+0x10a/0x220
[Wed Sep 11 21:48:22 2024]  out_of_memory+0x3dc/0x5c0
[Wed Sep 11 21:48:22 2024]  ? mem_cgroup_iter+0x1c6/0x240
[Wed Sep 11 21:48:22 2024]  try_charge_memcg+0x827/0xa90
[Wed Sep 11 21:48:22 2024]  charge_memcg+0x3f/0x1f0
[Wed Sep 11 21:48:22 2024]  __mem_cgroup_charge+0x2b/0x80
[Wed Sep 11 21:48:22 2024]  handle_mm_fault+0xf80/0x16b0
[Wed Sep 11 21:48:22 2024]  do_user_addr_fault+0x271/0x4d0
[Wed Sep 11 21:48:22 2024]  exc_page_fault+0x78/0xf0
[Wed Sep 11 21:48:22 2024]  asm_exc_page_fault+0x22/0x30
[Wed Sep 11 21:48:22 2024] RIP: 0033:0x5cd7083be658
[Wed Sep 11 21:48:22 2024] Code: 10 e8 1c 5e 00 00 49 89 87 08 02 00 00 48 85 c0 0f 84 24 02 00 00 49 8b 8f e8 01 00 00 48 85 c9 74 20 31 d2 66 0f 1f 44 00 00 <80> 48 0a 01 49 8b 8f e8 01 00 00 48 83 c2 01 48 83 c0 60 48 39 ca
[Wed Sep 11 21:48:22 2024] RSP: 002b:00007fff48f73f70 EFLAGS: 00010283
[Wed Sep 11 21:48:22 2024] RAX: 00007ba90b837030 RBX: 00007ba90bad2050 RCX: 0000000000004000
[Wed Sep 11 21:48:22 2024] RDX: 000000000000112a RSI: 0000000000180000 RDI: 00005cd709921ea0
[Wed Sep 11 21:48:22 2024] RBP: 00007ba90a329380 R08: 00000000000000fe R09: 0000000000000000
[Wed Sep 11 21:48:22 2024] R10: 00007ba90b950ffc R11: 0000000000000006 R12: 00005cd70856a530
[Wed Sep 11 21:48:22 2024] R13: 00005cd7084eb121 R14: 00005cd7084eb12e R15: 00007ba90a329380
[Wed Sep 11 21:48:22 2024]  </TASK>
[Wed Sep 11 21:48:22 2024] memory: usage 2097168kB, limit 2097152kB, failcnt 806
[Wed Sep 11 21:48:22 2024] swap: usage 0kB, limit 9007199254740988kB, failcnt 0

this repeated regularly for about 30m, and then for another 30m we saw lots of messages like this:

[Wed Sep 11 22:38:28 2024] Tasks in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc982fb8d_dd48_4bff_a016_38cda0c5767a.slice/cri-containerd-58fd1ffa87a46e002e02ede50e42cf51b21de27dc6ffccf8f22032be3bcc2f80.scope are going to be killed due to memory.oom.group set
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3223198 (dumb-init) total-vm:220kB, anon-rss:8kB, file-rss:0kB, shmem-rss:0kB, UID:101 pgtables:24kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3223258 (nginx) total-vm:155116kB, anon-rss:37304kB, file-rss:7056kB, shmem-rss:756kB, UID:101 pgtables:188kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3225302 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3226667 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227348 (nginx) total-vm:171012kB, anon-rss:48508kB, file-rss:5728kB, shmem-rss:2528kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227349 (nginx) total-vm:172136kB, anon-rss:49644kB, file-rss:5728kB, shmem-rss:2492kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227350 (nginx) total-vm:172080kB, anon-rss:49560kB, file-rss:5728kB, shmem-rss:2544kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227351 (nginx) total-vm:171792kB, anon-rss:49336kB, file-rss:5728kB, shmem-rss:2504kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227382 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227423 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3227567 (nginx) total-vm:171948kB, anon-rss:49412kB, file-rss:5728kB, shmem-rss:2520kB, UID:101 pgtables:220kB oom_score_adj:995
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227592 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227680 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227713 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3227746 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3228024 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3228067 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3228303 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3230430 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: OOM victim 3230824 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:28 2024] Memory cgroup out of memory: Killed process 3230826 (nginx) total-vm:170948kB, anon-rss:48460kB, file-rss:5728kB, shmem-rss:4652kB, UID:101 pgtables:216kB oom_score_adj:995
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231122 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231172 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231288 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231347 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231814 (nginx) is already exiting. Skip killing the task
[Wed Sep 11 22:38:29 2024] Memory cgroup out of memory: OOM victim 3231921 (nginx) is already exiting. Skip killing the task

during all of this, the chp logs were full of thousands of 503 GET or 503 POST errors for active users attempting to get work done. :\

so: something is still amiss. chp is running out of heap space. there's absolutely a memory/socket leak remaining somewhere.

consideRatio commented 5 days ago

We have discussed two kinds of memory, network TCP memory and normal RAM memory. Getting killed for normal memory is a consequence of surpassing the requested memory via k8s.

This is normal memory killing, I think, so what memory request is configured for the proxy pod?

Note that I think the graph you have in Grafana may represent an average combination of pods if you have multiple proxy pods in the k8s cluster, so you could see memory usage below the requested amount even though an individual pod goes above it. I recall an issue opened about this... Found it: https://github.com/jupyterhub/grafana-dashboards/issues/128

Is normal memory still growing without bound over time as users come and go, and is chp getting memory-killed for that reason, making requesting more memory just a matter of buying time before a crash?

shaneknapp commented 5 days ago

We have discussed two kinds of memory, network TCP memory and normal RAM memory. Getting killed for normal memory is a consequence of surpassing the requested memory via k8s.

This is normal memory killing, I think, so what memory request is configured for the proxy pod?

      resources:
        requests:
          cpu: 0.001
          memory: 64Mi
        limits:
          memory: 1.5Gi

Note that I think the graph you have in Grafana may represent an average combination of pods if you have multiple proxy pods in the k8s cluster, so you could see memory usage below the requested amount even though an individual pod goes above it. I recall an issue opened about this... Found it: jupyterhub/grafana-dashboards#128

nope -- this is only one deployment, not a sum of them all.

Is normal memory still growing without bound over time as users come and go, and is chp getting memory-killed for that reason, making requesting more memory just a matter of buying time before a crash?

it seems to grow over time, and as it surpasses the Node.js max heap, CPU will inevitably ramp up before things start getting OOMKilled.

shaneknapp commented 5 days ago

We have discussed two kinds of memory, network TCP memory and normal RAM memory. Getting killed for normal memory is a consequence of surpassing the requested memory via k8s. This is normal memory killing, I think, so what memory request is configured for the proxy pod?

      resources:
        requests:
          cpu: 0.001
          memory: 64Mi
        limits:
          memory: 1.5Gi

fwiw i'm about to bump this to 3Gi "just to see".

shaneknapp commented 5 days ago

this just happened again and it really looks like we're running out of ephemeral ports...

shaneknapp commented 5 days ago

we're actually thinking about putting our chp pods in their own node pool, one proxy per (very small) node, to get past this until a fix or better solution comes along.

shaneknapp commented 5 days ago

currently on the impacted proxy node:

/srv/configurable-http-proxy $ netstat -natp|wc -l
24536

shaneknapp commented 5 days ago

hmm, well, this sheds some light on things: https://github.com/jupyterhub/configurable-http-proxy/issues/465

some of our hubs have > 1000 users.

felder commented 4 days ago

Noticed that when this behavior happens at Berkeley, we see accompanying logs indicating EADDRNOTAVAIL and tons of connection failures to the hub pod.

It's really like the issue described here: https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/

https://devops-insider.mygraphql.com/zh-cn/latest/network/tcp/ephemeral-port/ephemeral-port-exhaustion-and-how-to-avoid-it.html

Running netstat -natp on the chp proxy indicates it’s very possible we have enough connections to be running out of ephemeral ports.

That would likely explain the sudden increase in CPU as well, because I think the proxy rapidly retries when this behavior starts.
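
For anyone else chasing this, a couple of quick checks for ephemeral-port pressure from inside the proxy pod (assuming the usual procfs paths and that netstat is available in the image, as above):

# range of local ports available for outgoing connections
cat /proc/sys/net/ipv4/ip_local_port_range
# connection counts per TCP state; lots of ESTABLISHED/TIME_WAIT sockets toward
# the same destination (e.g. the hub) points at the exhaustion described in the
# cloudflare post above
netstat -nat | awk 'NR>2 {print $6}' | sort | uniq -c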