solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

gateway-proxy (envoy) pod increases memory (heap_size) over time with steady state config #8472

Status: Open. bdecoste opened this issue 1 year ago.

bdecoste commented 1 year ago

Gloo Edge Version

1.13.x

Describe the bug

When Gloo's configuration is in a steady state and there is no traffic, the gateway-proxy (Envoy) pod's memory usage (heap_size) has been observed to grow over time. This has only been observed in 1.12 and 1.13; it was not observed in earlier versions.

Steps to reproduce the bug

Use the attached scripts:

1. Create 200 sets of VirtualService (VS), RouteTable (RT), and Upstream resources.
2. Deploy an echo service.
3. Port-forward port 19000 (the Envoy admin port) of the gateway-proxy.
4. Run scale-echo.sh. It repeatedly scales the echo service up and down, which generates EDS updates, and scales the gloo pod up and down, which resets the xDS connection.
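The contents of scale-echo.sh are not reproduced in the issue. As a rough, hypothetical equivalent, the Python sketch below drives the same churn with `kubectl scale`. The deployment names and namespaces (`echo` in `default`, `gloo` in `gloo-system`), the replica counts, and the pause length are all assumptions, not values taken from the attached script.

```python
import subprocess
import time

# Hypothetical targets: adjust to your cluster. "gloo"/"gloo-system" assume a
# default Gloo Edge install; the echo deployment name and namespace are made up.
ECHO = ("echo", "default")
GLOO = ("gloo", "gloo-system")


def scale_cmd(deployment: str, namespace: str, replicas: int) -> list[str]:
    """Build a `kubectl scale` command for one deployment."""
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", namespace,
    ]


def churn(iterations: int, pause_s: float = 30.0, reset_xds: bool = True,
          run=subprocess.run) -> None:
    """Repeatedly scale echo (EDS churn) and optionally gloo (xDS reconnects)."""
    # Changing the echo endpoints makes Gloo push new EDS updates to Envoy.
    steps = [(ECHO, 3), (ECHO, 1)]
    if reset_xds:
        # Scaling gloo to zero and back drops and re-establishes Envoy's xDS stream.
        steps += [(GLOO, 0), (GLOO, 1)]
    for _ in range(iterations):
        for (deployment, namespace), replicas in steps:
            run(scale_cmd(deployment, namespace, replicas), check=True)
            time.sleep(pause_s)
```

`churn(100)` approximates the loop described above, while `churn(100, reset_xds=False)` scales only the echo service, which would separate EDS churn from xDS reconnects (the variant asked about later in the thread).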

Over time, I see the following progression in the proxy's memory stats:

{
 "allocated": "46823624",
 "heap_size": "60817408",
 "pageheap_unmapped": "0",
 "pageheap_free": "5267456",
 "total_thread_cache": "121464",
 "total_physical_bytes": "65067946"
}
{
 "allocated": "47083968",
 "heap_size": "65011712",
 "pageheap_unmapped": "0",
 "pageheap_free": "9609216",
 "total_thread_cache": "121464",
 "total_physical_bytes": "69270442"
}
{
 "allocated": "47372064",
 "heap_size": "69206016",
 "pageheap_unmapped": "0",
 "pageheap_free": "11411456",
 "total_thread_cache": "121464",
 "total_physical_bytes": "73464746"
}
{
 "allocated": "47514472",
 "heap_size": "71303168",
 "pageheap_unmapped": "0",
 "pageheap_free": "11485184",
 "total_thread_cache": "121464",
 "total_physical_bytes": "75561898"
}
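The four snapshots can be compared field by field. The plain-Python sketch below uses the numbers copied from the output above and computes how much each field grew between the first and last sample, relative to the growth of heap_size. heap_size grows by exactly 10 MiB (10,485,760 bytes), while allocated grows by only 690,848 bytes (about 6.6% of that), and pageheap_free accounts for about 59%. With pageheap_unmapped staying at 0, this pattern is consistent with the allocator retaining freed pages rather than with live objects accumulating, although four samples alone do not establish the cause.

```python
# Memory snapshots copied from the issue (bytes; Envoy reports them as strings).
# Only the fields that change meaningfully are kept.
SNAPSHOTS = [
    {"allocated": "46823624", "heap_size": "60817408",
     "pageheap_free": "5267456", "total_physical_bytes": "65067946"},
    {"allocated": "47083968", "heap_size": "65011712",
     "pageheap_free": "9609216", "total_physical_bytes": "69270442"},
    {"allocated": "47372064", "heap_size": "69206016",
     "pageheap_free": "11411456", "total_physical_bytes": "73464746"},
    {"allocated": "47514472", "heap_size": "71303168",
     "pageheap_free": "11485184", "total_physical_bytes": "75561898"},
]


def growth(snapshots: list[dict[str, str]], field: str) -> int:
    """Bytes gained in `field` between the first and last snapshot."""
    return int(snapshots[-1][field]) - int(snapshots[0][field])


def summarize(snapshots: list[dict[str, str]]) -> dict[str, tuple[int, float]]:
    """Map each field to (growth in bytes, growth as a fraction of heap_size growth)."""
    heap_growth = growth(snapshots, "heap_size")
    return {
        field: (growth(snapshots, field), growth(snapshots, field) / heap_growth)
        for field in ("allocated", "pageheap_free", "total_physical_bytes")
    }


if __name__ == "__main__":
    for field, (delta, share) in summarize(SNAPSHOTS).items():
        print(f"{field:>22}: +{delta:>10,} bytes ({share:.1%} of heap_size growth)")
```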

Expected Behavior

The gateway-proxy (Envoy) pod's memory usage stays flat over time when the configuration is in a steady state and there is no traffic.
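One way to check for this behavior is to sample heap_size periodically and compare it against its first reading. The sketch below reads Envoy's admin `/memory` endpoint through the port-forward from the reproduction steps (`localhost:19000` is assumed) and reports whether heap_size stayed within a chosen tolerance. The helper names and the tolerance are illustrative, not part of Gloo or Envoy.

```python
import json
import time
import urllib.request

# Assumes the gateway-proxy admin port is port-forwarded to localhost:19000.
ADMIN_MEMORY_URL = "http://localhost:19000/memory"


def fetch_heap_size(url: str = ADMIN_MEMORY_URL) -> int:
    """Read heap_size (bytes) from Envoy's admin /memory endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return int(json.load(resp)["heap_size"])


def is_steady(samples: list[int], tolerance_bytes: int) -> bool:
    """True if no sample rose more than `tolerance_bytes` above the first one."""
    if not samples:
        return True
    baseline = samples[0]
    return all(s - baseline <= tolerance_bytes for s in samples)


def watch(count: int, interval_s: float, tolerance_bytes: int,
          fetch=fetch_heap_size) -> bool:
    """Take `count` heap_size samples `interval_s` apart and test for steadiness."""
    samples = []
    for _ in range(count):
        samples.append(fetch())
        time.sleep(interval_s)
    return is_steady(samples, tolerance_bytes)
```

For example, `watch(count=60, interval_s=60, tolerance_bytes=4 * 1024 * 1024)` samples once a minute for an hour; the 4 MiB tolerance is an arbitrary choice. Applied to the four heap_size values reported above, `is_steady` with that tolerance returns `False`, since heap_size grew by 10 MiB.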

scripts.tar.gz

sam-heilbron commented 1 year ago

@bdecoste Are you able to reproduce this without resetting the xDS connection (restarting Gloo)?

github-actions[bot] commented 2 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.