It might be that the way holos render loops over Cue instances when the user specifies /... at the end of the build arguments is what causes memory to balloon.

A quick fix might be: don't use /..., and instead call holos render as a separate process for each holos component.

Is there still a memory ballooning problem if each holos component is rendered as an individual command instead of one catch-all?
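To make that concrete, here's a rough sketch of the "one process per component" idea. The find expression and directory layout are assumptions (GNU find), not how holos actually discovers instances:

```bash
# Hypothetical: enumerate candidate component directories and render each one in
# its own holos process instead of a single catch-all /... build.
find docs/examples/platforms/reference/clusters/foundation/cloud -name '*.cue' -printf '%h\n' \
  | sort -u \
  | xargs -t -I% holos render --cluster-name=k2 %
```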
I'll leave it here with you, Nate. I think my hypothesis in the comment above is probably the quick band-aid fix.
The hypothesis holds. When running with /... the garbage collector struggles to keep up and the GC goal balloons up to ~10 GiB:
GODEBUG=gctrace=1 holos render --cluster-name=k2 /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/foundation/cloud/...
However, we can use the --print-instances flag I added in PR #148 to spread these out into multiple processes:
holos render --cluster-name=k2 /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/foundation/cloud/... --print-instances \
| GODEBUG=gctrace=1 xargs -t -P1 -I% holos render --cluster-name=k2 % 2>&1 | tee foo.txt
https://gist.github.com/jeffmccune/bf8f634f7462b1916e7a0d3383e3d354
A quick scan of this gist provides some insight. Some components don't take much memory at all:

- prod-mesh-gateway takes a lot, 2463 MB goal at the end
- prod-platform-obs is only a 58 MB goal; maybe it completely bypasses the projects structures?
- etc...
Overall, though, it is a quick win to spread the components out: the largest balloon is around 2 GiB instead of 10 GiB.
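Since each per-instance render now peaks around 2 GiB, a possible follow-up is fanning out with xargs -P while capping each process with the Go runtime's soft memory limit. A sketch, assuming the holos binary is built with Go 1.19+ (which honors GOMEMLIMIT) and that a 2 GiB soft cap per process is acceptable:

```bash
# Render instances 4 at a time; GOMEMLIMIT is inherited by each child holos
# process and asks its GC to stay under ~2GiB (a soft target, not a hard cap).
holos render --cluster-name=k2 /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/foundation/cloud/... --print-instances \
  | GOMEMLIMIT=2GiB xargs -t -P4 -I% holos render --cluster-name=k2 %
```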
In Slack, Jeff mentioned that https://github.com/holos-run/holos/blob/v0.70.0/docs/examples/platforms/reference/clusters/foundation/cloud/mesh/mesh.cue#L11 might be a big contributor to memory requirements.
That is the single auth proxy we use for everything; I bet that's the culprit.
I processed the GC logs while rendering each cluster's individual instances and found that the foundation/cloud/mesh/ and provisioner/projects paths are, by an order of magnitude, the biggest memory hogs.
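For reference, the per-path numbers above come from the gctrace output. Something like the following pulls the peak GC goal out of a single render's log, assuming the documented Go gctrace line format (fields like "... 4->4->2 MB, 2463 MB goal, ..."); <instance> is a placeholder:

```bash
# Print the largest "MB goal" value the GC reported while rendering one instance.
GODEBUG=gctrace=1 holos render --cluster-name=k2 <instance> 2>&1 \
  | awk '/MB goal/ { for (i = 3; i <= NF; i++) if ($i == "goal,") print $(i - 2) }' \
  | sort -n | tail -1
```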
Using Git bisect, I found the following two commits to be the main causes of memory issues. The criterion for identifying a bad commit was a render of a single instance that took more than 30s or had a GC goal >= 1000 MB.
- One commit causes the memory issues on the provisioner cluster while rendering the provisioner/projects path.
- The other causes the memory issues while rendering the foundation/cloud/mesh/ paths.

I wasn't able to make headway on improving memory usage in the Go or Cue code. I tried a few suggestions like removing unneeded let declarations, but nothing had a noticeable effect on GC goals.
I did update hack/render-all in the holos-infra repo so that running it doesn't consume more than ~2 GB of memory per render: https://github.com/holos-run/holos-infra/commit/8aefcb12c31bc034d6a0aa15ecb341743993a3e6. This uses the new --print-instances flag so that each holos render executed by xargs is smaller and uses less memory than collecting an entire cluster's platform into one render.
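The shape of that wrapper is roughly the following. This is only a sketch: the cluster names and instance root here are placeholders, and the real script is hack/render-all in holos-infra:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder cluster list and instance root; see hack/render-all for the real values.
clusters=(k2 provisioner)
root=docs/examples/platforms/reference/clusters

for cluster in "${clusters[@]}"; do
  # Render each instance in its own small holos process instead of one big render.
  holos render --cluster-name="$cluster" "$root"/... --print-instances \
    | xargs -t -P1 -I% holos render --cluster-name="$cluster" %
done
```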
I'm going to stop here as this is good enough for now.
While researching this, I found that Cuelang has a lot of open issues about performance and memory leaks, and that this is an active area of development and interest for the Cue developers. It seems like future versions of Cue might end up fixing memory problems for us, so I recommend we try new alpha and beta versions of Cue as they are released.
With v0.9.0-alpha.2 and higher, we can set CUE_EXPERIMENT=evalv3 to use the new, more performant evaluator; a quick comparison is sketched at the end of this comment.

Another interesting possible follow-up to this issue is looking at Unity, Cue's automated performance and regression testing framework:
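A quick way to compare the two evaluators on one instance. Whether the holos binary vendors a Cue release new enough to recognize the experiment flag is an assumption, and <instance> is a placeholder:

```bash
# Print the last GC goal with the default evaluator, then with evalv3 enabled.
GODEBUG=gctrace=1 holos render --cluster-name=k2 <instance> 2>&1 | grep 'MB goal' | tail -1
CUE_EXPERIMENT=evalv3 GODEBUG=gctrace=1 holos render --cluster-name=k2 <instance> 2>&1 | grep 'MB goal' | tail -1
```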
Closed via #179 and #183
Problem:
On my 32 GiB workstation with 1G swap, the following command results in multiple processes consuming over 30% of system memory. The Linux OOM killer kicks in and starts sending kill -9's to random processes.
Solution:
???
Result: We have a way to limit memory usage. It's acceptable to run holos in parallel, but we need to get usage under 4Gi, otherwise we won't be able to run it inside of pods with reasonable resource limits in place.

Where to start
- #ProjectHosts is brutal and enumerates all hosts for a project
- #EnvHosts is brutal but it shouldn't be used much since it was the first stab for httpbin
- project-template.cue