natemccurdy closed this 5 months ago
Looks like there are issues with vendored Helm chart files clobbering each other in concurrent renders.
```console
$ git clean -fdx
Removing saas/components/argo/cd/vendor/
Removing saas/components/cert-manager/vendor/
Removing saas/components/crossplane/controller/vendor/
Removing saas/components/eks-pod-identity-webhook/vendor/
Removing saas/components/external-secrets/vendor/
Removing saas/components/istio/base/vendor/
Removing saas/components/istio/mesh/cni/vendor/
Removing saas/components/istio/mesh/gateway/vendor/
Removing saas/components/istio/mesh/istiod/vendor/
Removing saas/components/login/zitadel-server/vendor/

$ holos render platform ./platform
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/login/zitadel-certs cluster=management num=4 total=64 duration=1.828944088s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/istio/mesh/httpbin/routes cluster=aws1 num=19 total=64 duration=1.8991467s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/pgo/controller cluster=aws1 num=21 total=64 duration=2.18369136s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/argo/creds cluster=aws1 num=31 total=64 duration=1.130460137s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/argo/routes cluster=aws1 num=30 total=64 duration=1.089624699s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/istio/mesh/cni cluster=aws2 num=40 total=64 duration=3.965683892s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/cert-letsencrypt cluster=management num=2 total=64 duration=1.124516594s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/eso-creds-manager cluster=management num=1 total=64 duration=1.384222009s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/ecr-creds-manager cluster=management num=5 total=64 duration=1.242500902s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/login/zitadel-secrets cluster=aws1 num=22 total=64 duration=1.116147967s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/external-secrets cluster=aws2 num=34 total=64 duration=4.541646733s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/eks-pod-identity-webhook cluster=management num=6 total=64 duration=2.81272254s
10:41AM INF platform.go:45 ok render component version=0.83.1 path=components/istio/mesh/iap/authpolicy cluster=aws2 num=52 total=64 duration=1.296626354s
10:41AM ERR could not execute version=0.83.1 code=unknown err="could not rename: rename /Users/nate/src/holos-run/holos-infra/saas/components/external-secrets/vendor1071240550/external-secrets /Users/nate/src/holos-run/holos-infra/saas/components/external-secrets/vendor/external-secrets: file exists" loc=helm.go:159
10:41AM ERR could not execute version=0.83.1 code=unknown err="could not render component: exit status 1" loc=platform.go:40
```
From Slack:
> So, I think we:
>
> - Document that the only thing that should ever write to the vendor directory is the `cacheChart` method.
> - Document our assumption that direct sub-directories are moved into place atomically.
> - Handle the error by logging it at debug level instead of returning an error, then continue with that for loop.
>
> The temp directory is already cleaned up, so this should be a pretty minimal change. It might also be worth a comment that this is the reason the temp directory is placed in the same directory as the destination: that guarantees it's on the same filesystem, since renames aren't atomic across filesystems (e.g. from /tmp to /home).
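A minimal sketch of that approach in Go, assuming a `cacheChart`-style helper (the function name, signature, `fetch` callback, and logger here are illustrative, not the actual holos code): the temp directory is created next to the destination so the rename never crosses a filesystem boundary, and the goroutine that loses the race treats the existing destination as success.

```go
package chartcache

import (
	"errors"
	"log/slog"
	"os"
	"path/filepath"
)

// cacheChart is an illustrative sketch. It fills a temp directory and
// atomically renames the chart sub-directory into the vendor directory.
// The temp directory lives in the same parent as the destination, so the
// rename is a same-filesystem (and therefore atomic) operation.
func cacheChart(vendorDir, chart string, fetch func(dir string) error) error {
	if err := os.MkdirAll(vendorDir, 0o755); err != nil {
		return err
	}
	// Temp dir beside vendorDir, e.g. .../external-secrets/vendor1071240550.
	tmp, err := os.MkdirTemp(filepath.Dir(vendorDir), "vendor")
	if err != nil {
		return err
	}
	defer os.RemoveAll(tmp) // always cleaned up, win or lose the race

	if err := fetch(tmp); err != nil {
		return err
	}

	src := filepath.Join(tmp, chart)
	dst := filepath.Join(vendorDir, chart)
	if err := os.Rename(src, dst); err != nil {
		// EEXIST ("file exists") means a concurrent render won the race
		// and the chart is already cached; log at debug and carry on.
		if errors.Is(err, os.ErrExist) {
			slog.Debug("chart already cached by a concurrent render", "dst", dst)
			return nil
		}
		return err
	}
	return nil
}
```

Because the rename is atomic on a single filesystem, a half-written chart is never visible under `vendor/`; the only observable states are "not cached yet" and "fully cached".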
This adds a new flag, `--concurrency <int>`, to `holos render platform`.

Default concurrency is set to `min(runtime.NumCPU(), 8)`, which is the lesser of 8 or the number of CPU cores. In testing, I found that past 8 there are diminishing or negative returns due to the memory usage of rendering each component. In practice, this reduced rendering of the SaaS platform components from ~90s to ~23s on my 12-core MacBook Pro.
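For illustration only (a sketch of the semantics, not the holos implementation; `renderComponent` is a hypothetical stand-in for the per-component render step), a bounded worker pattern with that default could look like this:

```go
package render

import (
	"context"
	"runtime"

	"golang.org/x/sync/errgroup"
)

// defaultConcurrency returns min(runtime.NumCPU(), 8): never more
// workers than cores, capped at 8 to bound memory usage.
func defaultConcurrency() int {
	if n := runtime.NumCPU(); n < 8 {
		return n
	}
	return 8
}

// renderPlatform renders every component with at most limit renders in
// flight at once; the first error cancels the remaining work.
func renderPlatform(ctx context.Context, components []string, limit int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(limit) // g.Go blocks once limit goroutines are running
	for _, c := range components {
		c := c // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			return renderComponent(ctx, c)
		})
	}
	return g.Wait()
}

func renderComponent(ctx context.Context, path string) error {
	// Placeholder for the real per-component render work.
	return nil
}
```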
This run uses the default concurrency value of `8`: