travisghansen / argo-cd-helmfile

Integration between argo-cd and helmfile
MIT License
213 stars 55 forks

context deadline exceeded / unknown sync #24

Closed kfirfer closed 1 year ago

kfirfer commented 2 years ago

Hello

Sometimes apps in Argo CD show this message:

/usr/local/bin/helmfile --helm-binary /usr/local/bin/helm --no-color --allow-no-matching-release --namespace chaos-testing repos
Adding repo chaos-mesh https://charts.chaos-mesh.org
in ./helmfile.yaml: command "/usr/local/bin/helm" exited with non-zero status:

PATH:
  /usr/local/bin/helm

ARGS:
  0: /usr/local/bin/helm (19 bytes)
  1: repo (4 bytes)
  2: add (3 bytes)
  3: chaos-mesh (10 bytes)
  4: https://charts.chaos-mesh.org (29 bytes)
  5: --force-update (14 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  Error: context deadline exceeded

COMBINED OUTPUT:
  Error: context deadline exceeded

The apps go into an Unknown sync status, and after a few minutes (1-3) they sync and return to normal.

It doesn't matter which app; it can be any of them.

What causes this?

travisghansen commented 2 years ago

That's a good question :( I usually only see that when my git endpoint goes down, otherwise I don't really see that error.

How many helm repos are you loading up?

kfirfer commented 2 years ago

I have around 50 helm repos and 60 apps.

It usually happens when I'm changing values or adding new charts. When the system is static, it doesn't occur that much.

tpatrascu-flowx commented 2 years ago

Hello,

Could be that helm/helmfile build/template takes a long time, so I think you can fix this with:

https://github.com/argoproj/argo-cd/blob/master/docs/user-guide/config-management-plugins.md#using-a-cmp

"Important: If your sidecar CMP command runs too long, the command will be killed, and the UI will show an error. The CMP server respects the timeouts set by the server.repo.server.timeout.seconds and controller.repo.server.timeout.seconds items in argocd-cm. Increase their values from the default of 60s.

Each CMP command will also independently timeout on the ARGOCD_EXEC_TIMEOUT set for the CMP sidecar. The default is 90s. So if you increase the repo server timeout greater than 90s, be sure to set ARGOCD_EXEC_TIMEOUT on the sidecar."
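A minimal sketch of what the settings described in that passage might look like as manifests. The resource names, namespace, sidecar container name, and the 300s/5m values are assumptions for illustration. The quoted passage names argocd-cm, while recent Argo CD releases document these keys under argocd-cmd-params-cm, so check which one applies to your version.

```yaml
# Sketch only: verify the ConfigMap name and keys against your Argo CD version.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm   # assumption: the quoted docs say argocd-cm
  namespace: argocd            # assumption: default install namespace
data:
  # Both default to 60 (seconds) according to the quoted docs.
  server.repo.server.timeout.seconds: "300"
  controller.repo.server.timeout.seconds: "300"
---
# Patch-style fragment for the repo server Deployment (not a complete manifest).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server     # assumption: default Deployment name
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: helmfile-plugin   # hypothetical CMP sidecar container name
          env:
            # Defaults to 90s; raise it once the repo server timeout exceeds 90s.
            - name: ARGOCD_EXEC_TIMEOUT
              value: "5m"
```

The exec timeout is kept at or above the repo server timeout here so that neither limit cuts off the plugin command before the other.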

kfirfer commented 2 years ago

I have set these values in the Argo CD chart:

controller:
  args:
    statusProcessors: "40"
    operationProcessors: "20"
    selfHealTimeout: "10"
    repoServerTimeoutSeconds: "600"

and set the env var ARGOCD_EXEC_TIMEOUT to 10m on the repo server.

I will test it and report back whether it helped.

travisghansen commented 2 years ago

How many argo apps do those 50 repos and 60 apps span?

kfirfer commented 2 years ago

What do you mean by "span"?

As for the timeout settings from my previous comment (repoServerTimeoutSeconds: "600" on the controller and ARGOCD_EXEC_TIMEOUT set to 10m on the repo server): no, they didn't help.

kfirfer commented 2 years ago

Maybe I will try to set up a Helm chart proxy (e.g. Harbor) and point every repository at the proxy. What do you think?

travisghansen commented 2 years ago

I mean how many instances of argocd app are in the cluster? Is it all 1 giant app in the same ns or are there 50 argocd apps?

kfirfer commented 2 years ago

There are 50 argocd apps across around 25 namespaces.

Each namespace has exactly 2 argocd apps: one for helmfile and one for manifests (e.g. namespace, resource quotas, etc.). Each helmfile has around 1-5 helm charts (depending on what the namespace does).

In this cluster we have around 60+ helm chart releases, 50 argocd apps, and 25 namespaces.

travisghansen commented 2 years ago

Ok that helps! Do you put all the helm repos in a base that all helmfiles use? Or does each helmfile only pull down the repos it actually uses?

Said differently, does each argo app end up refreshing/syncing all 50 repos?
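To make the distinction concrete, here is a rough sketch of a per-app helmfile.yaml that declares only the repo it needs. The repo name, URL, and namespace come from the error output earlier in the thread; the release name, chart name, and the base file path are assumptions.

```yaml
# helmfile.yaml for a single Argo CD app (sketch, not the reporter's actual file)
repositories:
  # `helmfile repos` adds and refreshes only the repos listed here.
  - name: chaos-mesh
    url: https://charts.chaos-mesh.org

releases:
  - name: chaos-mesh               # hypothetical release name
    namespace: chaos-testing
    chart: chaos-mesh/chaos-mesh   # assumption: chart name within the repo

# The alternative layout asked about: a shared base that lists every repo,
# which would make each app refresh all of them.
# bases:
#   - ../common/repositories.yaml  # hypothetical path
```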

kfirfer commented 2 years ago

No, they are separate. There's no shared base for the repos; each helmfile has its own repos.

travisghansen commented 2 years ago

Hmm, then the number shouldn’t impact you too much. I’m not sure what would be taking so long. Do they take a long time to render/template locally on your workstation?

You may try bumping the number of repo server pods to alleviate the pressure and scale out a little bit.
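As a sketch, assuming Argo CD is installed with the community argo-cd Helm chart (the thread mentions "argocd chart" but not which one), the replica count might be raised like this:

```yaml
# values.yaml fragment (assumption: community argo-cd Helm chart)
repoServer:
  # Run more than one repo server pod to spread manifest generation load.
  replicas: 2
```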

kfirfer commented 2 years ago

I have set the repo server replicas to 2 instead of 1. I will check how it works now and update.

kfirfer commented 2 years ago

@travisghansen Unfortunately, the unknown sync issue still exists.

Somehow, only when I change values in some app (no matter which one), other apps get the unknown sync problem.

It seems like Argo CD is trying to sync the apps regardless of whether they changed in git.

They also can't download the charts for some period of time:

/tmp/helmfile3211379699/dex-dex-values-6487f67968 (49 bytes)
  346: --kube-version=1.23 (19 bytes)
  347: --api-versions=acme.cert-manager.io/v1 (38 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  Error: failed to download "dex/dex" at version "0.9.0"

COMBINED OUTPUT:
  Error: failed to download "dex/dex" at version

Maybe caching the helm charts in the repo server could solve it?

travisghansen commented 1 year ago

Any luck figuring this out?

kfirfer commented 1 year ago

I haven't yet set up and tried proxying the helm repositories.

I have set the monitoring threshold to an hour (not ideal), so it ignores the issue for now until I test it more deeply.