tilt-dev / tilt-extensions

Extensions for Tilt
https://tilt.dev/
Apache License 2.0
201 stars 159 forks source link

`helm_resource` did not wait for `resource_deps` on `helm_repo` #500

Open nschmeller opened 1 year ago

nschmeller commented 1 year ago

Hi all, I bumped into a race condition recently in a CI run where helm_resource was supposed to wait for helm_repo to finish but did not. Here are some logs from what happened:

vm-charts │ Initial Build
grafana-char… │ 
grafana-char… │ Initial Build
grafana-char… │ Running cmd: helm repo add grafana-charts https://grafana.github.io/helm-charts --force-update
    vm-charts │ Running cmd: helm repo add vm-charts https://victoriametrics.github.io/helm-charts --force-update
ingress-char… │ 
ingress-char… │ Initial Build
ingress-char… │ Running cmd: helm repo add ingress-charts https://kubernetes.github.io/ingress-nginx --force-update
    vm-charts │ "vm-charts" has been added to your repositories
   vmoperator │ 
   vmoperator │ Initial Build
   vmoperator │ STEP 1/1 — Deploying
   vmoperator │      Running cmd: python3 /home/runner/.local/share/tilt-dev/tilt_modules/github.com/tilt-dev/tilt-extensions/helm_resource/helm-apply-helper.py
   vmoperator │      Running cmd: ['helm', 'upgrade', '--install', 'vmoperator', 'vm-charts/victoria-metrics-operator']
   vmoperator │      Release "vmoperator" does not exist. Installing it now.
ingress-char… │ "ingress-charts" has been added to your repositories
      grafana │ 
      grafana │ Initial Build
      grafana │ STEP 1/1 — Deploying
      grafana │      Running cmd: python3 /home/runner/.local/share/tilt-dev/tilt_modules/github.com/tilt-dev/tilt-extensions/helm_resource/helm-apply-helper.py --values grafana/values.yaml
      grafana │      Running cmd: ['helm', 'upgrade', '--install', '--values', 'grafana/values.yaml', 'grafana', 'grafana-charts/grafana']
      grafana │      Release "grafana" does not exist. Installing it now.
      grafana │      Error: repo grafana-charts not found
      grafana │      Traceback (most recent call last):
      grafana │        File "/home/runner/.local/share/tilt-dev/tilt_modules/github.com/tilt-dev/tilt-extensions/helm_resource/helm-apply-helper.py", line 66, in <module>
      grafana │          subprocess.check_call(install_cmd, stdout=sys.stderr)
      grafana │        File "/usr/lib/python3.10/subprocess.py", line [36](https://github.com/lacework/agent/actions/runs/5081096334/jobs/9137856036#step:7:37)9, in check_call
      grafana │          raise CalledProcessError(retcode, cmd)
      grafana │      subprocess.CalledProcessError: Command '['helm', 'upgrade', '--install', '--values', 'grafana/values.yaml', 'grafana', 'grafana-charts/grafana']' returned non-zero exit status 1.
      grafana │ 
      grafana │ ERROR: Build Failed: apply command exited with status 1

The Tiltfile section that caused that is

# Install Grafana Helm Charts
helm_repo(name='grafana-charts',
          url='https://grafana.github.io/helm-charts')
helm_resource(name='grafana',
              chart='grafana-charts/grafana',
              flags=['--values', './dev/grafana/values.yaml'],
              port_forwards=[3000])
watch_file('./dev/grafana/values.yaml')

Here, grafana has a dependency on grafana-charts, but in the logs above, the chart was not found!

I went ahead and did some triaging in the hopes that it would be useful. My suspicion for what happened:

I propose that helm_repo's call to local_resource adds a new readiness_probe which uses an exec_action of helm repo list -o json | jq -e 'map(select(.name == <name-of-chart>)) != []' to determine success of chart installation.

Let me know if this all makes sense please!

nicks commented 1 year ago

I think you're misreading the docs.

local_resource readiness checks apply to the serve_cmd, not the build cmd. The helm_repo function uses the build cmd. I didn't see anything in your logs to suggest grafana deployed before grafana-charts finished.

nschmeller commented 1 year ago

Gotcha, thanks for the clarification -- could there be another reason why this would be flaky / racy?

nicks commented 1 year ago

You say you expected it to wait on resource_deps, but I don't see any resource_deps in the code snippet you posted?

nschmeller commented 1 year ago

My bad, I pasted the wrong snippet from the Tiltfile 😅 -- the failed CI run happened after I committed code with the resource_deps added:

diff --git a/monitoring/Tiltfile b/monitoring/Tiltfile
index 30cb8f76e4..e0a3696cd7 100644
--- a/monitoring/Tiltfile
+++ b/monitoring/Tiltfile
@@ -56,6 +56,7 @@ helm_repo(name='grafana-charts',
 helm_resource(name='grafana',
               chart='grafana-charts/grafana',
               flags=['--values', './dev/grafana/values.yaml'],
+              resource_deps=['grafana-charts'],
               port_forwards=[3000])
 watch_file('./dev/grafana/values.yaml')

So the state of the Tiltfile when this happened was actually

# Install Grafana Helm Charts
helm_repo(name='grafana-charts',
          url='https://grafana.github.io/helm-charts')
helm_resource(name='grafana',
              chart='grafana-charts/grafana',
              flags=['--values', './dev/grafana/values.yaml'],
              resource_deps=['grafana-charts'],
              port_forwards=[3000])
watch_file('./dev/grafana/values.yaml')

Sorry again for the confusion -- any ideas still?

nicks commented 1 year ago

I'm not 100% confident 'helm repos add' is concurrency safe. From your logs, it looks like two repo-adds (repos-add?) ran at the same time, but only one printed, which is a bit suspicious. Maybe we could try removing the allow_parallel param from helm_repo and see if that helps?

nschmeller commented 1 year ago

I'm not sure how to force the race, but I added the problematic code to the test Tiltfile in helm_resource in the hopes of reproducing. Let me know how that PR looks!