tilt-dev / tilt

Define your dev environment as code. For microservice apps on Kubernetes.
https://tilt.dev/
Apache License 2.0
7.66k stars 301 forks source link

local_resource with serve_cmd hangs when restarting #4455

Closed ZacxDev closed 3 years ago

ZacxDev commented 3 years ago

Expected Behavior

the local process should be killed and restarted

note: this works as of tilt v0.18.12 but is broken in v0.19.6 and some other newer versions

Current Behavior

the resource goes into a "pending" state and hangs there forever

Steps to Reproduce

  1. have a local_resource with a serve_cmd in tilt v0.19.6
  2. try to restart it via web trigger or dep change
  3. see it hang

Context

tilt doctor Output

$ tilt doctor
Tilt: v0.19.6, built 2021-04-19
System: linux-amd64
---
Docker
- Host: [default]
- Version: 1.41
- Builder: 2
---
Kubernetes
- Env: k3d
- Context: k3d-dev-cluster
- Cluster Name: k3d-dev-cluster
- Namespace: default
- Container Runtime: containerd
- Version: v1.20.5+k3s1
- Cluster Local Registry: none
---
Thanks for seeing the Tilt Doctor!
Please send the info above when filing bug reports. 💗

The info below helps us understand how you're using Tilt so we can improve,
but is not required to ask for help.
---
Analytics Settings
--> (These results reflect your personal opt in/out status and may be overridden by an `analytics_settings` call in your Tiltfile)
- User Mode: opt-in
- Machine: 4c75bdbcf791bf962731aacef4a92002
- Repo: V5jmyrpRTMUrVSrqP8rb3A==

About Your Use Case

we are trying to have webpack auto-restart when the .env file changes

here is the local_resource from our tiltfile:

local_resource(
  'run: mimobl.com',
  serve_cmd='cd ui/mimobl.com && (npm run dev & npm run build)',
  deps = ['ui/mimobl.com/.env', 'ui/mimobl.com/package.json'],
  readiness_probe=probe(
    period_secs=15,
    http_get=http_get_action(port=9000, path="/")
  ),
  links=['http://localhost:9000'],
)
nicks commented 3 years ago

I played around with this a bit and was not able to reproduce it.

Since I don't have access to your npm code, I added:

local_resource(
  'wait: tilt.dev',
  serve_cmd='./wait.sh; echo done')

where wait.sh looks like this:

#!/bin/bash

function cleanup()
{
    echo "cleanup 1 sec"
    sleep 1
    echo "cleanup 10 sec"
    sleep 10
    echo "cleanup 300 sec"
    sleep 300
}

trap cleanup EXIT

echo "sleep 1000 sec"
sleep 1000

but everything worked as expected (tilt eventually killed the process group).

When you hit the "restart" button, do you see any output like "Waiting 30s for process to exit" in the logs?

nicks commented 3 years ago

ooh, i had an idea - when you hit the restart button, does the resource change to say "Pending" like this?

Screenshot from 2021-04-20 18-19-34

i think there's been a long-standing issue where sometimes the scheduler won't parallelize things right, and we should fix that, tho i could easily see that getting misinterpreted as "hanging" if you didn't notice the extra variable here

ZacxDev commented 3 years ago

ooh, i had an idea - when you hit the restart button, does the resource change to say "Pending" like this?

Screenshot from 2021-04-20 18-19-34

i think there's been a long-standing issue where sometimes the scheduler won't parallelize things right, and we should fix that, tho i could easily see that getting misinterpreted as "hanging" if you didn't notice the extra variable here

Thanks for the quick replies, it is much appreciated Here is what it looks like when I hit restart:

Screen Shot 2021-04-20 at 5 28 40 PM

Could it have something to do with us running 2 processes concurrently in the local_resource serve_cmd? ex: serve_cmd='cd ui/mimobl.com && (npm run dev & npm run build)' (note the single '&')

When I wrote that I assumed wrapping it in a subshell would make it return one PID to tilt but I might be completely wrong there.

This works totally fine in v0.18.x which is what has me thinking it is a tilt bug and not user-error, as we haven't changed this command in some time.

nicks commented 3 years ago

A huge amount has changed between v0.18 and v0.19 in how serve commands are run (we basically rewrote it). I wouldn't be surprised if there was a regression, but it's not the kind of regression we'd be able to narrow down easily.

Probably easiest way to address this is to find a repro case.

I tried the npm run dev & npm run build idiom with a sample webpack project and restarting worked ok... Will try some other repro cases tomorrow.

ZacxDev commented 3 years ago

I tried replacing the parathesis with brackets and that seems to have resolved it. From a quick look at the bash manual it looks like the main difference is the creation of a subshell: https://www.gnu.org/software/bash/manual/html_node/Command-Grouping.html

 local_resource(                                                                                                                                                                                                                
       'run: mimobl.com',                                                                                                                                                                                                           
     serve_cmd='cd ui/mimobl.com && { npm run dev & npm run build; }',                                                                                                                                                            
       deps = ['ui/mimobl.com/.env', 'ui/mimobl.com/package.json'],                                                                                                                                                                 
       readiness_probe=probe(                                                                                                                                                                                                       
         period_secs=15,                                                                                                                                                                                                            
        http_get=http_get_action(port=9000, path="/")                                                                                                                                                                              
       ),                                                                                                                                                                                                                           
       links=['http://localhost:9000'],                                                                                                                                                                                             
    )

I hope that helps you narrow down the bug, thanks again for your help!

nicks commented 3 years ago

Thanks for the update! will still keep trying to repro and log an isolated case if we can. thanks for reporting!