mit-jp / mit-climate-data-viz

Plotting climate data for the MIT Joint Program on the Science and Policy of Global Change
https://cypressf.shinyapps.io/eppa-dashboard/

svante3 deploy fails: container does not exist, timed out waiting for file #298

Closed: cypressf closed this issue 1 year ago

cypressf commented 1 year ago

https://github.com/cypressf/climate-risk-map/actions/runs/3708159249/jobs/6285444953#step:7:27

cypressf commented 1 year ago

@mjbludwig I'm still seeing this error when trying to run the deploy on svante3

======CMD======
podman pod start crm_pod && podman stop crm_backend && ln -snf ~/builds/2ef0475b598d0861d138c01f2c08c7274e2f1a84 ~/climate-risk-map && podman run --rm -v ~/climate-risk-map/backend:/opt/climate-risk-map/backend:Z --tz=America/New_York --env-file=$HOME/.env --pod=crm_pod sqlx database reset -y && podman start crm_backend

======END======
out: 84e5e2e8314f6a925402f7c5a357148c6def6c5882325a6b3aad5183ceefdc51
err: time="2022-12-15T16:42:38-05:00" level=error msg="container \"30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3\" does not exist"
err: Error: timed out waiting for file /tmp/podman-run-1004/libpod/tmp/exits/30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3: internal libpod error
2022/12/15 21:42:43 Process exited with status 125
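
To figure out what state the deploy left things in, here is a quick sketch of checks (using the container and pod names from the command above; just a starting point, not something that has been run yet):

# show crm_backend even if it exited or is in a broken state
podman ps -a --filter name=crm_backend
# check whether the pod itself is up
podman pod ps
# last output from the backend container, if podman still has a record of it
podman logs crm_backend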

cypressf commented 1 year ago

After seeing that deploy error, I ssh'd into svante3 and checked:

podman ps
CONTAINER ID  IMAGE                                    COMMAND     CREATED       STATUS           PORTS                                           NAMES
59061d5a907c  localhost/podman-pause:4.2.0-1669122264              20 hours ago  Up 20 hours ago  0.0.0.0:8000->8000/tcp, 0.0.0.0:8002->4000/tcp  84e5e2e8314f-infra
e4866c7a53b5  docker.io/library/postgres:14            postgres    20 hours ago  Up 20 hours ago  0.0.0.0:8000->8000/tcp, 0.0.0.0:8002->4000/tcp  crm_db

but even though crm_backend was not running, the web API was still responding to queries on svante3, so I checked for its process and noticed that its conmon process was still alive beyond the container:

ps aux | grep crm_backend
crm_web+  132570  0.0  0.0 143828  2204 ?        Ssl  Dec15   0:00 /usr/bin/conmon --api-version 1 -c 30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3 -u 30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3 -r /usr/bin/runc -b /opt/crm_home_dir/.local/share/containers/storage/overlay-containers/30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3/userdata -p /tmp/podman-run-1004/containers/overlay-containers/30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3/userdata/pidfile -n crm_backend --exit-dir /tmp/podman-run-1004/libpod/tmp/exits --full-attach -l k8s-file:/opt/crm_home_dir/.local/share/containers/storage/overlay-containers/30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3/userdata/ctr.log --log-level warning --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/tmp/podman-run-1004/containers/overlay-containers/30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3/userdata/oci-log --conmon-pidfile /tmp/podman-run-1004/containers/overlay-containers/30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /opt/crm_home_dir/.local/share/containers/storage --exit-command-arg --runroot --exit-command-arg /tmp/podman-run-1004/containers --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg cgroupfs --exit-command-arg --tmpdir --exit-command-arg /tmp/podman-run-1004/libpod/tmp --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg cni --exit-command-arg --volumepath --exit-command-arg /opt/crm_home_dir/.local/share/containers/storage/volumes --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg container --exit-command-arg cleanup --exit-command-arg 30b2e1d59ec15ef79a9d614496bdc2e11eff1e2970d1cd206f02d3f2a952bce3

So I killed it:

kill 132570
ps aux | grep crm_backend

Now I'm restarting the backend container process to see what happens.
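
For next time, a possible alternative to killing conmon by hand would be to let podman run its own cleanup for the container. This is only a sketch; I haven't tried it against this exact failure:

# ask podman to run its normal post-exit cleanup for crm_backend, if it still has a record of it
podman container cleanup crm_backend
# then check that the container record and the conmon process are consistent again
podman ps -a --filter name=crm_backend
ps aux | grep conmon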

cypressf commented 1 year ago

podman pod start crm_pod
podman ps
59061d5a907c  localhost/podman-pause:4.2.0-1669122264                20 hours ago  Up 20 hours ago   0.0.0.0:8000->8000/tcp, 0.0.0.0:8002->4000/tcp  84e5e2e8314f-infra
e4866c7a53b5  docker.io/library/postgres:14              postgres    20 hours ago  Up 20 hours ago   0.0.0.0:8000->8000/tcp, 0.0.0.0:8002->4000/tcp  crm_db
30b2e1d59ec1  localhost/climate_risk_map_backend:latest  bash        20 hours ago  Up 9 seconds ago  0.0.0.0:8000->8000/tcp, 0.0.0.0:8002->4000/tcp  crm_backend
podman stop crm_backend
ln -snf ~/builds/2ef0475b598d0861d138c01f2c08c7274e2f1a84 ~/climate-risk-map
podman run --rm -v ~/climate-risk-map/backend:/opt/climate-risk-map/backend:Z --tz=America/New_York --env-file=$HOME/.env --pod=crm_pod sqlx database reset -y

Those all ran successfully, but I was kicked off the host before I could run

podman start crm_backend

Perhaps the server is undergoing maintenance or something.
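
If getting kicked off mid-deploy turns out to be a recurring problem, one option (just a sketch, assuming nohup is available on svante3) would be to run the remaining step detached so it survives a disconnect:

# run the final step detached so a dropped SSH session doesn't interrupt it
# (~/crm_backend_start.log is an arbitrary log location)
nohup podman start crm_backend > ~/crm_backend_start.log 2>&1 &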

cypressf commented 1 year ago

Actually @mjbludwig, I can't seem to connect to svante3 anymore. Let me know if it's down.