mit-jp / mit-climate-data-viz

Plotting climate data for the MIT Joint Program on the Science and Policy of Global Change
https://cypressf.shinyapps.io/eppa-dashboard/

on deploy, Error communicating with database: Connection refused (os error 111) #263

Closed. cypressf closed this issue 1 year ago.

cypressf commented 2 years ago

The bug

https://github.com/cypressf/climate-risk-map/runs/5262666947?check_suite_focus=true#step:7:27

======CMD======
podman stop crm_backend && ln -snf ~/builds/696a6b8d3902f3e88af05c1f999d628648d31824 ~/climate-risk-map && podman run --rm -v ~/climate-risk-map/backend:/opt/climate-risk-map/backend:Z --tz=America/New_York --env-file=$HOME/.env --pod=crm_pod sqlx database reset -y && podman start crm_backend
======END======
out: crm_backend
out: error: error communicating with database: Connection refused (os error 111)
2022/02/20 04:43:46 Process exited with status 1

Specifically, the failure happens on the sqlx migration step:

podman run --rm -v ~/climate-risk-map/backend:/opt/climate-risk-map/backend:Z --tz=America/New_York --env-file=$HOME/.env --pod=crm_pod sqlx migrate run
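One mitigation worth sketching (an assumption, not something in the current deploy script): retry the migration a few times before failing the deploy, in case the database is briefly unreachable.

```bash
#!/usr/bin/env bash
# Hedged sketch: retry wrapper for the migration step. The podman invocation
# matches the failing command above; the retry loop itself is an assumption.
migrate() {
  podman run --rm \
    -v ~/climate-risk-map/backend:/opt/climate-risk-map/backend:Z \
    --tz=America/New_York --env-file="$HOME/.env" --pod=crm_pod \
    sqlx migrate run
}

for attempt in 1 2 3 4 5; do
  migrate && exit 0                                  # success: stop retrying
  echo "migrate attempt $attempt failed; retrying in 5s" >&2
  sleep 5
done
exit 1                                               # all five attempts failed
```

A retry would only paper over a transient startup race, though; it wouldn't help if crm_db is genuinely stopped, which is what the investigation below suggests.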

This has happened twice on svante3 and once on svante4. It's an especially bad deploy error because, when it happens, it breaks the currently running deployment, and the site stays offline until someone manually fixes it by rebuilding the containers.

The manual workaround

podman pod stop crm_pod
podman pod rm crm_pod
podman container list -a # verify no containers were left lying around
podman image rm <image_id> # remove the existing image ids, in case something bad was cached
cd /opt/climate_risk_map_builder/
./crm_build_wrapper.sh # rebuild the images and start the pod
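For reference, the same recovery steps as a single script (a sketch: `podman image prune -af` stands in for removing image ids by hand; everything else mirrors the commands above):

```bash
#!/usr/bin/env bash
# Sketch of the manual recovery above as one script.
set -euo pipefail

podman pod stop crm_pod
podman pod rm crm_pod
podman container list -a    # verify no containers were left lying around
podman image prune -af      # drop all unused images in case of a bad cache
cd /opt/climate_risk_map_builder/
./crm_build_wrapper.sh      # rebuild the images and start the pod
```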
cypressf commented 2 years ago

I checked svante3, which just had the error, to see what containers were running.

podman ps -a
CONTAINER ID  IMAGE                                      COMMAND     CREATED      STATUS                      PORTS                                           NAMES
65312102f305  k8s.gcr.io/pause:3.5                                   9 hours ago  Exited (135) 8 minutes ago  0.0.0.0:8002->4000/tcp, 0.0.0.0:8000->8000/tcp  be55b42f6b56-infra
af5ebd4ac574  docker.io/library/postgres:latest          postgres    9 hours ago  Exited (0) 9 hours ago      0.0.0.0:8002->4000/tcp, 0.0.0.0:8000->8000/tcp  crm_db
c9adc6c1e396  localhost/climate_risk_map_backend:latest  bash        9 hours ago  Exited (135) 9 hours ago    0.0.0.0:8002->4000/tcp, 0.0.0.0:8000->8000/tcp  crm_backend

crm_db and crm_backend both exited 9 hours ago, which implies the "bug" I'm observing now is only a symptom: the database had already shut down long before this deploy ran. Why would that be? The shutdown came shortly after the last deploy on svante3, which completed at Sat, 19 Feb 2022 20:05:14 GMT.

cypressf commented 2 years ago

Trying to see which container stopped first:

podman inspect -f '{{ .State.FinishedAt }}' crm_backend
2022-02-19 15:20:40.199326026 -0500 EST
podman inspect -f '{{ .State.FinishedAt }}' crm_db
2022-02-19 15:20:40.126328569 -0500 EST

Looks like crm_db stopped first: 15:20:40.126 < 15:20:40.199.
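For convenience, both timestamps can be pulled in one call; `.Name` and `.State.FinishedAt` are standard podman inspect fields:

```bash
# Inspect both containers at once and print name + finish time.
podman inspect -f '{{ .Name }} {{ .State.FinishedAt }}' crm_db crm_backend
```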

cypressf commented 2 years ago

The deploy completed at 20:05:14 GMT, and the containers shut down at 20:20:40 GMT, about 15 minutes later. I'm curious what happened 15 minutes after the deploy to shut them down.

cypressf commented 2 years ago

It's strange that crm_db stopped with exit code 0. That implies postgres exited with no error. Why would it?

We can see in the logs that it "received smart shutdown request." Why would it have received one?

podman logs crm_db
2022-02-19 20:20:40.093 UTC [1] LOG:  received smart shutdown request
2022-02-19 20:20:40.093 UTC [48] FATAL:  terminating connection due to administrator command
2022-02-19 20:20:40.093 UTC [40] FATAL:  terminating connection due to administrator command
2022-02-19 20:20:40.093 UTC [43] FATAL:  terminating connection due to administrator command
2022-02-19 20:20:40.094 UTC [41] FATAL:  terminating connection due to administrator command
2022-02-19 20:20:40.094 UTC [42] FATAL:  terminating connection due to administrator command
2022-02-19 20:20:40.094 UTC [47] FATAL:  terminating connection due to administrator command
2022-02-19 20:20:40.096 UTC [1] LOG:  background worker "logical replication launcher" (PID 39) exited with exit code 1
2022-02-19 20:20:40.101 UTC [34] LOG:  shutting down
2022-02-19 20:20:40.121 UTC [1] LOG:  database system is shut down

Strangely, the backend logs show that it was stopped via a SIGTERM, which is also a "clean" shutdown.

podman logs crm_backend
[2022-02-19T20:20:40Z INFO  actix_server::builder] SIGTERM received, stopping
[2022-02-19T20:20:40Z INFO  actix_server::worker] Shutting down worker, 0 connections

Both those logs point to a command being issued to shut down the pod or the containers therein.
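To figure out who issued that command, the system journal and podman's own event log around the shutdown time (15:20 local, 20:20 UTC) seem like the places to look. A diagnostic sketch, assuming the journal on these hosts retains entries that far back:

```bash
# Search the system journal around the shutdown for podman/conmon activity,
# OOM kills, or user-session teardown (all hypotheses, not established facts).
journalctl --since "2022-02-19 15:15:00" --until "2022-02-19 15:25:00" \
  | grep -iE 'podman|conmon|crm_|oom|session'

# podman keeps its own event log of container lifecycle changes:
podman events --since "2022-02-19 15:15:00" --until "2022-02-19 15:25:00"
```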

cypressf commented 2 years ago

For now I just ran podman pod start crm_pod, and it looks like it's working fine.

cypressf commented 2 years ago

The svante3 pod once again exited with no apparent error:

podman inspect -f '{{ .State.StartedAt }} {{ .State.FinishedAt }}' crm_backend
2022-02-20 01:01:31.098286591 -0500 EST
2022-02-20 03:14:23.342879821 -0500 EST
cypressf commented 2 years ago

podman logs crm_backend
[2022-02-20T08:14:23Z INFO  actix_server::builder] SIGTERM received, stopping
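Until the root cause is found, one way to at least auto-recover from these SIGTERMs would be to run the pod under systemd user units. A sketch, assuming rootless podman under a user account (none of this is in the current setup):

```bash
# Sketch: generate systemd user units so the pod restarts when it dies.
podman generate systemd --files --name crm_pod   # writes pod-crm_pod.service etc.
mkdir -p ~/.config/systemd/user
mv ./*.service ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user enable --now pod-crm_pod.service
loginctl enable-linger "$USER"   # keep user services running after logout
```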
cypressf commented 2 years ago

@mjbludwig and I met about this, and he looked into the system logs. He found: [findings not preserved in this thread]

Let me know if I'm missing anything.

Next steps: @mjbludwig investigates the logs to see if there's anything podman-specific we need to fix, or whether this is potentially a bug in my application-level code.

cypressf commented 2 years ago

I'm trying to rebuild the containers on svante3 again.

podman pod rm crm_pod
podman image rm <all the existing image ids>

I got a failure removing one of the images ("image is in use by container"), and a buildah ls -a revealed a leftover buildah container, so I removed it like so:

buildah rm <container_id>

Then I rebuilt the pod:

cd /opt/climate_risk_map_builder
./crm_build_wrapper.sh

It completed with no errors.

cypressf commented 1 year ago

I'm going to close this. I haven't seen it in a while. If it happens again I'll reopen.