Closed: cypressf closed this issue 1 year ago
I checked svante3, which just had the error, to see which containers were running.
podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
65312102f305 k8s.gcr.io/pause:3.5 9 hours ago Exited (135) 8 minutes ago 0.0.0.0:8002->4000/tcp, 0.0.0.0:8000->8000/tcp be55b42f6b56-infra
af5ebd4ac574 docker.io/library/postgres:latest postgres 9 hours ago Exited (0) 9 hours ago 0.0.0.0:8002->4000/tcp, 0.0.0.0:8000->8000/tcp crm_db
c9adc6c1e396 localhost/climate_risk_map_backend:latest bash 9 hours ago Exited (135) 9 hours ago 0.0.0.0:8002->4000/tcp, 0.0.0.0:8000->8000/tcp crm_backend
crm_db
and
crm_backend
both exited 9 hours ago, which implies I'm only observing this "bug" now because the database had already shut down long before. Why would that be? It was shortly after the last deploy on svante3, which completed at Sat, 19 Feb 2022 20:05:14 GMT.
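As a sanity check, the GMT deploy-completion time can be converted into the host's local zone (assumed here to be America/New_York, since podman prints EST below) for direct comparison with podman's FinishedAt output. A minimal sketch, assuming GNU date:

```shell
# Deploy-completion timestamp as reported (RFC-2822-style, GMT).
deploy_done='Sat, 19 Feb 2022 20:05:14 GMT'

# Render it in the assumed host zone (America/New_York) for comparison
# with podman's local-time FinishedAt values.
TZ=America/New_York date -d "$deploy_done" '+%Y-%m-%d %H:%M:%S %Z'
# prints: 2022-02-19 15:05:14 EST
```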
Trying to see which container stopped first:
podman inspect -f '{{ .State.FinishedAt }}' crm_backend
2022-02-19 15:20:40.199326026 -0500 EST
podman inspect -f '{{ .State.FinishedAt }}' crm_db
2022-02-19 15:20:40.126328569 -0500 EST
Looks like the crm_db container stopped first: 15:20:40.126 < 15:20:40.199.
The deploy completed at 20:05:14 GMT, and the error occurred at 20:20:40 GMT, about 15 minutes after the deploy. I'm curious what happened 15 minutes after the deploy to shut it down.
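A quick sketch confirming both the stop ordering and the roughly 15-minute gap, using the timestamps copied from the inspect output above (the -0500 offset is EST, so 15:20:40 local is 20:20:40 UTC; GNU date assumed):

```shell
# Timestamps copied from the podman inspect output above.
db_finished='2022-02-19 15:20:40.126328569 -0500'
backend_finished='2022-02-19 15:20:40.199326026 -0500'
deploy_done='2022-02-19 20:05:14 UTC'

# Nanosecond-resolution epochs to compare stop ordering.
db_ns=$(date -d "$db_finished" +%s%N)
backend_ns=$(date -d "$backend_finished" +%s%N)
[ "$db_ns" -lt "$backend_ns" ] && echo "crm_db stopped first"

# Whole seconds between deploy completion and the shutdown.
echo "$(( $(date -d "$db_finished" +%s) - $(date -d "$deploy_done" +%s) ))s after deploy"
# prints: 926s after deploy  (15 minutes 26 seconds)
```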
It's strange that crm_db stopped with exit code 0. That implies postgres exited with no error. Why would it? We can see in the logs that it "received smart shutdown request." Why would it have received this?
podman logs crm_db
2022-02-19 20:20:40.093 UTC [1] LOG: received smart shutdown request
2022-02-19 20:20:40.093 UTC [48] FATAL: terminating connection due to administrator command
2022-02-19 20:20:40.093 UTC [40] FATAL: terminating connection due to administrator command
2022-02-19 20:20:40.093 UTC [43] FATAL: terminating connection due to administrator command
2022-02-19 20:20:40.094 UTC [41] FATAL: terminating connection due to administrator command
2022-02-19 20:20:40.094 UTC [42] FATAL: terminating connection due to administrator command
2022-02-19 20:20:40.094 UTC [47] FATAL: terminating connection due to administrator command
2022-02-19 20:20:40.096 UTC [1] LOG: background worker "logical replication launcher" (PID 39) exited with exit code 1
2022-02-19 20:20:40.101 UTC [34] LOG: shutting down
2022-02-19 20:20:40.121 UTC [1] LOG: database system is shut down
Strangely, the backend server logs show that it was killed via a SIGTERM, which is also a "clean" exit.
podman logs crm_backend
[2022-02-19T20:20:40Z INFO actix_server::builder] SIGTERM received, stopping
[2022-02-19T20:20:40Z INFO actix_server::worker] Shutting down worker, 0 connections
Both those logs point to a command being issued to shut down the pod or the containers therein.
For now I just ran
podman pod start crm_pod
and it looks like it's working fine.
The svante3 pod once again exited with no apparent error.
podman inspect -f '{{ .State.StartedAt }} {{ .State.FinishedAt }}' crm_backend
2022-02-20 01:01:31.098286591 -0500 EST 2022-02-20 03:14:23.342879821 -0500 EST
podman logs crm_backend
[2022-02-20T08:14:23Z INFO actix_server::builder] SIGTERM received, stopping
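podman prints FinishedAt in the host's local zone, so converting it to UTC should line up with the actix log line. A minimal check, assuming GNU date:

```shell
# FinishedAt from podman inspect, printed in the host's local zone (EST).
finished='2022-02-20 03:14:23.342879821 -0500'

# Render it in UTC to compare with the actix_server SIGTERM log timestamp.
TZ=UTC date -d "$finished" '+%Y-%m-%dT%H:%M:%SZ'
# prints: 2022-02-20T08:14:23Z
```

which matches the 2022-02-20T08:14:23Z of the SIGTERM log line exactly, so the container finished at the moment the signal arrived.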
@mjbludwig and I met about this and he looked into the system logs. He found
Let me know if I'm missing anything.
Next steps, @mjbludwig investigates logs to see if there's anything specific happening with podman that we need to fix, or if this is potentially a bug in my application-level code.
I'm going to try rebuilding the containers on svante3 again.
podman pod rm crm_pod
podman image rm <all the existing image ids>
I got a failure removing one of the images ("image is in use by container"), and a
buildah ls -a
revealed a leftover buildah container, so I removed that like so:
buildah rm <container_id>
Then I rebuilt the pod:
cd /opt/climate_risk_map_builder
./crm_build_wrapper.sh
It completed with no errors.
I'm going to close this. I haven't seen it in a while. If it happens again I'll reopen.
The bug
https://github.com/cypressf/climate-risk-map/runs/5262666947?check_suite_focus=true#step:7:27
Specifically, it's happening on the
sqlx migrate
step. This has happened twice on svante3 and once on svante4. It's an especially bad deploy error because, when it happens, it breaks the currently running deploy, so the site is offline until it's manually fixed by rebuilding the containers.
The manual workaround fix