mit-jp / mit-climate-data-viz

Plotting climate data for the MIT Joint Program on the Science and Policy of Global Change
https://cypressf.shinyapps.io/eppa-dashboard/

svante3 deploy error: "crm_db" is being accessed by other users #283

Closed cypressf closed 2 years ago

cypressf commented 2 years ago

https://github.com/cypressf/climate-risk-map/runs/7345060227?check_suite_focus=true#step:7:29

cypressf commented 2 years ago

I'm not sure why this is happening. I stop the web service before running the database migration, so I wouldn't expect the web service to be using "crm_db". Could some stuck process still be using crm_db even though the crm_backend pod has been stopped? @mjbludwig
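One way to check for a lingering connection is to ask the database which sessions are attached to crm_db. The wording of the error matches PostgreSQL, so this minimal sketch assumes crm_db is a PostgreSQL database; `sessions_sql` is a hypothetical helper name, and the `psql` connection options in the usage comment are assumptions that would need adjusting for svante3.

```shell
#!/bin/sh
# Hypothetical helper: print a query that lists the sessions connected to
# one database. Assumes PostgreSQL; pg_stat_activity is its built-in view
# of current sessions.
sessions_sql() {
    printf "SELECT pid, usename, application_name, client_addr, state FROM pg_stat_activity WHERE datname = '%s';\n" "$1"
}

# Usage sketch (connection options are assumptions; not executed here):
#   psql -d postgres -c "$(sessions_sql crm_db)"
```

Any row returned while the web service is supposedly stopped would point at the process that is still holding the database open.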

cypressf commented 2 years ago

Another strange thing is the crm_backend process still seems to be running even though the crm_backend pod has exited.

cypressf commented 2 years ago

I tried to kill everything that was running by following my previous steps here:

https://github.com/cypressf/mit-climate-data-viz/issues/263#issuecomment-1050032952

but the web service and database are still up: I can visit svante3.mit.edu and it works just fine. It's wild that the processes survived somehow!

cypressf commented 2 years ago

lsof -P -i TCP -s TCP:LISTEN
COMMAND       PID        USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
rootlessp 3707853 crm_website   10u  IPv6 14348449      0t0  TCP *:8000 (LISTEN)
rootlessp 3707853 crm_website   12u  IPv6 14348450      0t0  TCP *:8002 (LISTEN)

cypressf commented 2 years ago

kill 3707853
lsof -P -i TCP -s TCP:LISTEN
[no processes using ports]

cypressf commented 2 years ago

Using the kill above, I finally stopped the process that was serving the website API, and now the website no longer loads data. I'm concerned that some other process is still running, however, and want to kill that as well if possible.
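The lookup-and-kill step can be sketched as a small shell helper that pulls the listening PIDs out of the lsof output instead of copying them by hand. `listener_pids` is a hypothetical name, and the usage line assumes GNU xargs (for `-r`); treat it as a sketch rather than the deploy script's actual method.

```shell
#!/bin/sh
# Hypothetical helper: read `lsof -P -i TCP -s TCP:LISTEN` output on stdin
# and print each listening PID once, skipping the header row.
listener_pids() {
    awk 'NR > 1 && $NF == "(LISTEN)" { print $2 }' | sort -u
}

# Usage sketch (not executed here): send SIGTERM to every listener found.
# The -r flag (GNU xargs) skips kill entirely when no PID is printed.
#   lsof -P -i TCP -s TCP:LISTEN | listener_pids | xargs -r kill
```

Because both ports (8000 and 8002) belong to the same rootlessport process, the helper prints its PID only once.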

cypressf commented 2 years ago

Showing all remaining processes run by the crm_website user reveals a process called climate_risk_ma that looks a little suspicious. Why is the name climate_risk_ma, without the p at the end? Why is it still running without a pod?
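On the missing p: Linux keeps only the first 15 characters of a process's executable name in its comm field, which is what the `comm` column of `ps` prints. A binary named climate_risk_map (16 characters; the name is assumed from the question above) would therefore show up as climate_risk_ma. The sketch below mimics that limit; `comm_name` is a hypothetical helper, not part of ps.

```shell
#!/bin/sh
# Hypothetical helper that mimics the kernel's 15-character limit on the
# comm field by truncating its argument with printf's precision.
comm_name() {
    printf '%.15s\n' "$1"
}

comm_name climate_risk_map   # assumed binary name; prints climate_risk_ma

# To see untruncated command lines, ask ps for args instead of comm:
#   ps -u crm_website -o pid,stat,start,time,args
```

This only explains the name. It does not explain why the process outlived its pod.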

ps -u crm_website xo pid,stat,start,time,comm
    PID STAT  STARTED     TIME COMMAND
1499691 Ss   13:38:47 00:00:00 systemd
1499693 S    13:38:47 00:00:00 (sd-pam)
1499699 S    13:38:47 00:00:00 sshd
1499700 Ssl  13:38:47 00:00:00 fish
1499834 Ss   13:38:52 00:00:00 dbus-daemon
1533173 S    14:17:14 00:00:00 sshd
1533174 Ss+  14:17:14 00:00:00 fish
1536937 R+   14:25:24 00:00:00 ps
3705516 S      Jun 24 00:00:00 catatonit
3707846 Ss     Jun 24 00:00:00 fuse-overlayfs
3707848 S      Jun 24 00:00:08 slirp4netns
3707868 Ssl    Jun 24 00:00:00 conmon
3707881 Ss     Jun 24 00:00:00 catatonit
3707895 Ss     Jun 24 00:00:00 fuse-overlayfs
3707898 Ssl    Jun 24 00:00:00 conmon
3750852 Ss     Jun 24 00:00:00 fuse-overlayfs
3750861 Ssl    Jun 24 00:00:00 conmon
3750874 Ssl    Jun 24 00:27:24 climate_risk_ma

cypressf commented 2 years ago

I killed it

pkill climate_risk_ma
ps -u crm_website xo pid,stat,start,time,comm
    PID STAT  STARTED     TIME COMMAND
1499691 Ss   13:38:47 00:00:00 systemd
1499693 S    13:38:47 00:00:00 (sd-pam)
1499699 S    13:38:47 00:00:00 sshd
1499700 Ssl  13:38:47 00:00:00 fish
1499834 Ss   13:38:52 00:00:00 dbus-daemon
1533173 S    14:17:14 00:00:00 sshd
1533174 Ss+  14:17:14 00:00:00 fish
1539087 R+   14:26:57 00:00:00 ps
3705516 S      Jun 24 00:00:00 catatonit
3707846 Ss     Jun 24 00:00:00 fuse-overlayfs
3707848 S      Jun 24 00:00:08 slirp4netns
3707868 Ssl    Jun 24 00:00:00 conmon
3707881 Ss     Jun 24 00:00:00 catatonit
3707895 Ss     Jun 24 00:00:00 fuse-overlayfs
3707898 Ssl    Jun 24 00:00:00 conmon

cypressf commented 2 years ago

I rebuilt the containers and started the pod using the script.

cd /opt/climate_risk_map_builder
./crm_build_wrapper.sh

It ran with no errors, and the pod, including the web service, is running successfully on svante3 again.

cypressf commented 2 years ago

https://github.com/cypressf/climate-risk-map/runs/7362210004?check_suite_focus=true

I reran the "deploy to development" GitHub job, and it worked this time, with no "crm_db" is being accessed by other users error. Closing this issue as fixed for now. We can reinvestigate if the error comes up again.