marinebon / mbon-dashboard-server

server software for MBON early alert dashboard using Docker
airflow worker restarts every few hours, no jobs get done #42

Open 7yl4r opened 1 year ago

7yl4r commented 1 year ago

I saw this issue in the docker logs.

I brought down all airflow-related containers (but left grafana and influx up so the existing data isn't affected).

Then brought them back up w/ docker compose up --build -d.

Jobs appear to be completing now. Will check on the data tomorrow.

7yl4r commented 1 year ago

Seeing jobs fail with status "Not Yet started" in the airflow web GUI. Trying to get the task logfile also gives a weird error.

reset command:

docker container restart mbon-dashboard-server-airflow-worker-1 mbon-dashboard-server-airflow-webserver-1 mbon-dashboard-server-airflow-scheduler-1 mbon-dashboard-server-flower-1 mbon-dashboard-server-redis-1 mbon-dashboard-server-postgres-1

After doing this the jobs are working again.

7yl4r commented 1 year ago

This is an ongoing issue. When trying to view a job log in the airflow web GUI:

*** Log file does not exist: /opt/airflow//logs/ts_ingest/ingest_sat_roi_fgb_MODA_chlor_a_SS1/2023-05-20T00:00:00+00:00/1.log
*** Fetching from: http://:8793/log/ts_ingest/ingest_sat_roi_fgb_MODA_chlor_a_SS1/2023-05-20T00:00:00+00:00/1.log
*** Failed to fetch log file from worker. The request to ':///' is missing either an 'http://' or 'https://' protocol.
7yl4r commented 1 year ago

seeing the same issue on fknms board now

7yl4r commented 1 year ago

Trying to restart one container at a time to narrow down where the issue might be. After restarting each container I wait ~15min, then clear a DAG and observe the tasks.

container name                               time waited (hh:mm)   status
mbon-dashboard-server-airflow-worker-1       00:15                 no change
mbon-dashboard-server-airflow-scheduler-1    04:00                 no change
mbon-dashboard-server-airflow-webserver-1    00:10                 no change
mbon-dashboard-server-redis-1                00:15                 working again
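
The one-container-at-a-time procedure above could be scripted roughly as follows. This is only a sketch: bisect_restarts, DOCKER, and WAIT_SECONDS are hypothetical names made up for this example, and the check itself (clearing a DAG and watching the tasks) is still done by hand in the airflow web GUI.

```shell
# Sketch: restart containers one at a time and stop at the first restart
# that brings the tasks back. DOCKER and WAIT_SECONDS are hypothetical
# overrides (not part of this repo); defaults are "docker" and ~15 minutes.
bisect_restarts() {
    for name in "$@"; do
        echo "restarting $name"
        "${DOCKER:-docker}" container restart "$name" || return 1
        sleep "${WAIT_SECONDS:-900}"
        # Manual step: clear a DAG in the web GUI, observe the tasks, then answer.
        printf 'tasks working after restarting %s? [y/N] ' "$name" >&2
        read -r answer
        if [ "$answer" = "y" ]; then
            echo "recovered after: $name"
            return 0
        fi
    done
    echo "no single restart fixed it"
    return 2
}
```

Example call, using the container names from the table:

bisect_restarts mbon-dashboard-server-airflow-worker-1 mbon-dashboard-server-airflow-scheduler-1 mbon-dashboard-server-airflow-webserver-1 mbon-dashboard-server-redis-1
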
7yl4r commented 1 year ago

From docker logs on the redis container:

* Connecting to MASTER 194.38.20.196:8886
* MASTER <-> REPLICA sync started
# Error condition on socket for SYNC: Connection refused

related SO Q
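
The log lines above show the redis container trying to sync as a replica of 194.38.20.196:8886. To check quickly whether the other boards show the same thing, the target addresses can be pulled out of the container logs. A minimal sketch: redis_master_targets is a hypothetical helper name, and it assumes only the "Connecting to MASTER host:port" line format shown above.

```shell
# Sketch: read Redis log text on stdin and print each unique host:port
# that the log reports connecting to as MASTER.
# redis_master_targets is a hypothetical helper, not part of this repo.
redis_master_targets() {
    sed -n 's/.*Connecting to MASTER \([^ ]*\).*/\1/p' | sort -u
}
```

Possible use: docker logs mbon-dashboard-server-redis-1 2>&1 | redis_master_targets

Running docker exec mbon-dashboard-server-redis-1 redis-cli info replication should also report the current role and master_host, assuming redis-cli is available inside that image.
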

7yl4r commented 1 year ago

restarting the fknms board to see if 9c8910b actually fixed it:

tylarmurray@fknms-dashboard-04:~/mbon-dashboard-server$ docker compose down --volumes --rmi all && docker compose up airflow-init && sudo chmod -R 777 airflow/ influxdb/ grafana/ postgres/ && docker compose up airflow-init && docker compose up --build -d
7yl4r commented 1 year ago

doing the same for fgbnms:

tylarmurray@fgbnms-dashboard-02:~/mbon-dashboard-server$ docker compose down --volumes --rmi all && docker compose up airflow-init && sudo chmod -R 777 airflow/ influxdb/ grafana/ postgres/ && docker compose up airflow-init && docker compose up --build -d