puckel / docker-airflow

Docker Apache Airflow

webserver stops in between a DAG run with "some workers seem to have died and gunicorn did not restart them as expected" error #125

Open tejasj654 opened 6 years ago

tejasj654 commented 6 years ago

I took docker-airflow as-is and made two modifications.

  1. I needed to install Airflow with MySQL support, so in the Dockerfile I added libmysqlclient-dev to buildDeps and the mysql extra to the apache-airflow install (see the sketch after this list).
  2. Changed the volume in docker-compose.yml to match the custom path where my DAGs are located. (Using CeleryExecutor.)
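
For illustration, a minimal sketch of an equivalent change is to extend the published image rather than edit the upstream Dockerfile; the base tag, the Debian package name, and the pip extra below are assumptions and may need adjusting:

# Sketch only; ideally pin apache-airflow[mysql] to the Airflow version
# already installed in the base image.
FROM puckel/docker-airflow:latest

USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential libmysqlclient-dev \
    && pip install --no-cache-dir "apache-airflow[mysql]" \
    && rm -rf /var/lib/apt/lists/*
USER airflow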

After docker-compose up -d, the webserver works fine for a few minutes. When I trigger a DAG manually, before the tasks are complete, the webserver crashes with the following error.

[1] {cli.py:723} ERROR - [0 / 0] some workers seem to have died and gunicorndid not restart them as expected

After this, I am unable to access the webserver in my browser. The DAG I triggered contains 7 tasks running in parallel.

I will try with a tutorial DAG to see if I can reproduce the issue.

Docker version: 17.06.0-ce-mac19
OS: OS X El Capitan 10.11.3 (15D21)

ajhodges commented 6 years ago

I think the root cause of your issue is a problem with either your MySQL server itself or your configuration. I also wanted to add that there is an issue in Airflow where the webserver does not fully die if the gunicorn master process gets killed, so Docker doesn't know to restart the webserver even when it's down. There is an open ticket for this: https://issues.apache.org/jira/browse/AIRFLOW-1235

A possible workaround on the docker-airflow side would be to poll for the existence of /usr/local/airflow/airflow-webserver.pid and kill the container if that file ceases to exist; Docker Compose/Swarm would then restart it. This could probably be implemented in entrypoint.sh.
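
For reference, a minimal sketch of that idea for the end of entrypoint.sh, assuming the pid file path above and that the webserver is the container's main process (this is not the actual script from this repository):

# Sketch only: run the webserver in the background and poll for the pid file;
# if it disappears, exit so docker compose/swarm restarts the container.
airflow webserver &
webserver_pid=$!
pid_file=/usr/local/airflow/airflow-webserver.pid

sleep 30   # give the webserver time to create the pid file
while kill -0 "$webserver_pid" 2>/dev/null; do
    if [ ! -f "$pid_file" ]; then
        echo "webserver pid file is gone, exiting so the container can be restarted"
        kill "$webserver_pid" 2>/dev/null || true
        exit 1
    fi
    sleep 10
done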

ghost commented 6 years ago

Thanks. I will make use of this and test it out.

aymericbeaumet commented 6 years ago

We faced the same error. It turned out to be due to resource exhaustion (OOM) when launching the Docker image in a Nomad cluster.

rodmaz commented 5 years ago

We are trying to set up Airflow in a Docker container running on AWS ECS. The same container runs correctly locally but fails to start the webserver on AWS ECS. This is with the most basic configuration (SQLite, SequentialExecutor, etc.).

[2018-09-19 19:42:38,457] {{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
[2018-09-19 19:42:38,649] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:42:38,652] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead 
[2018-09-19 19:42:38,671] {{__init__.py:51}} INFO - Using executor SequentialExecutor 
[2018-09-19 19:42:38,671] {{__init__.py:51}} INFO - Using executor SequentialExecutor 
[2018-09-19 19:42:38,749] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:42:38,752] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:43:56,724] {{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
[2018-09-19 19:44:42 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:18)
[2018-09-19 19:46:10 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:19)
[2018-09-19 19:46:26 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:16)
[2018-09-19 19:46:37 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:17)
[2018-09-19 19:46:48,701] {{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
[2018-09-19 19:46:48 +0000] [16] [INFO] Worker exiting (pid: 16)
[2018-09-19 19:46:48 +0000] [17] [INFO] Worker exiting (pid: 17)
[2018-09-19 19:46:48,731] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 17)
[2018-09-19 19:46:48,730] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 16)
[2018-09-19 19:46:48 +0000] [20] [INFO] Booting worker with pid: 20
[2018-09-19 19:46:48 +0000] [21] [INFO] Booting worker with pid: 21
[2018-09-19 19:46:48 +0000] [22] [INFO] Booting worker with pid: 22
[2018-09-19 19:46:48 +0000] [23] [INFO] Booting worker with pid: 23
[2018-09-19 19:46:49,410] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:46:49,410] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:46:49,410] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:46:49,410] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:46:49,585] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:46:49,584] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:46:49,584] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:46:49,585] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:46:49,781] {{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
[2018-09-19 19:47:42,011] {{cli.py:717}} DEBUG - [0 / 2] some workers are starting up, waiting...
[2018-09-19 19:47:43,084] {{cli.py:748}} ERROR - [0 / 0] some workers seem to have died and gunicorndid not restart them as expected
[2018-09-19 19:47:54,237] {{cli.py:692}} DEBUG - [0 / 0] doing a refresh of 1 workers
[2018-09-19 19:48:00,495] {{models.py:258}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2018-09-19 19:48:00,593] {{models.py:352}} DEBUG - Importing /usr/local/airflow/dags/tuto.py
[2018-09-19 19:48:00,715] {{models.py:491}} DEBUG - Loaded DAG <DAG: tutorial>
[2018-09-19 19:48:17,956] {{models.py:258}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2018-09-19 19:48:17,955] {{models.py:258}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2018-09-19 19:48:17,971] {{models.py:352}} DEBUG - Importing /usr/local/airflow/dags/tuto.py
[2018-09-19 19:48:17,972] {{models.py:352}} DEBUG - Importing /usr/local/airflow/dags/tuto.py
[2018-09-19 19:48:17,991] {{models.py:491}} DEBUG - Loaded DAG <DAG: tutorial>
[2018-09-19 19:48:17,991] {{models.py:491}} DEBUG - Loaded DAG <DAG: tutorial>
[2018-09-19 19:48:18 +0000] [23] [INFO] Parent changed, shutting down: <Worker 23>
[2018-09-19 19:48:18 +0000] [23] [INFO] Worker exiting (pid: 23)
[2018-09-19 19:48:18,395] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 23)
170.231.45.82 - - [19/Sep/2018:19:48:18 +0000] "GET / HTTP/1.1" 302 221 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15"
[2018-09-19 19:48:18 +0000] [20] [INFO] Parent changed, shutting down: <Worker 20>
[2018-09-19 19:48:18 +0000] [20] [INFO] Worker exiting (pid: 20)
[2018-09-19 19:48:18,423] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 20)
[2018-09-19 19:49:54,517] {{cli.py:755}} ERROR - No response from gunicorn master within 120 seconds
[2018-09-19 19:49:54,553] {{cli.py:756}} ERROR - Shutting down webserver
[2018-09-19 19:49:54,579] {{cli_action_loggers.py:81}} DEBUG - Calling callbacks: []
[2018-09-19 19:49:54,616] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 1)

Any ideas what could be behind this? It seems the gunicorn master does not respond and/or the workers are timing out. My first thought was something wrong with the Docker container's network settings, but they look fine to me:

        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "8842dc8b5083781dfedbece65c670036c744811b82d738a696ec50b3202cd065",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {
                "5555/tcp": null,
                "8080/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "8080"
                    }
                ],
                "8793/tcp": null
            },
            "SandboxKey": "/var/run/docker/netns/8842dc8b5083",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "33181fcb6dc2569d9d1ccf312ff568b27a2839e5fda7b802429f785e18dcf76f",
            "Gateway": "172.17.0.1",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "172.17.0.2",
            "IPPrefixLen": 16,
            "IPv6Gateway": "",
            "MacAddress": "02:42:ac:11:00:02",
            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "af87fa2be368c96fe2336c11d3625d2fd746971181e4c5309d2d016cdb551e7a",
                    "EndpointID": "33181fcb6dc2569d9d1ccf312ff568b27a2839e5fda7b802429f785e18dcf76f",
                    "Gateway": "172.17.0.1",
                    "IPAddress": "172.17.0.2",
                    "IPPrefixLen": 16,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:ac:11:00:02",
                    "DriverOpts": null
                }
            }
        }

kharmabum commented 4 years ago

@aymericbeaumet thank you for sharing; increasing the memory available in my cluster rectified the issue for me.

dinigo commented 4 years ago

Also, decreasing the number of gunicorn workers from the default of 4 to 1 (for local development) worked for me. In airflow.cfg, set the following:

[webserver]
workers = 1

Or set the equivalent environment variable:

export AIRFLOW__WEBSERVER__WORKERS=1

vijayakumarpattanashetti commented 3 years ago

Starting minikube with more memory worked for me on a Mac.

1) First delete the existing minikube VM: minikube delete

2) Then start minikube with more memory: minikube start --driver='<driver_name>' --memory=4000m

Note: Run minikube start --help and check the --memory flag to learn more about the units used for memory.