Open · tejasj654 opened this issue 6 years ago
I think the root cause here is a problem with either your MySQL server itself or your configuration. But I wanted to add that there is a known issue where the Airflow webserver does not fully die if the gunicorn master process gets killed, so Docker doesn't know to restart the webserver even when it's down. There is an open ticket for this issue: https://issues.apache.org/jira/browse/AIRFLOW-1235
A possible workaround on the docker-airflow side would be to poll for the existence of /usr/local/airflow/airflow-webserver.pid and to kill the container if that file ceases to exist. Then it would be restarted by docker compose/swarm. I think this change could possibly be implemented in entrypoint.sh.
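A minimal sketch of that entrypoint.sh watchdog idea, assuming the PID file path above and that Airflow removes the file when the gunicorn master exits (the function and variable names are mine, not from docker-airflow):

```shell
#!/usr/bin/env bash
# Hypothetical watchdog for entrypoint.sh: exit the container when the
# webserver's PID file disappears or its PID is no longer running, so
# docker compose/swarm restarts the container. Paths are assumptions.
PID_FILE="${AIRFLOW_HOME:-/usr/local/airflow}/airflow-webserver.pid"

webserver_alive() {
  # Healthy only while the PID file exists and the recorded PID
  # still maps to a live process.
  [ -f "$PID_FILE" ] && kill -0 "$(cat "$PID_FILE")" 2>/dev/null
}

watchdog() {
  while webserver_alive; do
    sleep 10
  done
  echo "gunicorn master gone, exiting so Docker restarts the container" >&2
  exit 1
}
```

In entrypoint.sh this would run in the background (`watchdog &`) after starting the webserver.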
Thanks. I will make use of this and test it out.
We faced the same error. It was due to resource exhaustion (OOM) when launching the Docker image in a Nomad cluster.
We are trying to set up Airflow in a Docker container running on AWS ECS. The same container runs correctly locally but fails to start the webserver on AWS ECS. This is with an absolutely basic configuration (SQLite, SequentialExecutor, etc.).
[2018-09-19 19:42:38,457] {{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
[2018-09-19 19:42:38,649] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:42:38,652] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:42:38,671] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:42:38,671] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:42:38,749] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:42:38,752] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:43:56,724] {{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
[2018-09-19 19:44:42 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:18)
[2018-09-19 19:46:10 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:19)
[2018-09-19 19:46:26 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:16)
[2018-09-19 19:46:37 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:17)
[2018-09-19 19:46:48,701] {{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
[2018-09-19 19:46:48 +0000] [16] [INFO] Worker exiting (pid: 16)
[2018-09-19 19:46:48 +0000] [17] [INFO] Worker exiting (pid: 17)
[2018-09-19 19:46:48,731] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 17)
[2018-09-19 19:46:48,730] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 16)
[2018-09-19 19:46:48 +0000] [20] [INFO] Booting worker with pid: 20
[2018-09-19 19:46:48 +0000] [21] [INFO] Booting worker with pid: 21
[2018-09-19 19:46:48 +0000] [22] [INFO] Booting worker with pid: 22
[2018-09-19 19:46:48 +0000] [23] [INFO] Booting worker with pid: 23
[2018-09-19 19:46:49,410] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:46:49,410] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:46:49,410] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:46:49,410] {{__init__.py:51}} INFO - Using executor SequentialExecutor
[2018-09-19 19:46:49,585] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:46:49,584] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:46:49,584] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:46:49,585] {{logging_config.py:62}} DEBUG - Unable to load custom logging, using default config instead
[2018-09-19 19:46:49,781] {{cli.py:717}} DEBUG - [0 / 4] some workers are starting up, waiting...
[2018-09-19 19:47:42,011] {{cli.py:717}} DEBUG - [0 / 2] some workers are starting up, waiting...
[2018-09-19 19:47:43,084] {{cli.py:748}} ERROR - [0 / 0] some workers seem to have died and gunicorndid not restart them as expected
[2018-09-19 19:47:54,237] {{cli.py:692}} DEBUG - [0 / 0] doing a refresh of 1 workers
[2018-09-19 19:48:00,495] {{models.py:258}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2018-09-19 19:48:00,593] {{models.py:352}} DEBUG - Importing /usr/local/airflow/dags/tuto.py
[2018-09-19 19:48:00,715] {{models.py:491}} DEBUG - Loaded DAG <DAG: tutorial>
[2018-09-19 19:48:17,956] {{models.py:258}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2018-09-19 19:48:17,955] {{models.py:258}} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2018-09-19 19:48:17,971] {{models.py:352}} DEBUG - Importing /usr/local/airflow/dags/tuto.py
[2018-09-19 19:48:17,972] {{models.py:352}} DEBUG - Importing /usr/local/airflow/dags/tuto.py
[2018-09-19 19:48:17,991] {{models.py:491}} DEBUG - Loaded DAG <DAG: tutorial>
[2018-09-19 19:48:17,991] {{models.py:491}} DEBUG - Loaded DAG <DAG: tutorial>
[2018-09-19 19:48:18 +0000] [23] [INFO] Parent changed, shutting down: <Worker 23>
[2018-09-19 19:48:18 +0000] [23] [INFO] Worker exiting (pid: 23)
[2018-09-19 19:48:18,395] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 23)
170.231.45.82 - - [19/Sep/2018:19:48:18 +0000] "GET / HTTP/1.1" 302 221 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15"
[2018-09-19 19:48:18 +0000] [20] [INFO] Parent changed, shutting down: <Worker 20>
[2018-09-19 19:48:18 +0000] [20] [INFO] Worker exiting (pid: 20)
[2018-09-19 19:48:18,423] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 20)
[2018-09-19 19:49:54,517] {{cli.py:755}} ERROR - No response from gunicorn master within 120 seconds
[2018-09-19 19:49:54,553] {{cli.py:756}} ERROR - Shutting down webserver
[2018-09-19 19:49:54,579] {{cli_action_loggers.py:81}} DEBUG - Calling callbacks: []
[2018-09-19 19:49:54,616] {{settings.py:188}} DEBUG - Disposing DB connection pool (PID 1)
Any ideas what could be behind this, please? It seems the gunicorn master does not respond and/or the workers are timing out. My first thought was something wrong with the Docker container's network settings, but they look fine as far as I can tell:
"NetworkSettings": {
"Bridge": "",
"SandboxID": "8842dc8b5083781dfedbece65c670036c744811b82d738a696ec50b3202cd065",
"HairpinMode": false,
"LinkLocalIPv6Address": "",
"LinkLocalIPv6PrefixLen": 0,
"Ports": {
"5555/tcp": null,
"8080/tcp": [
{
"HostIp": "0.0.0.0",
"HostPort": "8080"
}
],
"8793/tcp": null
},
"SandboxKey": "/var/run/docker/netns/8842dc8b5083",
"SecondaryIPAddresses": null,
"SecondaryIPv6Addresses": null,
"EndpointID": "33181fcb6dc2569d9d1ccf312ff568b27a2839e5fda7b802429f785e18dcf76f",
"Gateway": "172.17.0.1",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"IPAddress": "172.17.0.2",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"MacAddress": "02:42:ac:11:00:02",
"Networks": {
"bridge": {
"IPAMConfig": null,
"Links": null,
"Aliases": null,
"NetworkID": "af87fa2be368c96fe2336c11d3625d2fd746971181e4c5309d2d016cdb551e7a",
"EndpointID": "33181fcb6dc2569d9d1ccf312ff568b27a2839e5fda7b802429f785e18dcf76f",
"Gateway": "172.17.0.1",
"IPAddress": "172.17.0.2",
"IPPrefixLen": 16,
"IPv6Gateway": "",
"GlobalIPv6Address": "",
"GlobalIPv6PrefixLen": 0,
"MacAddress": "02:42:ac:11:00:02",
"DriverOpts": null
}
}
}
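Given the repeated WORKER TIMEOUT messages in the log above, one thing worth trying (a hedged suggestion, not a confirmed fix; the values below are illustrative, not recommendations) is raising Airflow's gunicorn timeout settings via environment variables:

```shell
# Illustrative sketch: loosen the webserver's gunicorn timeouts so slow
# workers are not killed and restarted in a loop. Values are examples only.
export AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300   # worker timeout (seconds)
export AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT=300   # master response timeout (seconds)
export AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=60      # seconds between worker refreshes
```

If the workers are being killed by the OOM killer rather than timing out on their own, these settings won't help and more memory is the fix.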
@aymericbeaumet thank you for sharing. Increasing the memory available in my cluster fixed the issue for me.
Also, decreasing the number of gunicorn workers from the default 4 to 1 (for local development) worked for me. In airflow.cfg, set the following:
[webserver]
workers = 1
Or use the equivalent environment variable:
export AIRFLOW__WEBSERVER__WORKERS=1
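If you'd rather bake the change into the config file than export an environment variable, a small sketch for scripting the edit (the function is mine; the sed assumes the only `workers = N` line in the file is the webserver one, which holds for the stock airflow.cfg of this era):

```shell
# Hypothetical helper: set the [webserver] worker count in an airflow.cfg.
set_workers() {
  local cfg="$1" count="$2"
  # Rewrite the "workers = N" line in place (GNU sed).
  sed -i "s/^workers = [0-9]\+$/workers = ${count}/" "$cfg"
}
```

Usage would be something like `set_workers /usr/local/airflow/airflow.cfg 1` before starting the webserver.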
Starting minikube with more memory worked for me on a Mac.
1) First delete the minikube VM:
minikube delete
2) Now start minikube with more memory:
minikube start --driver='<driver_name>' --memory=4000m
Note: Run minikube start --help and check out the --memory flag to learn more about the units used for memory.
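Since several comments above trace this to OOM, a quick way to check from inside the container whether memory is the constraint (a sketch assuming a cgroup v1 host; the path differs under cgroup v2):

```shell
# Hypothetical check: print the container's cgroup v1 memory limit, if any.
check_mem_limit() {
  local f=/sys/fs/cgroup/memory/memory.limit_in_bytes
  if [ -r "$f" ]; then
    # An extremely large number here effectively means "unlimited".
    echo "memory limit: $(( $(cat "$f") / 1024 / 1024 )) MiB"
  else
    echo "no cgroup v1 memory limit file found (host may use cgroup v2)"
  fi
}
check_mem_limit
```

If the reported limit is small (say, under 2 GiB for a webserver with 4 gunicorn workers), raising it or reducing the worker count are the obvious levers.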
I took docker-airflow as-is and made two modifications. After
docker-compose up -d
the webserver works fine for a few minutes. When I trigger a DAG manually, before the tasks complete, the webserver crashes with the following error:
[1] {cli.py:723} ERROR - [0 / 0] some workers seem to have died and gunicorndid not restart them as expected
After this, I am unable to access the webserver in my browser. The DAG I triggered contains 7 tasks running in parallel.
I will try with a tutorial DAG to see if I can reproduce the issue.
Docker version: 17.06.0-ce-mac19
OS: OS X El Capitan 10.11.3 (15D21)