ministryofjustice / analytics-platform

Parent repository for the MOJ Analytics Platform
MIT License

Airflow tasks are killed on Kubernetes #118

Closed isichei closed 1 year ago

isichei commented 4 years ago

What happened?

Airflow tasks (run as Pods on Kubernetes) stop, or are killed, unexpectedly; the logs state that a SIGTERM is sent to the task. The failures are intermittent, with no common theme around when they happen, and so far we have not been able to map them to any specific script, package or Docker image. When a SIGTERM is sent, it appears to reach every task running at that moment, even tasks that belong to different DAGs. This makes us believe that something outside of Airflow is sending the SIGTERMs to the Pods, although we are not certain.

Example of Airflow task logs:

[2020-05-15 07:31:52,314] {logging_mixin.py:95} INFO - [2020-05-15 07:31:52,314] {jobs.py:2536} ERROR - Received SIGTERM. Terminating subprocesses
[2020-05-15 07:31:52,315] {helpers.py:281} INFO - Sending Signals.SIGTERM to GPID 15
[2020-05-15 07:31:52,315] {__init__.py:1416} ERROR - Received SIGTERM. Terminating subprocesses.
[2020-05-15 07:31:53,024] {__init__.py:1580} ERROR - Pod Launching failed: Task received SIGTERM signal
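
To make occurrences easier to compare, the SIGTERM lines can be pulled out of a task log with a small script. This is a minimal sketch that assumes every log line follows the `[timestamp] {file:line} LEVEL - message` layout seen in the snippet above; the function name `sigterm_events` is hypothetical and not part of Airflow.

```python
import re
from datetime import datetime

# Matches lines shaped like the snippet above, for example:
# [2020-05-15 07:31:52,315] {helpers.py:281} INFO - Sending Signals.SIGTERM to GPID 15
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] "
    r"\{(?P<source>[^}]+)\} (?P<level>[A-Z]+) - (?P<message>.*)$"
)


def sigterm_events(lines):
    """Return (timestamp, source, message) for each log line mentioning SIGTERM.

    Hypothetical helper: lines that do not match the assumed layout are skipped.
    """
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and "SIGTERM" in match.group("message"):
            # %f reads the millisecond field after the comma as a fraction of a second
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, match.group("source"), match.group("message")))
    return events
```

Running this over the logs of each affected task gives a list of timestamps that can be compared between Pods.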

Further Notes

Potential guesses for cause of SIGTERM:

Additional findings:

davidread commented 4 years ago

This most recently occurred at 07:31 UTC on Friday 15 May 2020, as per the log snippet above. Example logs of it occurring at the exact same time on more than one pod:

Chat thread: https://mojdt.slack.com/archives/C58G63XK5/p1589535763346400
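
One way to check the "same time on more than one pod" observation across a larger set of logs is to bucket SIGTERM timestamps into short windows and report the windows that contain more than one pod. This is a rough sketch only: the function name, the pod names and the 5-second window are assumptions chosen for illustration, and the timestamps are assumed to have been extracted from each pod's log already.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Fixed reference point so that window boundaries are the same on every run
EPOCH = datetime(1970, 1, 1)


def simultaneous_sigterms(events_by_pod, window_seconds=5):
    """Map each window start to the pods that logged a SIGTERM inside it.

    Only windows containing more than one pod are returned.
    """
    buckets = defaultdict(set)
    for pod, timestamps in events_by_pod.items():
        for ts in timestamps:
            # Round the timestamp down to the start of its fixed-size window
            offset = int((ts - EPOCH).total_seconds()) // window_seconds
            bucket_start = EPOCH + timedelta(seconds=offset * window_seconds)
            buckets[bucket_start].add(pod)
    return {
        start: sorted(pods)
        for start, pods in sorted(buckets.items())
        if len(pods) > 1
    }


# Hypothetical pod names; the first two timestamps are taken from the log snippet above
example = {
    "pod-a": [datetime(2020, 5, 15, 7, 31, 52, 314000)],
    "pod-b": [datetime(2020, 5, 15, 7, 31, 53, 24000)],
    "pod-c": [datetime(2020, 5, 15, 9, 0, 0)],
}
print(simultaneous_sigterms(example))
```

Fixed windows are simple but can split two events that fall either side of a boundary, so a wider window or a second pass with shifted boundaries may be worth trying before ruling a coincidence out.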