mumoshu / kube-airflow

A docker image and kubernetes config files to run Airflow on Kubernetes
Apache License 2.0

Scheduler defaults to 5 runs, so it goes into a CrashLoopBackOff when deployed #20

Open jdavidheiser opened 6 years ago

jdavidheiser commented 6 years ago

https://github.com/mumoshu/kube-airflow/blob/01ae78ad7dd8ab0037164f280530a21a1ca7057d/airflow.all.yaml#L194

This causes the file read loop to happen five times, then the scheduler exits. It seems like a strange default setup.
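For reference, the linked line is part of the scheduler container definition; paraphrased (not copied verbatim from the manifest), it boils down to something like:

# Paraphrased from the linked airflow.all.yaml line: the scheduler container
# is started with "-n 5", i.e. it exits after five scheduler runs.
args: ["scheduler", "-n", "5"]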

I'm a bit confused about why this is set this way - shouldn't the scheduler be looping indefinitely? I'm also seeing the scheduler fail to queue up tasks, same as #19, and I wonder whether this is the cause there, or something else.

gsemet commented 6 years ago

Airflow is weird. The whole purpose of this setting is to let the scheduler kill itself periodically so it reloads the DAGs. In Kubernetes this does not have a huge impact, since the scheduler will be restarted automatically; the whole kill/restart cycle can take a while, but Airflow does not work at sub-second precision anyway.

-1 means you can never update your DAGs; 1 means the scheduler kills itself at every task launch.
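In other words (a sketch, based on the explanation above), the value passed after -n controls how many scheduler loops run before the process exits, so overriding it in the container args changes the restart behaviour:

# Meaning of the "-n" (number of runs) argument, per the comment above:
#   -1 : run forever (DAG updates may never be picked up)
#    1 : exit after a single scheduler run
#    5 : the default shipped in this repo's manifest
args: ["scheduler", "-n", "-1"]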

jdavidheiser commented 6 years ago

I feel like it would have less impact in plain Docker, but with Kube managing the pods it puts the cluster in a not-happy state with backoffs, because the exiting script looks like a crash. Thanks for the heads-up on the motivation to exit after a few runs - I'm going to modify the start shell script in my version of the Docker container. I think it makes sense to run the scheduler in a while loop but break if it returns a bad exit code, so Kube can still treat those incidents as real crashes.
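A rough sketch of that idea, assuming the image lets you override the entrypoint with a shell command (the loop could equally live in the repo's start script instead of the manifest):

# Sketch only: wrap the scheduler in a shell loop so clean "-n" exits restart
# in place, while a non-zero exit code still surfaces to Kubernetes as a crash.
containers:
- name: scheduler
  image: <image-location>
  command: ["/bin/bash", "-c"]
  args:
    - |
      while true; do
        airflow scheduler -n 5
        rc=$?
        if [ "$rc" -ne 0 ]; then
          exit "$rc"   # real failures still trigger CrashLoopBackOff
        fi
      done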

gsemet commented 6 years ago

Feel free to submit a pull request. I do have my scheduler restarting regularly and I don't see problems, except that it takes a few minutes to start up (which delays the next DAG run).

ryan-riopelle commented 5 years ago

The issue I had with Kubernetes is that it tracks the number of restarts, so if you run this application indefinitely you could see large restart counts over a long period of time, which would be a red flag to an administrator who runs "kubectl get pods" on the cluster, unless I am understanding it wrong.

As a solution, maybe this pod could be run as a Kubernetes CronJob or Job. The change in YAML would be similar to the below, but I have not fully debugged it yet.

Would this break the way the scheduler works?

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: scheduler
  labels:
    app: airflow
    tier: scheduler
spec:
  schedule: "*/2 * * * *" # every 2 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: scheduler
            image: <image-location>
            # volumes:
            #     - /localpath/to/dags:/usr/local/airflow/dags
            env:
            - name: AIRFLOW_HOME
              value: "/usr/local/airflow"
            args: ["scheduler", "-n", "5"]
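
One caveat with the CronJob idea (a sketch, not something tested in this thread): if a scheduler run started with -n 5 outlives the two-minute schedule interval, Kubernetes could start a second scheduler alongside the first. Adding a concurrency policy to the CronJob spec guards against that:

spec:
  # Sketch: extra CronJob field (not in the draft above) to avoid two
  # schedulers running at once when a run overlaps the next trigger.
  concurrencyPolicy: Forbid   # skip a trigger if the previous job is still running
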
aditinabar commented 5 years ago

@gsemet How/where did you change the config for the scheduler to restart automatically? I'm not seeing it in airflow.cfg.

Lord-Y commented 5 years ago

@gsemet when the scheduler's -n arg is not -1, it will restart and then go into CrashLoopBackOff later. You can see it in the helm chart.