villasv / aws-airflow-stack

Turbine: the bare metals that gets you Airflow
https://victor.villas/aws-airflow-stack/
MIT License

Scaling back instances might kill running tasks #36

Closed. villasv closed this issue 5 years ago

villasv commented 6 years ago

The stack's autoscaling model is not flexible at all: it scales in response to the queue length. Except for DAGs consisting solely of quick operations, this is problematic.

If the downscaling delay in response to an empty queue is N seconds, any task that takes N+1 seconds might get killed. That's just the obvious example; you don't need execution times anywhere near N to run into problems, because with many tasks the average queue length can still hover close to zero.

It's very unlikely that there can be a one-size-fits-all autoscaling strategy; this one was implemented because it was easy and useful. The problem is that we should strive for transparent infrastructure and should not have things like an Operator that allocates 5 machines, for example. DAGs should be all about data.

villasv commented 6 years ago

There's not much to do except to recognize that autoscaling workers is a difficult task to automate without proper knowledge of the tasks being executed. Here's what can be done:

Make the Shrink alarm trigger only on queues that have stayed empty (use the Minimum statistic) for a very long period of time (longer than any task should take). It should default to something big (like 6 hours) and be adjustable with a template parameter.
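
For reference, a rough sketch of what that alarm could look like through the AWS CLI (the stack itself defines this in CloudFormation; the names, queue and thresholds below are made up for illustration):

# 72 evaluation periods of 300 seconds = 6 hours of a continuously empty queue
aws cloudwatch put-metric-alarm --alarm-name 'airtest-shrink-alarm' --namespace AWS/SQS --metric-name ApproximateNumberOfMessagesVisible --dimensions Name=QueueName,Value=airtest-task-queue --statistic Minimum --period 300 --evaluation-periods 72 --threshold 1 --comparison-operator LessThanThreshold --alarm-actions <scale-in-policy-arn>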

Ideally we should watch for running tasks directly, but I can't see an immediately available way to achieve that. They're not visible from AWS resources, only through a SQL query against the Airflow metadata database.
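
For illustration only (assuming the default Postgres metadata database; task_instance and its state column come from Airflow's schema, while the connection URI is a placeholder), something like this would tell us whether anything is still queued or running:

psql "$AIRFLOW_DB_URI" -c "SELECT state, count(*) FROM task_instance WHERE state IN ('queued', 'running') GROUP BY state;"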

villasv commented 5 years ago

As long as this has a default, no matter how big, it's an opinionated aspect that may cause bugs. Let's just expose the period of the statistic as a parameter and let people come up with their own adjustment.

villasv commented 5 years ago

Strategy Proposal

Reasoning

Scaling in always has a chance of killing that one worker currently running a task, if there is one. So we should avoid scaling in until we're reasonably sure that all tasks are done.

Currently we can't actually tell if there are tasks running, so we can only guess that all tasks are done based on their duration. So scaling in should happen Z seconds after the queue becomes empty, where Z is greater than the maximum running time of any task.

If the maximum running time has a high variance, we can use our parameters to gamble. By picking a timeout Z that has only a small probability of being exceeded, we accept that same small probability of accidentally killing a task. The killed task then gets rescheduled, resetting the Z counter.

These probabilities depend on the task characteristics, but also on the cooldown B. The more time we take between scale-in actions, the less likely it becomes that we accidentally kill a running task.
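
To make the gamble concrete (an illustrative back-of-the-envelope, not a guarantee): if p is the probability that a task runs longer than Z, each scale-in that fires while such a task is running kills it with probability roughly p, and the cooldown B caps how often that can happen. So the expected number of accidental kills per day is bounded by roughly

p * (86400 / B)

Raising Z drives p down; raising B reduces the number of chances to get unlucky. Both knobs trade scale-in responsiveness for the safety of running tasks.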

villasv commented 5 years ago

I believe this is close to the best we can get if we reuse workers between tasks. Perhaps one day we will use the KubernetesExecutor, where we have complete control with one worker per task, or maybe even something similar with ECS in the future.

villasv commented 5 years ago

Under the new autoscaling model, this is still a thing. I think it's in the nature of autoscaling (with SQS at least) to run the risk of interrupting a task. This issue has been closed in favor of https://github.com/villasv/aws-airflow-stack/issues/65.

Once #65 is solved, killing busy workers is not a big deal for short-lived tasks. Long-lived tasks of course require much more caution, so autoscaling needs to be configured accordingly.

villasv commented 5 years ago

https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html

aws autoscaling record-lifecycle-action-heartbeat --instance-id $(ec2-metadata -i | awk '{print $2}') --region us-east-1 --lifecycle-hook-name 'airtest-scaling-lfhook' --auto-scaling-group-name 'airtest-scaling-group'
aws autoscaling complete-lifecycle-action --lifecycle-action-result CONTINUE --instance-id $(ec2-metadata -i | awk '{print $2}') --region us-east-1 --lifecycle-hook-name 'airtest-scaling-lfhook' --auto-scaling-group-name 'airtest-scaling-group'
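
The two commands above respectively keep the pending termination alive (heartbeat) and let it proceed (complete). For context, a hook like the one referenced there could be created along these lines (a sketch; the actual stack sets it up in CloudFormation, and the timeout value here is arbitrary):

aws autoscaling put-lifecycle-hook --lifecycle-hook-name 'airtest-scaling-lfhook' --auto-scaling-group-name 'airtest-scaling-group' --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING --heartbeat-timeout 300 --default-result CONTINUE
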
villasv commented 5 years ago

The hook is in place; now we need to take advantage of systemd's ExecStop and the systemd.kill settings KillMode and KillSignal.
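
A minimal sketch of what that could look like, assuming the worker runs as a systemd unit named airflow-worker (the unit name, signal choices and timeout are assumptions, not the stack's actual configuration):

# hypothetical drop-in: SIGTERM asks Celery for a warm shutdown, SIGKILL only after the stop timeout
sudo tee /etc/systemd/system/airflow-worker.service.d/override.conf <<'EOF'
[Service]
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=600
EOF
sudo systemctl daemon-reload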

villasv commented 5 years ago

Warm shutdown has been working since e30e8eddc72078fa2da69ed0c340ec78b5126103, but the autoscaling lifecycle hook timeout alone doesn't yield good results. It's either too short, and long-running processes get killed, or too long, and every machine hangs around for a long time before terminating.

Gotta implement another side service that keeps issuing heartbeats while the pgroup is still active.
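
A minimal sketch of such a heartbeat loop, reusing the hook and group names from the earlier commands and assuming the worker's process group can be spotted with pgrep (the real implementation may differ):

INSTANCE_ID=$(ec2-metadata -i | awk '{print $2}')
# keep the lifecycle action pending while the worker process group is still alive
while pgrep -f 'celery worker' > /dev/null; do
  aws autoscaling record-lifecycle-action-heartbeat --instance-id "$INSTANCE_ID" --region us-east-1 --lifecycle-hook-name 'airtest-scaling-lfhook' --auto-scaling-group-name 'airtest-scaling-group'
  sleep 60
done
# nothing left running, let the autoscaling group finish terminating the instance
aws autoscaling complete-lifecycle-action --lifecycle-action-result CONTINUE --instance-id "$INSTANCE_ID" --region us-east-1 --lifecycle-hook-name 'airtest-scaling-lfhook' --auto-scaling-group-name 'airtest-scaling-group'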

villasv commented 5 years ago

Finally closed in the development branch at 465b9b2b0d2c299b2d5654dbf256a82ca938d0f5 🎉