Open lgfa29 opened 3 years ago
You had asked for feedback in #61. I've been using the community cron plugin for a day for a use case similar to your example above. It works well for me so far (after some finagling, since I operate `nomad-autoscaler` as a Docker image).
While that plugin's cron notation is very simple, e.g. `period_business = "* * 9-17 * * mon-fri * -> 5"`, I prefer to think in terms of the `start` and `end`/`duration` you present here. For example, it is hard to use that line to express 9:00:00 to 16:30:00.
Adding clock scheduling as a first-class `enabled_schedule` feature is much more powerful. As you show, you can then apply different kinds of policies, rather than just modifying a count.
One difference in our use case, relative to your example, is that we start/stop the services daily. That would be more difficult to express if we only had start/end intervals (we would need to configure multiple intervals), but start+duration fixes that. Outside that interval, I want the count to be 0, which I think would happen if I configure the group's `count = 0`?
I note that hashicorp/cronexpr has this warning:
As of now, the behavior of the code is undetermined if a malformed cron expression is supplied
So if this were to be a production plugin, one would expect either that repo to fix that issue or the plugin to validate the cron entries.
When documenting this, be sure to indicate how timezones work in the cron expression: is it UTC or local? System `cron` works on local time. Maybe requiring UTC expressions simplifies working with regions and datacenters.
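To make the ambiguity concrete, here is a minimal sketch of how the same "09:00 every day" schedule resolves to different instants depending on the zone it is interpreted in. The fixed UTC-5 offset and the helper name `next_nine_am` are assumptions made only for illustration; they are not part of the plugin or of the autoscaler.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical "local" zone: a fixed UTC-5 offset, used only for illustration.
LOCAL = timezone(timedelta(hours=-5))


def next_nine_am(after: datetime, tz: timezone) -> datetime:
    """Return the next 09:00 wall-clock time in tz that is strictly after `after`."""
    local_now = after.astimezone(tz)
    candidate = local_now.replace(hour=9, minute=0, second=0, microsecond=0)
    if candidate <= local_now:
        candidate += timedelta(days=1)
    return candidate


now = datetime(2023, 3, 1, 0, 0, tzinfo=timezone.utc)
utc_start = next_nine_am(now, timezone.utc)
local_start = next_nine_am(now, LOCAL)

# Both values are printed in UTC so the gap between them is visible.
print(utc_start.astimezone(timezone.utc))
print(local_start.astimezone(timezone.utc))
```

With these inputs the two start times are five hours apart, which is why the documentation needs to state which interpretation applies.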
Thank you for the feedback @neomantra, this is very helpful. And very good point about the timezones; in general I would say they should always be UTC-based.
One difference in our use case, relative to your example, is that we start/stop the services daily. That would be more difficult to express if we only had start/end intervals (we would need to configure multiple intervals)
I think this would be covered by these two blocks from the example?
# Scale to 10 instances from 9:00AM to 4:30PM.
check "market_hours" {
  enabled_schedule {
    start = "0 9 * * *"
    end   = "30 16 * * *"
  }
  strategy "fixed-value" {
    count = 10
  }
}
# Scale to 3 instances from 4:30PM to 9:00AM.
check "off_market_hours" {
  enabled_schedule {
    start = "30 16 * * *"
    end   = "0 9 * * *"
  }
  strategy "fixed-value" {
    count = 3
  }
}
Except that you would have `count = 0` as you mentioned (though scaling to 0 may cause Nomad to consider the job as `dead` and cause some issues, but I would need to double check).
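As a rough sketch of how the two complementary windows above could be evaluated, the function below treats each window as a half-open time-of-day range and handles the case where the window crosses midnight. The half-open boundary rule and the names `in_window`, `MARKET` and `OFF_MARKET` are assumptions for illustration; the proposal itself defines windows with cron expressions and does not specify boundary behaviour.

```python
from datetime import time


def in_window(now: time, start: time, end: time) -> bool:
    """Return True if now falls in [start, end), allowing the window to cross midnight."""
    if start <= end:
        return start <= now < end
    # Window wraps around midnight, e.g. 16:30 -> 09:00.
    return now >= start or now < end


# Time-of-day equivalents of the two checks (assumed, simplified from the cron form).
MARKET = (time(9, 0), time(16, 30))
OFF_MARKET = (time(16, 30), time(9, 0))

for probe in (time(10, 0), time(16, 30), time(23, 0), time(3, 0)):
    print(probe, in_window(probe, *MARKET), in_window(probe, *OFF_MARKET))
```

Under these assumptions exactly one of the two checks is enabled at any given minute, so the daily start/stop case does not need more than two blocks.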
@lgfa29 I see now, I'm not sure what I was thinking then.
How would conflict resolution work? An obvious case would be overlapping `start`/`end`s, one with `fixed-value: count=3` and the other with `fixed-value: count=5`... but it gets more complex with the general `strategy` stanza. Would the Nomad scheduler just complain and the operator would have to resolve it somehow?
It would be hard to confirm the impact in a pre-flight check, besides the naive "no overlapping start/stop schedules"; maybe do that and have some override for the advanced uses.
How would conflict resolution work?
Conflict resolution would be handled the same way it's currently done, with the safest `check` being picked. From our docs:
The checks are executed at the same time during a policy evaluation and the results can conflict with each other. In a scenario like this, the autoscaler iterates the results and chooses the safest result, which results in retaining the most capacity of the resource.
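For the overlapping `count=3` / `count=5` case raised above, the rule quoted from the docs can be sketched as "keep the result with the highest desired count". This is a simplified model, not the autoscaler's actual code: the function name `safest_result` and the plain mapping of check name to count are assumptions for illustration.

```python
def safest_result(results):
    """Pick the check whose desired count retains the most capacity.

    `results` maps a check name to the count that check asks for.
    """
    if not results:
        raise ValueError("no check results to choose from")
    name = max(results, key=results.get)
    return name, results[name]


# Two overlapping fixed-value checks: the one asking for more capacity wins.
print(safest_result({"overlap_a": 3, "overlap_b": 5}))  # ('overlap_b', 5)
```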
Is this feature implemented or do we have to use the community plugin for this?
Would love to see this as native functionality as well. We currently scale down to 0 at the end of each business day and then back up in the morning, but the autoscaler prevents this by scaling back up to the target level every time the autoscaling group tries to scale down per its schedule.
Hi @patademahesh 👋
No, this feature has not been implemented yet. I have not used the plugin, so I can't provide any guidance there, but you should give it a try and see if it meets your needs. We initially tried to implement this as a plugin but we found some use cases that would not be possible to handle, so this would require some changes in the Autoscaler core.
@schematis that's the main use case we envision for this, but we haven't had the chance to properly roadmap it yet. If you haven't already, don't forget to add a 👍 to the original message so we can properly gauge interest.
Hello @lgfa29, maybe a good addition to this issue would be the ability to restart a job at a specific time.
E.g.:
In trading we need an app (raw_exec) to be restarted precisely at (for example) 09:58 AM. Right now we're doing it with an in-house cron-like scheduler. Using the autoscaling feature described in this issue would mean scaling down to 0 at 9:57 and back up to 1 at 9:58, which would leave a minute without the app running, which is not good.
Something like:
# Restart app at 9:58 every business day
schedules {
  restart = "58 9 * * 1-5" # "At 09:58 on every day-of-week from Monday through Friday."
}
Hi @caiodelgadonew 👋
I think job restart falls outside the scope of the autoscaler. I would also be worried about using the autoscaler, an eventually consistent system, to perform time-sensitive tasks. Even with this proposal we can't really guarantee that your policy will be executed exactly at 9:57, since policies are evaluated at an interval (grey arrows in the diagram) that may not be aligned with the specific time you need.
For scheduled one-shot operations a `periodic` job may be more appropriate.
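The alignment problem can be illustrated with a small calculation. This sketch assumes evaluations fire at a fixed interval from some arbitrary first tick and ignores evaluation time and queueing; the 30-second interval, the 09:00:07 first tick and the function name are hypothetical values chosen for the example.

```python
import math
from datetime import datetime, timedelta


def first_evaluation_at_or_after(first_eval, interval, target):
    """Return the first periodic evaluation time that is not before `target`."""
    if target <= first_eval:
        return first_eval
    ticks = math.ceil((target - first_eval) / interval)
    return first_eval + ticks * interval


first_eval = datetime(2023, 3, 6, 9, 0, 7)   # evaluations happen to start at 09:00:07
interval = timedelta(seconds=30)             # hypothetical evaluation interval
target = datetime(2023, 3, 6, 9, 57, 0)      # the moment the window opens

hit = first_evaluation_at_or_after(first_eval, interval, target)
print(hit, "delay:", hit - target)
```

Here the window opens at 09:57:00 but the policy is first evaluated at 09:57:07, and in general the delay can be anything up to one full interval.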
I agree with you. Maybe not a topic for here, but something missing in the UI is the ability to "restart all job allocs". This can be done from the CLI, but in the UI we need to go through all allocs and restart them one by one. I will check later whether that exists in the API.
There's no API for that yet because async coordinated alloc restarts are very tricky to implement; that's why the `nomad job restart` command implements this logic client-side (from your terminal). The PR covers a little bit about this: https://github.com/hashicorp/nomad/pull/16278.
If all you need is a simple loop of restarts you can use the `/v1/job/:job_id/allocations` endpoint and then call `/v1/client/allocation/:alloc_id/restart` on each of them.
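A minimal sketch of that loop, using only the two endpoints named above. Several things here are assumptions: the agent address, the absence of an ACL token or TLS, skipping allocations whose `ClientStatus` is not `running`, and sending the restart as a `POST` with no body. The helper names `http_request` and `restart_job_allocs` are hypothetical, and the loop is purely sequential with no batching or health checks, unlike `nomad job restart`.

```python
import json
import urllib.request

NOMAD_ADDR = "http://127.0.0.1:4646"  # assumed agent address; adjust as needed


def http_request(method, url):
    """Send a request and return the decoded JSON body, or None if it is empty."""
    req = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def restart_job_allocs(job_id, request=http_request, addr=NOMAD_ADDR):
    """Restart every running allocation of a job, one at a time."""
    allocs = request("GET", f"{addr}/v1/job/{job_id}/allocations")
    restarted = []
    for alloc in allocs:
        # Assumption: only allocations that are currently running are restarted.
        if alloc.get("ClientStatus") != "running":
            continue
        request("POST", f"{addr}/v1/client/allocation/{alloc['ID']}/restart")
        restarted.append(alloc["ID"])
    return restarted
```

Calling `restart_job_allocs("my-job")` against a reachable agent would return the IDs it attempted to restart; the `request` parameter exists so the HTTP layer can be swapped out, for example to add an `X-Nomad-Token` header.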
Autoscaling is usually a response action to some observed change in workload. But in some scenarios, the workload change has a well-defined and predictable periodicity. For these types of load, being able to preemptively schedule changes would be very useful.
The schedule-based autoscaling feature will allow operators to control a time window for when policies or individual `check`s are enabled or disabled. Policies are still evaluated at the interval defined by the `evaluation_interval` attribute, but when the evaluation falls outside this time window, the policy or `check` will have no effect.
This will be done using a new block called `enabled_schedule` that can be placed inside a `policy` or `check` block. This new block will take a `start` cron expression that defines when the enabled time window starts. To define the end limit of the window, either an `end` cron expression or a `duration` string formatted as a Go duration can be passed.
The following examples define the same time window for when this policy is enabled: Mondays through Fridays from midnight to 11:59PM.
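The equivalence between an `end` expression and a `duration` can be checked with a short calculation. This is a sketch under stated assumptions: `parse_go_duration` is a hypothetical helper that accepts only a small subset of Go duration syntax (whole `h`, `m` and `s` components), and the window start is hard-coded to one Monday at midnight rather than derived from a cron expression.

```python
import re
from datetime import datetime, timedelta


def parse_go_duration(text):
    """Parse a subset of Go duration strings: integer h, m and s components only."""
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything the simplified pattern did not consume in full (e.g. "10ms").
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(n) * units[u] for n, u in parts))


window_start = datetime(2023, 3, 6, 0, 0)                  # a Monday at midnight
end_from_cron = window_start.replace(hour=23, minute=59)   # an end at 23:59 the same day
end_from_duration = window_start + parse_go_duration("23h59m")

print(end_from_cron == end_from_duration)  # True
```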
This approach allows operators to use the strategy that best fits their use case. A policy inside a job file would look as follows: