hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

[feature] Timeout for batch jobs #1782

Open sheldonkwok opened 7 years ago

sheldonkwok commented 7 years ago

I'm currently running a handful of periodic batch jobs. It's great that Nomad doesn't schedule another one if a current one is still running. However, I think it would be helpful to stop a batch job on timeout if it's running beyond a set time. Maybe a script could be run on timeout, or the job's script itself would just have to handle the signal.

dadgar commented 7 years ago

Hey Sheldon,

You could accomplish this yourself by putting a small wrapper script between Nomad and what you actually want to run: it waits until either the task finishes or the timeout elapses, and then returns exit 1.
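
For anyone looking for a concrete starting point, a minimal sketch of such a wrapper (assuming a bash-capable task environment; the timeout value is a placeholder and the real command is passed as arguments) might look like:

```
#!/usr/bin/env bash
# Sketch of the wrapper idea (not an official Nomad feature): run the real
# workload, kill it if it outlives a deadline, and exit non-zero so Nomad
# records the task as failed. TASK_TIMEOUT is a placeholder.
TASK_TIMEOUT="${TASK_TIMEOUT:-3600}"   # seconds

"$@" &                                 # start the real workload (passed as args)
pid=$!

# Watchdog: after the deadline, send SIGTERM to the workload.
( sleep "$TASK_TIMEOUT" && kill -TERM "$pid" 2>/dev/null ) &
watchdog=$!

wait "$pid"
status=$?
kill "$watchdog" 2>/dev/null           # cancel the watchdog if still pending

# Propagate failure (including a timeout kill) as exit 1.
[ "$status" -eq 0 ] && exit 0 || exit 1
```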

sheldonkwok commented 7 years ago

That's how I'm handling it right now but I was thinking it would be cool if Nomad could do it. I understand if it seems like bloat though :)

OferE commented 7 years ago

+1 - this is an important feature for batch runs. It's not so clean to handle this ourselves.

alxark commented 6 years ago

I think this function should be available not only for batch jobs but also for regular services; that would help us implement a "chaos monkey" function right inside Nomad. It would increase system stability, because the system would have to be ready for downtime of any service.

jippi commented 6 years ago

As mentioned in Gitter chat, the timeout binary in coreutils can do this inside the container if you need a fix right now.

```
timeout 5 /path/to/slow/command with options
```

alxark commented 6 years ago

I think it would be better to add a "max_lifetime" setting, with the ability to specify it as either a range or a concrete value. For example, 10h-20h would mean the daemon might be killed after 11h or after 19h, but 20h at the most. Implementing chaos monkey this way would be a great feature in my opinion, and you wouldn't need any 3rd-party apps =)

shantanugadgil commented 6 years ago

If a timeout function is implemented, it can be used to mimic the classic HPC schedulers like PBS, TORQUE, SGE, etc.

Having it as a first-class feature would indeed be useful for many folks, including me! Hope this does get implemented.

Thanks and Regards, Shantanu

mlehner616 commented 6 years ago

Just adding a use case here. Let's say I have an app and implement a timeout inside the app. Let's assume there's a bug in this app which causes it to hang occasionally under certain conditions, so it never reaches its hard-coded "timeout" because it has essentially stopped responding. We should have a way in the infrastructure to enforce a relatively simple, drop-dead deadline after which the scheduler will kill a task that is unresponsive.

Nomad is better equipped to provide that failsafe protection at the infrastructure level than rewriting timeout code in every app, simply because it doesn't rely on the app's runtime to perform that kill.

shantanugadgil commented 6 years ago

I agree, but as a temporary workaround, how about using timeout with its kill-after parameter?

http://man7.org/linux/man-pages/man1/timeout.1.html
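
For example (the durations here are arbitrary), coreutils timeout can send SIGTERM at the deadline and escalate to SIGKILL if the process ignores it:

```
# Send SIGTERM after 1 hour; if the command is still alive 30s later, SIGKILL it.
timeout --signal=TERM --kill-after=30s 1h /path/to/slow/command with options
```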

Miserlou commented 5 years ago

+1 for this, it's basic functionality for a job scheduler. Amazing this doesn't exist. @mlehner616 is obviously correct about why having the timeout checker inside the container itself is a boneheaded recommendation. We got bit by 3 hung jobs out of 100,000 that prevented our elastic infrastructure from scaling back down, costing a nice chunk of change.

AndrewSav commented 5 years ago

@Miserlou as mentioned earlier in this thread, a workaround would be to wrap your app in a timeout script. There is an example of how you can do it above. That might save your bacon in the scenario you described.

onlyjob commented 5 years ago

Timeout for batch jobs is an important safeguard. We can't rely on jobs' good behaviour... A job without a time limit is effectively a service, hence a timeout is crucial to constrain buggy tasks that might run for too long...

wiedenmeier commented 5 years ago

I'd also very much like to see nomad implement this, for the use case where nomad's parameterized jobs are used as a bulk task processing system, similar to the workflow described here: https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch

There are major advantages for us in using this workflow, as it takes advantage of infrastructure already in place to handle autoscaling, rather than having to set up a new system using Celery or similar task queues. The lack of a built-in timeout mechanism for batch jobs makes the infrastructure required for this fairly common (afaik) use case quite a bit more complex.

Handling the timeout in the tasks themselves is not a safe approach, for the reasons mentioned above, and would also increase the complexity of individual tasks, which is not ideal. Therefore the dispatcher must manage the timeout itself and kill batch jobs once it has been reached. This makes it inconvenient to manage jobs which need different timeouts using a single bulk task management system, as configuration for these needs to be stored centrally, separate from the actual job specification.

There are workarounds for this, but it would be very nice to see nomad itself handle timeouts, both for safety and to simplify using nomad.
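
In the meantime, a rough sketch of the dispatcher-managed deadline using the standard nomad CLI might look like the following (the job name, timeout value, and output parsing are placeholders, not a recommended implementation):

```
#!/usr/bin/env bash
# Dispatch a parameterized batch job and enforce a deadline from the outside.
JOB="my-param-job"   # placeholder job name
TIMEOUT=1800         # placeholder deadline in seconds

# "nomad job dispatch -detach" prints the dispatched job ID; parsing it this
# way is deliberately naive.
dispatched=$(nomad job dispatch -detach "$JOB" | awk '/Dispatched Job ID/ {print $NF}')

elapsed=0
while nomad job status -short "$dispatched" | grep -Eq 'Status += +running'; do
  if [ "$elapsed" -ge "$TIMEOUT" ]; then
    echo "Dispatched job $dispatched exceeded ${TIMEOUT}s, stopping it"
    nomad job stop "$dispatched"       # central kill once the deadline passes
    break
  fi
  sleep 10
  elapsed=$((elapsed + 10))
done
```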

Miserlou commented 5 years ago

@Magical-Chicken - I strongly, strongly recommend you avoid using Nomad for that purpose. There are claims made in that blog post which are simply untrue, and many people are being duped by it.

See more here: https://github.com/hashicorp/nomad/issues/4323#issuecomment-426419394

wiedenmeier commented 5 years ago

@Miserlou Thanks for the heads up, that is a pretty serious bug in nomad, and is pretty concerning since we have a large amount of infrastructure managed by it. The volume of dispatches we are handling currently isn't too high, so I'm hoping nomad will be ok to use here in the short term, but long term I will definitely consider switching this system over to a dedicated task queue.

Regarding "Nomad will crash with out of memory": hopefully HashiCorp intends to fix this. Maybe they could add a configuration option for servers to use a memory-mapped file to store state rather than risking an OOM kill, or even have servers start rejecting additional job registrations when they are running out of memory. There's really no case where it is acceptable for servers to crash completely, or for secondary servers to fail to elect a new leader after the leader is lost.

jxgriffiths commented 5 years ago

+1. Are there any plans to include this feature any time soon? It seems pretty important. Wrapping tasks in a timeout script is a bit hacky.

epetrovich commented 5 years ago

+1

grainnemcknight commented 5 years ago

+1

sabbene commented 4 years ago

A job run-time limit is an essential feature of a batch scheduler. All major batch schedulers (PBS, Slurm, LSF, etc.) have this capability. I've seen a growing interest in a tool like Nomad, something that combines many of the features of a traditional batch scheduler with Kubernetes-style orchestration. But without a run-time limit feature, integration into a traditional batch environment would be next to impossible. Is there any timeline on adding this feature to Nomad?

karlem commented 4 years ago

+1

shantanugadgil commented 4 years ago

@karlem you should add a +1 to the first post rather than in a separate message. That's how they track demand for a feature.

If you know more folks who might be interested in this, you should encourage them to do so as well! 😉

BirkhoffLee commented 2 years ago

The absence of this feature just killed my cluster. A curl periodic job piled up to 600+ pending instances, with tens of them running. This caused very high disk I/O usage from Nomad and effectively rendered the affected nodes totally unresponsive. Then Consul decided to stop working as well, because of I/O timeouts from other nodes.

Of course you could argue that curl has built-in timeout options, but the point is that if a task scheduler does not provide this feature, there is no simple and unified way to keep all jobs organised and safe when they can each decide on their own how long they want to run.

smaeda-ks commented 2 years ago

GitHub Actions self-hosted runners with autoscaling are another good example, I think. It's very much possible to run runners on Nomad as batch jobs and autoscale them using parameterized batch jobs, so tasks can be easily dispatched and triggered by GitHub webhooks upon receiving the queued events. Having max-lifetime support would be a great safeguard for this kind of dynamic job scheduling integrated with third-party systems.

danielnegri commented 1 year ago

+1

NickJLange commented 1 year ago

While a safety catch (timeout) is definitely a gap in the product, I don't think it captures the use case I'm looking for in #15011. I am looking for a stanza to run my type=service job M-F, from 08:00-20:00, with a user-defined stop command when a driver supports it.

shantanugadgil commented 1 year ago

Today I hit this for Docker jobs. Our system is full of Docker cron jobs, and one job was stuck for 20 (twenty) days. :woozy_face:

Without a timeout parameter, could there be some other "systematic" way to detect stuck jobs (or jobs running for too long)?

jxgriffiths commented 1 year ago

We have some python scripts that hit the allocations api to get this type of information and trigger alerts / remediations in our monitoring system.

This may be a little different with the docker driver / your implementation, but we just reverse-sort allocations by CreateTime and look for a specific prefix (xxx-periodic), where xxx is the job name. This tells us when the last allocation happened.
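
Roughly the same lookup with curl and jq (assuming a hypothetical job named example whose periodic instances follow Nomad's usual example/periodic-<timestamp> job ID naming) might look like:

```
# Latest allocation of a periodic job, newest first by CreateTime (nanoseconds).
curl -s "${NOMAD_ADDR:-http://127.0.0.1:4646}/v1/allocations" |
  jq -r '[ .[] | select(.JobID | startswith("example/periodic")) ]
         | sort_by(.CreateTime) | reverse | .[0]
         | "\(.ID) \(.ClientStatus) \(.CreateTime)"'
```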

shantanugadgil commented 1 year ago

@jxgriffiths thanks for the idea ...

Since your post, we have been putting together a standalone Nomad checker job which goes through all batch jobs to figure out "stuck" allocations.

The allocation-based search was easy enough using a combination of curl, jq, date and bash. (We wanted to avoid Python as much as possible.)

We also ended up putting together a jobs endpoint query for figuring out pending jobs too, but I think that is easily discoverable via metrics.

The subsequent question was how to individually tune the timeout for each job.

What we have done is to add a job-level meta parameter, which the checker job uses as the configuration for when to eventually kill the particular job.

In case one has multiple groups/tasks in a batch job, one could also move the meta down into the groups or tasks as per requirement.
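
A rough sketch of this kind of checker, assuming the standard Nomad HTTP API and CLI and a hypothetical batch_timeout_seconds meta key (our own convention, not a Nomad built-in), could look like:

```
#!/usr/bin/env bash
# Kill running allocations whose age exceeds a per-job meta timeout.
# Error handling is omitted; the meta key name is a local convention.
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
now_s=$(date +%s)

curl -s "$NOMAD_ADDR/v1/allocations" |
  jq -r '.[] | select(.ClientStatus == "running") | "\(.ID) \(.JobID) \(.CreateTime)"' |
  while read -r alloc_id job_id create_ns; do
    # Jobs without the meta key simply opt out of the checker.
    # (Child job IDs containing "/" may need URL-encoding depending on your setup.)
    timeout_s=$(curl -s "$NOMAD_ADDR/v1/job/$job_id" | jq -r '.Meta.batch_timeout_seconds // empty')
    [ -z "$timeout_s" ] && continue
    age_s=$(( now_s - create_ns / 1000000000 ))   # CreateTime is in nanoseconds
    if [ "$age_s" -gt "$timeout_s" ]; then
      echo "alloc $alloc_id of $job_id has run ${age_s}s (> ${timeout_s}s), stopping"
      nomad alloc stop "$alloc_id"
    fi
  done
```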