hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Provide for dependencies between tasks in a group #419

Closed mfischer-zd closed 2 years ago

mfischer-zd commented 9 years ago

Tasks in a group sometimes need to be ordered to start up correctly.

For example, to support the Ambassador pattern, proxy containers (P[n]) used for outbound request routing by a dependent application may be started only after the dependent application (A) is started. This is because Docker needs to know the name of A to configure shared-container networking when launching P[n].

As a first approximation, ordering could be as simple as making the task list in a group an array.

yishan-lin commented 5 years ago

Hi everyone - thank you for the patience.

We are working on implementing native task dependencies now and are exploring a potential Airflow integration.

We would love your support in adding feedback and your interest to the ticket below, so that the Apache Airflow committee can understand the demand. Ideally, we'd like to optimize the experience by providing a first-class integration, rather than a maintained fork.

https://issues.apache.org/jira/browse/AIRFLOW-5633

cc @jazzyfresh

CarlosDomingues commented 5 years ago

@yishan-lin a Nomad executor for Airflow would be absolutely brilliant.

sfs77 commented 4 years ago

I'm watching this and looking forward to it.

sagarrakshe commented 4 years ago

I faced a similar issue in our deployments, so I created a tool: https://github.com/sagarrakshe/nomad-dtree

DhashS commented 4 years ago

We needed this enough that we implemented it ourselves. We have an AST for Nomad jobs and interpret it to figure out which Consul health checks to watch, wait for their success/fail timeout, and add the unblocked jobs to the work queue.
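For illustration, the unblocking step described above could be sketched in Python roughly as follows. This is not the actual implementation: the `Unblocker` class, the job names, and the check names are all hypothetical, and health-check results are passed in by the caller rather than read from Consul.

```python
from collections import deque


class Unblocker:
    """Tracks which health checks each job still waits on (hypothetical helper)."""

    def __init__(self, waits_on):
        # waits_on maps a job name to the check names that must pass first.
        self.pending = {job: set(checks) for job, checks in waits_on.items()}
        self.queue = deque()
        # Jobs with nothing to wait for are runnable immediately.
        for job in [j for j, checks in self.pending.items() if not checks]:
            del self.pending[job]
            self.queue.append(job)

    def check_passed(self, check):
        """Record a passing check and enqueue every job it unblocks."""
        for job in list(self.pending):
            self.pending[job].discard(check)
            if not self.pending[job]:
                del self.pending[job]
                self.queue.append(job)

    def check_failed(self, check):
        """Drop and return jobs whose required check failed or timed out."""
        failed = [job for job, checks in self.pending.items() if check in checks]
        for job in failed:
            del self.pending[job]
        return failed
```

With `Unblocker({"zookeeper": [], "kafka": ["zookeeper-health"]})`, the queue initially holds only `zookeeper`; calling `check_passed("zookeeper-health")` appends `kafka`.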

recursionbane commented 4 years ago

Agreed, we could not wait, either.

We ended up writing a DAG parser to evaluate eligibility of a node based on complex boolean dependencies, only exposing eligible nodes to Nomad for scheduling.

Not ideal, since we are now reliant on a single-threaded process for scheduling, but we are able to schedule several thousand jobs per minute this way. This might pay off in the long term, since it is unlikely Nomad's dependency roadmap includes boolean/complex dependencies.
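As a rough idea of what evaluating eligibility against boolean dependencies can look like, here is a minimal Python sketch. The expression format (nested tuples with "and", "or", "not") and the function names are assumptions made for this example, not the parser described above.

```python
def eligible(expr, done):
    """Evaluate a dependency expression against the set of finished jobs.

    expr is either a job name (str) or a tuple:
    ("and", e1, e2, ...), ("or", e1, e2, ...), or ("not", e).
    """
    if isinstance(expr, str):
        return expr in done
    op, *args = expr
    if op == "and":
        return all(eligible(a, done) for a in args)
    if op == "or":
        return any(eligible(a, done) for a in args)
    if op == "not":
        return not eligible(args[0], done)
    raise ValueError(f"unknown operator: {op!r}")


def schedulable(deps, done):
    """Return unfinished jobs whose dependency expression is satisfied."""
    return [
        job
        for job, expr in deps.items()
        if job not in done and (expr is None or eligible(expr, done))
    ]


# Hypothetical job graph: "load" needs "transform" and either database.
deps = {
    "extract": None,
    "transform": "extract",
    "load": ("and", "transform", ("or", "db-primary", "db-replica")),
    "db-primary": None,
    "db-replica": None,
}
```

Only the jobs returned by `schedulable(deps, done)` would be handed to Nomad; the caller re-evaluates after each job finishes.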

yishan-lin commented 4 years ago

Hey all, for those who missed our Nomad Virtual Day livestream last week: task dependencies are coming in Nomad 0.11, and you will hear more about them in the coming weeks.

For reference, here is a recording of the wonderful demo and presentation that @jazzyfresh gave on the feature: https://www.hashicorp.com/resources/preview-of-nomad-0-11-task-dependencies

For more complex dependencies as @recursionbane mentioned, we are targeting an integration with Apache Airflow to support such functionality.

eigengrau commented 4 years ago

That’s great news. @jazzyfresh I have a question related to this issue: I presume that if we wanted to have a database server up and running before the main task, we would declare it as a prestart sidecar task in Nomad v0.11. Does the new lifecycle-hook mechanism observe the Consul health check of the database service before moving on to the main lifecycle phase? Or would we need to leverage Apache Airflow for this?

DhashS commented 4 years ago

@yishan-lin that's awesome! Prestart and Poststop hooks are definitely not just a nice-to-have, and I'm super happy that you added them.

However, I don't think that those hooks count as "task dependencies". Consider a group with 5 containers: one that needs to run before the others (prestart), one that needs to run after them (poststop), and three that need to be brought up in sequence. Prestart and poststop partition the scheduling space into 3 chunks, not N chunks like a true "task dependencies" addition would.

An example of this is how we bring up ZK/Kafka in our software (we run them on Nomad with host volumes). We have to submit two different jobs since there's no way to have "generic" task dependencies, so we're forced to wait until ZK's health check comes back before submitting the Kafka job. True task dependencies would allow us to coalesce them into one job.
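To make the "N chunks" point concrete, here is a small Python sketch that derives a start order from arbitrary task dependencies using the standard library's `graphlib`. The task names and both helper functions are hypothetical; this shows what generic ordering would compute, not how Nomad schedules tasks.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must be up before it starts.
# Task names are hypothetical.
starts_after = {
    "zookeeper": {"setup"},
    "kafka": {"zookeeper"},
    "app": {"kafka"},
}


def start_order(graph):
    """Return one valid start order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


def start_waves(graph):
    """Group tasks into waves; tasks in the same wave can start together."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the chain above, `start_waves(starts_after)` yields four single-task waves (`setup`, `zookeeper`, `kafka`, `app`), one more than a single prestart step followed by one main phase can express.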

yishan-lin commented 4 years ago

Hey Dhash - you and I synced on this offline, but I'm recapping it here for everyone's visibility. The 5-container group example you mentioned is the kind of DAG functionality that I'd expect our Apache Airflow integration to cover, which is on our roadmap and coming soon!

DhashS commented 4 years ago

We have worked around our use case well by using consul_service_health and nomad_job in Terraform.

We now use Terraform to submit all our Nomad jobs. The wait_for parameter of consul_service_health keeps the data dependency for the next Nomad job unfulfilled until all checks are passing.

evandam commented 3 years ago

Hey @yishan-lin, I was just curious if there are any updates on the Airflow integration? We would love to see a Nomad executor!

retarpt commented 3 years ago

Hi, does anyone here have experience using Nomad for scheduling Airflow tasks (or vice-versa)? I am looking to constrain resources of individual tasks within an Airflow DAG by isolating them with cgroups and namespaces provided by Nomad's exec driver. Any help, resources, or advice would be so very much appreciated! Thank you, all.

Oloremo commented 3 years ago

Interested in that as well

ahmedwonolo commented 2 years ago

Any update on this?

tgross commented 2 years ago

Nomad lifecycle hooks have been available for a while now. There is still an open issue for cross-job dependencies (https://github.com/hashicorp/nomad/issues/545) that covers the other use cases described here. That's a bigger project and one we've had some discussions about, but it's not on our immediate roadmap.

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.