Kill unresponsive jobs at the end of the round

stanford-futuredata / gavel

Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020

MIT License


Kill unresponsive jobs at the end of the round #222

Closed. santhnm2 closed this 4 years ago

deepakn94 commented 4 years ago

Code seems good -- I presume we want the test to finish?

santhnm2 commented 4 years ago

> Code seems good -- I presume we want the test to finish?

Yeah, ideally any distributed initialization error will just be a one-off issue, and then we can simply try scheduling the job again. If a job fails completely, however, that might invalidate the run :/
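For illustration, the retry-then-give-up behavior described above could be sketched roughly as follows. This is not Gavel's actual implementation: `Job`, `finish_round`, and `max_failures` are hypothetical names, and "killing" a job is reduced to bookkeeping so the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; not a Gavel class.
    job_id: int
    responded: bool = False  # did the worker report back this round?
    failures: int = 0        # rounds in which the job was killed so far


def finish_round(jobs, max_failures=3):
    """Sort jobs into (completed, requeued, abandoned) at the end of a round.

    Assumption: a job that did not respond is killed and rescheduled,
    unless it has already failed `max_failures` times.
    """
    completed, requeued, abandoned = [], [], []
    for job in jobs:
        if job.responded:
            completed.append(job.job_id)
            continue
        # Unresponsive job: record the failure (stands in for the kill).
        job.failures += 1
        if job.failures < max_failures:
            requeued.append(job.job_id)   # likely a one-off error, try again
        else:
            abandoned.append(job.job_id)  # repeated failure, give up
    return completed, requeued, abandoned


jobs = [Job(0, responded=True), Job(1), Job(2, failures=2)]
print(finish_round(jobs))  # prints ([0], [1], [2])
```

Capping the number of retries is one possible way to separate a one-off initialization error from a job that fails completely; whether an abandoned job should invalidate the whole run is left to the caller.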