stanford-futuredata / gavel

Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
MIT License

Fixes for physical cluster experiments #168

Closed: santhnm2 closed this 4 years ago

deepakn94 commented 4 years ago

Do we want to merge this? If so, I think we should run a couple of simulation checks first, just to make sure things are still OK.

santhnm2 commented 4 years ago

I do want to merge this, but the resulting values will not be 100% reproducible from what we submitted (e.g. because the steps in distributed jobs are now rounded up to the nearest multiple of the scale factor, and because the recommendation throughputs were updated). There may also be some extraneous files. For example, I'm not sure what we want to do with oracle_throughputs_v3.json (these are the throughputs I ended up using for the final max_min_fairness and max_min_fairness_perf runs). I'll take a closer look myself sometime today or (more likely) tomorrow and come up with a list of potential issues, and then we can decide how to resolve them.
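
The step rounding mentioned above can be illustrated with a short sketch. This is not the code from this PR: the function name `round_up_to_multiple` and its parameters are hypothetical, and the only assumption is that a distributed job's step count is rounded up to the nearest multiple of its scale factor.

```python
def round_up_to_multiple(num_steps: int, scale_factor: int) -> int:
    """Round num_steps up to the nearest multiple of scale_factor.

    Hypothetical helper for illustration only; not taken from the Gavel code.
    """
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    # Ceiling division using integer arithmetic, then scale back up.
    return -(-num_steps // scale_factor) * scale_factor


if __name__ == "__main__":
    # Compare original and rounded step counts for a few scale factors.
    for steps, scale in [(1000, 4), (1001, 4), (7, 1), (1, 8)]:
        print(steps, scale, round_up_to_multiple(steps, scale))
```

Under this assumption, a job with 1001 steps and a scale factor of 4 would run 1004 steps, which is one way totals could drift slightly from the originally submitted numbers.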

deepakn94 commented 4 years ago

I think this looks okay, modulo the throughput file changes.

deepakn94 commented 4 years ago

Once we resolve what we're doing with the throughput files, I can cross-check with the runs we performed for OSDI.