stanford-futuredata / gavel

Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
MIT License

Question: Understanding structure of throughputs.json files #240

Closed nayakajay closed 2 years ago

nayakajay commented 2 years ago

I wanted to understand the structure of the xxx-throughputs.json files present in the repository. For example, simulation_throughputs.json contains:

"('ResNet-18 (batch size 16)', 1)": {
            "null": 4.795294551566172,
            "('ResNet-18 (batch size 32)', 1)": [
                2.539979567443098,
                3.1201925448827033
            ]
  1. What are the two values in the array under the key "('ResNet-18 (batch size 32)', 1)"?
  2. What does the "null" key represent?

It would be great if you could also provide details on how you collected or generated these files, so that the process can be reproduced for a GPU not present in the repository (say, Turing).

deepakn94 commented 2 years ago

Hi @nayakajay, thanks for the question!

The key of the outer dictionary is the model (along with its batch size) and the number of GPUs. The key of the inner dictionary is the second model in the case of co-location; a "null" key means the job runs in isolation, so here ResNet-18 with a batch size of 16 reaches about 4.80 iterations/second on its own. Co-located entries also carry a model name and a number of GPUs (Gavel assumes that models can only be co-located with models using the same number of GPUs).
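As a rough illustration of the layout described above, the sketch below parses a fragment shaped like the example in this thread. The helper name parse_job_key and the variable names are hypothetical, not part of Gavel, and the real files contain more entries and nesting than shown here.

```python
import ast
import json

# Fragment shaped like the example quoted in this thread. This is an
# illustration only; the real files hold many more entries.
raw = """
{
  "('ResNet-18 (batch size 16)', 1)": {
    "null": 4.795294551566172,
    "('ResNet-18 (batch size 32)', 1)": [2.539979567443098, 3.1201925448827033]
  }
}
"""


def parse_job_key(key):
    """Hypothetical helper: turn "('model', n)" into a (model, n) tuple.

    The string "null" marks the isolated (no co-location) case and maps to None.
    """
    if key == "null":
        return None
    model, num_gpus = ast.literal_eval(key)
    return model, num_gpus


table = {}
for outer_key, inner in json.loads(raw).items():
    job = parse_job_key(outer_key)
    table[job] = {parse_job_key(k): v for k, v in inner.items()}

job = ("ResNet-18 (batch size 16)", 1)
isolated = table[job][None]  # throughput in isolation, iterations/second
print(f"{job} alone: {isolated:.2f} it/s")
```

Keys are stringified Python tuples, so ast.literal_eval recovers the model name and GPU count without any custom string splitting.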

You can collect these files by benchmarking a couple hundred iterations of each desired model and measuring the average time per iteration, from which you can compute the throughput in iterations/second.
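A minimal sketch of that measurement, assuming a callable step_fn that runs one training iteration and returns only once the work is done. The function name measure_throughput, the warm-up phase, and the default iteration counts are assumptions for illustration, not Gavel's own benchmarking script.

```python
import time


def measure_throughput(step_fn, num_iterations=200, warmup=10):
    """Return throughput in iterations/second for a blocking step function.

    For GPU training, step_fn must wait for the device to finish (for
    example by calling torch.cuda.synchronize() at the end of the step),
    otherwise only the kernel launch time is measured.
    """
    # Warm-up iterations are an assumption here; they keep one-time
    # setup costs out of the average.
    for _ in range(warmup):
        step_fn()

    start = time.perf_counter()
    for _ in range(num_iterations):
        step_fn()
    elapsed = time.perf_counter() - start

    avg_time_per_iteration = elapsed / num_iterations
    return 1.0 / avg_time_per_iteration


def dummy_step():
    # Stand-in for one training iteration, so the sketch runs without a GPU.
    sum(i * i for i in range(10_000))


throughput = measure_throughput(dummy_step)
print(f"{throughput:.2f} iterations/second")
```

For a co-located entry, the same measurement would be taken for both jobs while they share the GPU, giving the two-element array.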

Let me know if you have any other questions!

nayakajay commented 2 years ago

Thanks @deepakn94 for the response. To confirm: in the example, is 2.54 the throughput of ('ResNet-18 (batch size 32)', 1) when co-located with ('ResNet-18 (batch size 16)', 1), and 3.12 the throughput of ('ResNet-18 (batch size 16)', 1) when co-located with ('ResNet-18 (batch size 32)', 1)? Or is it the other way around?

deepakn94 commented 2 years ago

The other way around: 2.54 is the throughput of (ResNet-18 (batch size 16), 1) and 3.12 is the throughput of (ResNet-18 (batch size 32), 1).
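Put in code, the ordering confirmed above reads as follows: the first array element belongs to the outer-key job and the second to the inner-key job. The variable names and the slowdown ratio are illustrative assumptions, not something computed by Gavel in this thread.

```python
outer_key = "('ResNet-18 (batch size 16)', 1)"
inner_key = "('ResNet-18 (batch size 32)', 1)"

# Inner dictionary for outer_key, copied from the example in this thread.
entry = {
    "null": 4.795294551566172,
    inner_key: [2.539979567443098, 3.1201925448827033],
}

# First value: throughput of the outer-key job while co-located.
# Second value: throughput of the inner-key job while co-located.
outer_tput, inner_tput = entry[inner_key]

# Fraction of its isolated throughput that the outer job keeps when sharing.
retained = outer_tput / entry["null"]

print(f"{outer_key}: {outer_tput:.2f} it/s co-located")
print(f"{inner_key}: {inner_tput:.2f} it/s co-located")
print(f"outer job retains {retained:.0%} of its isolated throughput")
```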

nayakajay commented 2 years ago

Thanks @deepakn94. Closing this issue now.