stanford-futuredata / gavel

Code for "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads", which appeared at OSDI 2020
MIT License

Questions about the simulation #238

Closed Rivendile closed 2 years ago

Rivendile commented 2 years ago

Hi, are there any constraints on the traces used for the simulation, e.g., on the arrival times or the number of steps? I used a randomly generated trace and got the following error:

Traceback (most recent call last):
  File "scripts/drivers/simulate_scheduler_with_trace.py", line 95, in <module>
    main(parser.parse_args())
  File "scripts/drivers/simulate_scheduler_with_trace.py", line 56, in main
    jobs_to_complete=jobs_to_complete)
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 1464, in simulate
    scheduled_jobs = self._schedule_jobs_on_workers()
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 870, in _schedule_jobs_on_workers
    self._update_priorities()
  File "/opt/tiger/gavel/scheduler/scheduler.py", line 2393, in _update_priorities
    time_since_last_reset = current_time - self._last_reset_time
TypeError: unsupported operand type(s) for -: 'NoneType' and 'float'
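
A note on reading the TypeError: Python lists the left operand's type first, so in `current_time - self._last_reset_time` it is `current_time` that was `None`, while `self._last_reset_time` was a float. The sketch below reproduces the message with plain values and shows one way to fail earlier with a clearer error. The helper names `elapsed_since_reset` and `reproduce_message` are hypothetical and are not part of Gavel.

```python
def elapsed_since_reset(current_time, last_reset_time):
    """Return current_time - last_reset_time, failing loudly on None.

    Hypothetical helper for illustration only; not Gavel's code.
    """
    if current_time is None:
        raise ValueError("current_time is None; no timestamp is available")
    if last_reset_time is None:
        raise ValueError("last_reset_time is None; no reset was recorded")
    return current_time - last_reset_time


def reproduce_message():
    """Trigger the same TypeError as in the traceback and return its text."""
    left = None   # stands in for current_time
    right = 1.0   # stands in for self._last_reset_time
    try:
        left - right
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce_message())
    print(elapsed_since_reset(5.0, 2.0))
```

A guard like this only makes the failure easier to read; it does not explain why the time value was missing in the first place.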

Also, does the scale factor mean the number of servers used? What if there are 8 GPUs on one server and the job requires 1, 2, or 4 GPUs?

Rivendile commented 2 years ago

I used the fifo policy with ideal=False, and found that the priorities of the jobs change to 0.0 after a while.

deepakn94 commented 2 years ago

Hello, thank you for your question! Can you please paste the full command line you used that produced this error?

Rivendile commented 2 years ago

Hi, the command line I used is

python3 -u scripts/drivers/simulate_scheduler_with_trace.py -t trace.trace -p fifo --seed 0 -c 8:0:0 --num_gpus_per_server 8:1:1

The trace.trace is as follows:

Transformer (batch size 16) test    ../workloads    --iters0    1   921 1   1   -1.000000   0
Transformer (batch size 16) test    ../workloads    --iters0    1   153 8   1   -1.000000   30
ResNet-18 (batch size 32)   test    ../workloads    --iters0    1   651 2   1   -1.000000   53
ResNet-50 (batch size 128)  test    ../workloads    --iters0    1   1110    1   1   -1.000000   79
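
A quick sanity check for a trace like the one above could look like the following sketch. It assumes, without confirmation from this thread, that the last five whitespace-separated columns of each line are total steps, scale factor, priority weight, SLO, and arrival time, in that order. The names `parse_trace_tail` and `check_trace` are hypothetical and are not Gavel functions.

```python
from typing import Dict, List


def parse_trace_tail(line: str) -> Dict[str, float]:
    """Read the last five columns of one trace line.

    Assumed column order (not confirmed by the thread):
    total_steps, scale_factor, priority_weight, SLO, arrival_time.
    """
    tokens = line.split()
    if len(tokens) < 5:
        raise ValueError("trace line has fewer than 5 columns: %r" % line)
    steps, scale, weight, slo, arrival = tokens[-5:]
    return {
        "total_steps": int(steps),
        "scale_factor": int(scale),
        "priority_weight": float(weight),
        "slo": float(slo),
        "arrival_time": float(arrival),
    }


def check_trace(lines: List[str], total_gpus: int) -> List[Dict[str, float]]:
    """Parse all lines and apply a few basic consistency checks."""
    jobs = [parse_trace_tail(line) for line in lines if line.strip()]
    previous_arrival = 0.0
    for job in jobs:
        if job["total_steps"] <= 0:
            raise ValueError("total_steps must be positive")
        if not 1 <= job["scale_factor"] <= total_gpus:
            raise ValueError("scale_factor does not fit in the cluster")
        if job["arrival_time"] < previous_arrival:
            raise ValueError("arrival times must be non-decreasing")
        previous_arrival = job["arrival_time"]
    return jobs


TRACE = [
    "Transformer (batch size 16) test    ../workloads    --iters0    1   921 1   1   -1.000000   0",
    "Transformer (batch size 16) test    ../workloads    --iters0    1   153 8   1   -1.000000   30",
    "ResNet-18 (batch size 32)   test    ../workloads    --iters0    1   651 2   1   -1.000000   53",
    "ResNet-50 (batch size 128)  test    ../workloads    --iters0    1   1110    1   1   -1.000000   79",
]

if __name__ == "__main__":
    for job in check_trace(TRACE, total_gpus=8):
        print(job)
```

Under these assumptions the four-job trace passes all three checks with an 8-GPU cluster, which is consistent with the later observation in this thread that nothing in the trace looks obviously wrong.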

I tried to trace the error and found that the priorities change to 0.0 after a while. Jobs with priority 0.0 are not considered when scheduling. Any suggestions for fixing this?

Additionally, I changed the elif at https://github.com/stanford-futuredata/gavel/blob/40a22a725f2e70478483e98c9b07c6fc588e0c40/scheduler/scheduler.py#L1269 to an if; otherwise the simulation would loop forever.

deepakn94 commented 2 years ago

Ok, I'll look into it.

Have you tried one of the traces we already have? (such as https://github.com/stanford-futuredata/gavel/blob/master/scheduler/traces/physical_cluster/debug.trace)

Rivendile commented 2 years ago

Thanks for your reply. I tried the traces you mentioned and they ran without errors. Is there anything I should pay attention to when generating a trace?

deepakn94 commented 2 years ago

How did you generate this trace? What are you trying to do with Gavel?

Rivendile commented 2 years ago

The 4-job trace is a minimal version that still reproduces the error, cut down from a trace generated from Microsoft's Philly trace. We are doing research on cluster schedulers, and we'd like to compare Gavel with our work.

deepakn94 commented 2 years ago

Ok, I see. Makes sense. I will look a bit more into your trace. Nothing looks obviously wrong, but there must be something subtle that is off.

Will get back to you by the end of the weekend!

Rivendile commented 2 years ago

Thanks! Hope to hear from you soon.

deepakn94 commented 2 years ago

Hi @Rivendile, I pushed a commit with a few fixes. Your trace should work now on master; let me know if you run into any difficulties.

Rivendile commented 2 years ago

Thanks for your reply. I will try it.

deepakn94 commented 2 years ago

Going to mark this as closed. Feel free to re-open if you are still running into issues!