Closed Rivendile closed 2 years ago
I used the fifo policy, ideal=False, and found that the priorities of jobs are changed to 0.0 after a while.
Hello, thank you for your question! Can you please paste the full command line you used that produced this error?
Hi, the command line I used is
python3 -u scripts/drivers/simulate_scheduler_with_trace.py -t trace.trace -p fifo --seed 0 -c 8:0:0 --num_gpus_per_server 8:1:1
The trace.trace is as follows:
Transformer (batch size 16) test ../workloads --iters0 1 921 1 1 -1.000000 0
Transformer (batch size 16) test ../workloads --iters0 1 153 8 1 -1.000000 30
ResNet-18 (batch size 32) test ../workloads --iters0 1 651 2 1 -1.000000 53
ResNet-50 (batch size 128) test ../workloads --iters0 1 1110 1 1 -1.000000 79
I tried to trace the error and found that the job priorities are changed to 0.0 after a while. Jobs with priority 0.0 are not considered when scheduling. Any suggestions for fixing this?
Additionally, I changed the elif at https://github.com/stanford-futuredata/gavel/blob/40a22a725f2e70478483e98c9b07c6fc588e0c40/scheduler/scheduler.py#L1269 to if; otherwise it would loop infinitely.
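For readers who want to reproduce this, here is a minimal sketch that parses the four trace rows quoted above and runs a few generic sanity checks on them. The column meanings (total steps, scale factor, priority weight, SLO, arrival time, read from the right) are an assumption inferred from the values and from the terms used in this thread, not taken from Gavel's documentation. `TraceJob`, `parse_trace_line`, and `check_trace` are hypothetical helpers, and the checks are not constraints that Gavel is known to enforce. The total of 8 GPUs assumes that `-c 8:0:0` describes 8 GPUs of the first worker type.

```python
from dataclasses import dataclass


@dataclass
class TraceJob:
    description: str        # job type, command, and arguments, kept verbatim
    total_steps: int        # assumed meaning of the 5th column from the right
    scale_factor: int       # assumed meaning of the 4th column from the right
    priority_weight: float  # assumed meaning of the 3rd column from the right
    slo: float              # assumed meaning of the 2nd column from the right
    arrival_time: float     # assumed meaning of the last column


def parse_trace_line(line: str) -> TraceJob:
    # Split from the right so the free-form description may contain spaces.
    head, steps, scale, weight, slo, arrival = line.rsplit(None, 5)
    return TraceJob(head, int(steps), int(scale),
                    float(weight), float(slo), float(arrival))


def check_trace(jobs, total_gpus):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    previous_arrival = 0.0
    for index, job in enumerate(jobs):
        if job.total_steps <= 0:
            problems.append(f"job {index}: non-positive step count")
        if job.scale_factor > total_gpus:
            problems.append(f"job {index}: needs more GPUs than the cluster has")
        if job.arrival_time < previous_arrival:
            problems.append(f"job {index}: arrival time goes backwards")
        previous_arrival = job.arrival_time
    return problems


TRACE = """\
Transformer (batch size 16) test ../workloads --iters0 1 921 1 1 -1.000000 0
Transformer (batch size 16) test ../workloads --iters0 1 153 8 1 -1.000000 30
ResNet-18 (batch size 32) test ../workloads --iters0 1 651 2 1 -1.000000 53
ResNet-50 (batch size 128) test ../workloads --iters0 1 1110 1 1 -1.000000 79
"""

jobs = [parse_trace_line(line) for line in TRACE.splitlines()]
problems = check_trace(jobs, total_gpus=8)
print(problems)
```

Under these assumptions the trace passes all three checks, so they do not explain the priorities dropping to 0.0.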
Ok, I'll look into it.
Have you tried one of the traces we already have? (such as https://github.com/stanford-futuredata/gavel/blob/master/scheduler/traces/physical_cluster/debug.trace)
Thanks for your reply. I tried the trace you mentioned and it ran without any bug. Is there anything I should watch out for when generating a trace?
How did you generate this trace? What are you trying to do with Gavel?
The 4-job trace is a minimal version that still reproduces the error, cut down from a trace generated according to the Philly trace from Microsoft. We are doing research on cluster schedulers, and we'd like to compare Gavel with our work.
Ok, I see. Makes sense. I will look a bit more into your trace. Nothing looks obviously wrong, but there must be something subtle that is off.
Will get back to you by the end of the weekend!
Thanks! Hope to hear from you soon.
Hi @Rivendile, I pushed a commit with a few fixes. Your trace should work now on master; let me know if you run into any difficulties.
Thanks for your reply. I will try it.
Going to mark this as closed. Feel free to re-open if you are still running into issues!
Hi, are there any constraints on the traces used for the simulation, e.g., on the arrival times and the number of steps? I used a randomly generated trace and got the following error information:
Also, does the scale factor mean the number of servers used? What if there are 8 GPUs on one server and the job requires 1, 2, or 4 GPUs?
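The thread does not answer the scale factor question. As a sketch of the arithmetic under one possible reading, namely that the scale factor counts GPUs (one worker per GPU) rather than servers, the number of servers a job would span can be computed as below. That reading is an assumption, and `servers_needed` is a hypothetical helper, not part of Gavel.

```python
import math


def servers_needed(scale_factor: int, gpus_per_server: int) -> int:
    """Minimum number of servers that can hold `scale_factor` GPUs.

    Assumes the scale factor counts GPUs, which this thread does not confirm.
    """
    if scale_factor <= 0 or gpus_per_server <= 0:
        raise ValueError("scale_factor and gpus_per_server must be positive")
    return math.ceil(scale_factor / gpus_per_server)


# With 8 GPUs per server, jobs asking for 1, 2, or 4 GPUs fit on one server.
for scale_factor in (1, 2, 4, 8, 16):
    print(scale_factor, servers_needed(scale_factor, gpus_per_server=8))
```

Under this reading, a job with scale factor 1, 2, or 4 would occupy only part of a single 8-GPU server.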