hongzimao / decima-sim

Learning Scheduling Algorithms for Data Processing Clusters
https://web.mit.edu/decima/

Question and answer regarding Decima paper #6

Open tegg89 opened 4 years ago

tegg89 commented 4 years ago

Question:

According to the appendix of the paper, you used supervised learning to train the graph neural networks as a sanity check. I presume the target (label) is the critical path, which is computed in the JobDAGDuration class in the job_dag.py file. However, when I run the training code, this class is never used.

• So, do the GNNs (GCN & GSN) follow an unsupervised learning scheme?
• In that sense, the GNNs act as preprocessing to capture local/global summaries of the loaded jobs, and I believe the code runs on a fixed number of input jobs. Is there any way to handle a varying number of incoming jobs?

====================================================================== Hongzi's answer:

The appendix experiment is just to make sure the GNN architecture at least has the power to express existing heuristics that use the critical path. In the main paper, the Decima scheduling agent is trained end-to-end with reinforcement learning. This includes the weights of the GNN (the entire neural network in Figure 6 is trained together). Therefore, as expected, the main training code does not invoke the critical path module during training.
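To make the "trained together" point concrete, here is a minimal sketch (not the repository's actual code; `TinyGNNPolicy` and its layers are hypothetical): the GNN encoder and the scheduling head live in one module and share one optimizer, so a REINFORCE-style policy-gradient loss updates the GNN weights without any supervised critical-path label.

```python
# Minimal sketch: GNN encoder + scheduling head trained jointly by policy gradient.
import torch
import torch.nn as nn

class TinyGNNPolicy(nn.Module):
    def __init__(self, node_feat_dim=5, hidden_dim=16):
        super().__init__()
        self.embed = nn.Linear(node_feat_dim, hidden_dim)  # per-node embedding
        self.msg = nn.Linear(hidden_dim, hidden_dim)       # message function
        self.score = nn.Linear(hidden_dim, 1)              # per-node action score

    def forward(self, node_feats, adj):
        h = torch.relu(self.embed(node_feats))
        # one round of message passing from children to parents (adj: parent x child)
        h = h + torch.relu(self.msg(adj @ h))
        return self.score(h).squeeze(-1)                   # logits over nodes

policy = TinyGNNPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)       # one optimizer for GNN + head

# toy rollout: 4 nodes with random features and a small DAG, placeholder reward
node_feats = torch.randn(4, 5)
adj = torch.tensor([[0, 1, 1, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=torch.float32)
logits = policy(node_feats, adj)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()
reward = torch.tensor(1.0)                                 # placeholder return

loss = -dist.log_prob(action) * reward                     # REINFORCE loss
opt.zero_grad()
loss.backward()                                            # gradients flow into the GNN as well
opt.step()
```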

Also, Decima’s GNN handles a variable number of jobs by design. Note that the default training in our code uses streaming jobs (jobs keep arriving in the system) with the flag --num_stream_dags 200. Section 5.1 of our paper explains in detail why this design scales to arbitrary DAG shapes and sizes.
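A toy illustration of the scale-invariance point (an assumed simplification, not Decima's exact message-passing equations): because the parameters are applied per node and summaries are computed by aggregation, the same weights handle DAGs of any size.

```python
# Sketch: per-node weights plus aggregation impose no fixed input size.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 8))              # shared per-node weight, independent of DAG size

def summarize(dag_node_feats):
    """dag_node_feats: (num_nodes, 5) array; num_nodes can vary per job."""
    per_node = np.tanh(dag_node_feats @ W)   # per-node embeddings
    return per_node.sum(axis=0)              # job-level summary, same shape for any size

small_job = rng.standard_normal((3, 5))      # a 3-node DAG
large_job = rng.standard_normal((300, 5))    # a 300-node DAG
print(summarize(small_job).shape, summarize(large_job).shape)  # both (8,)
```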

hongzimao commented 4 years ago

Thanks for sharing!

tegg89 commented 4 years ago

Question:

  1. In what way did you add curriculum learning? How much of an impact does this methodology have on performance?
  2. How long does it take to train the Decima agent, and on what hardware?

====================================================================== Hongzi's answer:

The curriculum learning happens through the decay of "reset_prob”, via the parameter --reset_prob_decay. In our experiments, this saves a considerable amount of training time because we don’t have to train over long episodes at the beginning of the training phase. You might want to play with this parameter for your problem to find the fastest convergence that still reaches the same eventual performance.
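A hedged sketch of the curriculum idea (the flag names reset_prob and --reset_prob_decay come from the answer above, but the values and the exact decay rule here are illustrative): early in training the environment is reset with high probability, keeping episodes short; as reset_prob decays, episodes grow longer.

```python
# Illustrative curriculum loop: stochastic early termination with a decaying reset probability.
import random

reset_prob = 0.01        # per-step probability of resetting the episode (illustrative value)
reset_prob_decay = 0.002 # amount subtracted after each episode (illustrative rule and value)
max_steps = 5000         # hard cap so the sketch always terminates

for episode in range(3):
    step = 0
    done = False
    while not done:
        step += 1
        # ... run the scheduling agent for one environment step ...
        if random.random() < reset_prob or step >= max_steps:
            done = True                      # curriculum: episodes start short
    reset_prob = max(0.0, reset_prob - reset_prob_decay)
    print(f"episode {episode}: {step} steps, reset_prob now {reset_prob:.4f}")
```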

We didn’t do much optimization; the last time we ran the released code on CPU, it took about 5 days to converge.