hongzimao / deeprm

Resource Management with Deep Reinforcement Learning (HotNets '16)
MIT License

Multiple Questions on Paper/Development #8

Open ioannoa88 opened 5 years ago

ioannoa88 commented 5 years ago

Dear Hongzi,

Following your suggestion, I have opened this issue to share the questions and answers we exchanged via email. Further to the information provided below, I wanted to clarify whether running the supervised part of the experiment is optional. The way I see it, it is probably better to provide the agent with some kind of heuristic policy to "kick off" its learning process. Following this assumption, I can see that you used a specific .pkl file generated during the supervised phase to feed into the reinforcement learning phase. How did you select it? Did you compare the accuracy and error on the training vs. test sets and use the one with the smallest difference? Similarly, did you choose a specific .pkl file generated during the reinforcement learning phase, i.e. the one at iteration 1600, because the algorithm has already converged after about 1000 iterations? Last, I wanted to clarify whether using a larger working space is expected to increase the complexity of the algorithm, and whether, once the backlog is full, any further incoming jobs are simply rejected.
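
For concreteness, this is roughly how I imagine comparing the saved .pkl checkpoints on a held-out workload; the `evaluate` function here is a placeholder I made up to illustrate the question, not something provided by the repo:

```python
import glob

import numpy as np

def pick_best_checkpoint(pkl_pattern, evaluate):
    """Return the checkpoint whose policy scores best on a held-out workload.

    `evaluate` is a caller-supplied function mapping a .pkl path to an average
    return (e.g., negative mean slowdown over some validation job sequences);
    it is a hypothetical placeholder, not part of the deeprm code.
    """
    best_path, best_score = None, -np.inf
    for path in sorted(glob.glob(pkl_pattern)):
        score = evaluate(path)
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score

# e.g. pick_best_checkpoint('*.pkl', evaluate=my_validation_eval)
```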

I apologize for all the questions ;) I hope they are helpful to others too. Thank you so much!

Kind regards, A.

Suggested basic RL reading: https://docs.google.com/document/d/1H8lDmHlj5_BHwaQeGSXfyjwf4ball9f1VutNBXCOsJE/edit?usp=sharing

Q1: What is the difference between the 1st and the 2nd type of training, i.e. --exp_type=pg_su vs. pg_re? As far as I understand, the first is used to create num_ex examples, each consisting of a number of jobs that arrive within a given timeframe (episode_max_length), scheduled using the SJF heuristic. The results are then fed into the DeepRM algorithm to adjust the weights/parameters of the network. The second is used to train the RL algorithm starting from the DNN weights learned above and the rewards/penalties defined.

Answer: You are basically correct. The first type is supervised learning, where we generate state-action pairs from an existing heuristic (e.g., SJF) and ask the agent to mimic that policy. The second type is RL training: the agent explores, sees which policy is better, and automatically adjusts its policy parameters to get larger rewards.
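
A minimal sketch of the two training modes described above, using a toy linear softmax policy rather than the repo's actual Theano network (sizes and learning rates are placeholders):

```python
import numpy as np

rng = np.random.RandomState(0)
STATE_DIM, NUM_ACTIONS = 8, 4          # toy sizes, not the paper's dimensions
theta = 0.01 * rng.randn(STATE_DIM, NUM_ACTIONS)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def policy(state, theta):
    """Action probabilities pi(a | s; theta) for a linear softmax policy."""
    return softmax(state @ theta)

# --- Mode 1 ("pg_su"): supervised imitation of a heuristic such as SJF. ---
def supervised_step(state, heuristic_action, theta, lr=0.1):
    """One cross-entropy gradient step toward the heuristic's chosen action."""
    probs = policy(state, theta)
    grad = np.outer(state, probs)        # gradient of -log pi w.r.t. theta ...
    grad[:, heuristic_action] -= state   # ... minus the one-hot target term
    return theta - lr * grad

# --- Mode 2 ("pg_re"): REINFORCE on the agent's own trajectories. ---
def reinforce_step(states, actions, returns, theta, lr=0.01):
    """One policy-gradient step: raise log-prob of actions weighted by return."""
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        probs = policy(s, theta)
        g = np.outer(s, probs)
        g[:, a] -= s
        grad += G * g                    # G should be a baseline-subtracted return
    return theta - lr * grad / len(states)
```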

Q2: How did you decide what type of network to use? You have used a DNN with one dense, hidden layer of 20 neurons. Were there any particular reasons for these choices? Have you tried different variations of them?

Answer: We did some parameter search, but not too much. As long as the model is rich enough to express strong scheduling policies (e.g., it can learn existing heuristics via supervised learning), we use that network model for RL.
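
A sketch of the architecture being discussed, written in plain NumPy rather than the repo's Theano code (the input/output dimensions and the ReLU nonlinearity are placeholders; only the single hidden layer of 20 units follows the question above):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

class TinyPolicyNet:
    """Flattened state image -> one dense hidden layer of 20 units -> softmax."""
    def __init__(self, input_dim, num_actions, hidden=20, seed=0):
        rng = np.random.RandomState(seed)
        self.W1 = 0.01 * rng.randn(input_dim, hidden)
        self.b1 = np.zeros(hidden)
        self.W2 = 0.01 * rng.randn(hidden, num_actions)
        self.b2 = np.zeros(num_actions)

    def forward(self, state_image):
        """state_image: the flattened cluster / job-slot / backlog image."""
        h = relu(state_image.ravel() @ self.W1 + self.b1)
        return softmax(h @ self.W2 + self.b2)
```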

Q3: Was there any problem with overfitting the data and if so, would you have any further suggestions on this issue?

Answer: If the system dynamics change dramatically, there will be overfitting. In our paper, we evaluate on different job combinations, but those jobs were generated from the same distribution. You might need to adapt (e.g., from a meta-learned policy) or learn a family of robust policies if you need the policy to work well under distribution shift.
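
For example, one way to probe for this kind of overfitting is to evaluate the trained policy on a workload whose job-length mix is shifted from the training one; the sampler below is a toy illustration, not the repo's job generator:

```python
import numpy as np

rng = np.random.RandomState(42)

def sample_job_lengths(n, long_job_prob):
    """Toy bimodal workload: mostly short jobs, occasionally long ones."""
    is_long = rng.rand(n) < long_job_prob
    return np.where(is_long,
                    rng.randint(10, 16, size=n),   # long jobs
                    rng.randint(1, 4, size=n))     # short jobs

train_workload = sample_job_lengths(1000, long_job_prob=0.2)    # training mix
shifted_workload = sample_job_lengths(1000, long_job_prob=0.6)  # shifted mix

# A policy that generalizes should degrade gracefully on shifted_workload;
# a sharp jump in average slowdown suggests it overfit the training mix.
```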

Q4: I would expect the number of input neurons to be equal to (res_slot + max_job_slot * num_nw) * num_res. However, you also take into account the backlog_width and a bias. Could you please explain why you made that decision and what its purpose is? Also, what does backlog_width represent? I understand that the backlog stores jobs that have arrived for service but cannot fit in the current working space, but I do not understand whether this is just a number, why it is important to include it as an input to the DNN, and why not store the extra jobs in, e.g., a file for later use.

Answer: The backlogged jobs are represented only as a count. The DNN needs a rough number in order to know the current system load, so we provide just that single number to the neural network. The full job information is kept in the environment (it's just that the agent doesn't see it).
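
As a rough sketch of how those pieces add up to the input width (the exact expression in the repo's parameters.py may differ; the values below are illustrative, and the +1 term stands for the extra scalar/bias column mentioned in the question):

```python
import math

# Toy values, not necessarily the defaults used in the paper or the repo.
num_res = 2          # number of resource types
res_slot = 10        # width of the cluster image for each resource
max_job_size = 10    # width of one visible job slot's resource request
num_nw = 5           # number of visible job slots
backlog_size = 60    # how many extra jobs the backlog can hold
time_horizon = 20    # rows (time steps) of the state image

# Jobs beyond the visible slots are summarized as a count, drawn as a thin
# column of width backlog_width in the state image.
backlog_width = int(math.ceil(backlog_size / float(time_horizon)))

# Cluster image + num_nw job-slot images, repeated per resource,
# plus the backlog column, plus one extra column of scalar info.
input_width = (res_slot + max_job_size * num_nw) * num_res + backlog_width + 1
input_height = time_horizon
print(input_width, input_height)   # flattened input size = width * height
```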

hongzimao commented 4 years ago

Thank you for sharing!