zhangsj0608 opened this issue 3 years ago
OK, we need to debug this - it's been a while since I trained with the makespan reward. The reward calculation is here: https://github.com/hongzimao/decima-sim/blob/c010dd74ff4b7566bd0ac989c90a32cfbc630d84/spark_env/reward_calculator.py#L32-L34. You should check whether the reward at each action checks out with this calculation.
The learning curve you show is helpful - it shows the agent doesn't get any learning signal; the actor loss is essentially 0 (on the order of 1e-11). It's likely that the reward the agent gets is all 0 or constant. Somewhere the reward assignment to the actions is off.
I will try to squeeze in some time to run the code myself too - but could you run it and print out the reward to start debugging? Thanks!
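Something along these lines should be enough for the logging (just a sketch - `env`, `agent`, and the method names are placeholders for whatever your training loop already uses, not the exact API in this repo):

```python
# Rough sketch: roll out one episode and print every per-action reward,
# so we can see whether the makespan reward is all zeros / constant.
# "env" and "agent" stand in for whatever objects the training loop uses.
def log_episode_rewards(env, agent):
    rewards = []
    obs = env.reset()
    done = False
    while not done:
        action = agent.get_action(obs)        # hypothetical policy call
        obs, reward, done = env.step(action)  # assuming step() returns the reward
        rewards.append(reward)
        print('step {}: reward = {:.6f}'.format(len(rewards), reward))
    print('episode return:', sum(rewards))
    return rewards
```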
Hi there,
I have tried a couple of times with careful argument settings, but the problem persists. As I understand it, the reward (line 33 above) is obtained from the time interval between the last scheduling step and the current one. The long-term return is then the sum of these intervals, which equals the time of the final scheduling step - not the makespan of all jobs, since some jobs may still be running after that last scheduling decision. My guess is that this reward does not accurately reflect the makespan. So the question seems to be: what should a proper reward function that reflects the makespan metric look like?
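To make my point concrete, here is a toy example (the numbers are made up, and the per-step reward is just my reading of line 33, i.e. the interval between consecutive scheduling steps, negated):

```python
# Made-up scheduling times; the last job actually finishes at t = 100,
# well after the final scheduling decision at t = 60.
sched_times = [0, 10, 25, 60]
last_job_completion = 100

# Per-step reward as I understand line 33: minus the elapsed interval.
rewards = [-(t_cur - t_prev) for t_prev, t_cur in zip(sched_times, sched_times[1:])]

# The return telescopes to the time of the last *scheduling step* ...
print(sum(rewards))           # -60
# ... which is not the negative makespan (-100) when jobs outlive that step.
```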
For makespan, it only makes sense to run a fixed batch of jobs (i.e., no new job arrivals). In your settings, did you set --num_stream_* to 0 and only use --num_init_dags?
Hi Hongzi,
That might be the problem, as I was not aware of the effect of streaming jobs on the makespan. I actually kept the number of stream jobs at 200 for each episode. I will quickly change it and see the result.
Hi Hongzi,
I noticed your code supports a makespan-optimized policy by setting args.learn_obj to 'makespan'. However, when trained with the recommended small-scale setting (200 stream jobs on 8 agents) for 3000 episodes, the model doesn't seem to converge the way it normally does with the average-JCT objective. The following figures show the actor_loss and average_reward_per_second collected during training. The average_reward_per_second stays around -1, because the cumulative reward is just the negative of the total elapsed time - the same quantity it is divided by. Could you suggest a setting I may have missed that is needed for convergence?
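For reference, my back-of-the-envelope reading of why the curve sits at -1 (assuming the per-step reward is minus the elapsed time since the previous scheduling step):

```python
# If each step's reward is minus the elapsed time since the previous step,
# the episode return is minus the total elapsed time, so dividing by that
# same total always gives roughly -1, whatever the policy does.
step_intervals = [5.0, 12.0, 3.0, 20.0]             # made-up intervals
episode_return = sum(-dt for dt in step_intervals)  # -40.0
total_time = sum(step_intervals)                    # 40.0
print(episode_return / total_time)                  # -1.0
```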
Hi Zhang! It seems that you have set up the environment successfully. May I know the software versions (e.g., TensorFlow version, Python version) you used for setting up the environment? I tried it but found some libs are missing. Thanks in advance!
Hi there,
Building the environment requires nothing more than cloning the whole repository. For reference, my TensorFlow version is 1.13 and my Python version is 3.6.
Hi Hongzi,
Over the past few days, I retrained the model with the suggested settings, i.e., num_init_dags > 0 and num_stream_dags = 0. The exact command is as follows.
nohup python3 train.py --exec_cap 25 --num_init_dags 100 --learn_obj 'makespan' --num_stream_dags 0 --reset_prob 5e-7 --reset_prob_min 5e-8 --reset_prob_decay 4e-10 --diff_reward_enabled 1 --num_agents 4 --model_save_interval 100 --num_ep 3005 --model_folder ./models/batch_100_job_diff_reward_reset_5e-7_5e-8_makespan_ep3000/ > out.log 2>&1 &
However, the average reward collected by the agent is still -1 throughout training. I suspect the function (lines 33-34) used by the reward calculator just gives a constant signal over time. Any suggestions?
We may have to print the reward values and examine them. Start from the bare minimum: try num_init_dags = 1 and num_stream_dags = 0. Log all the reward values for the actions taken to finish this single job. Could you check whether the reward you get corresponds to this job's completion time? After checking this simple scenario, we can move on to two jobs, and then multiple jobs. Based on what you showed, there might be a bug in the current code for the makespan reward. Thanks!
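Roughly the check I have in mind (a sketch only - the names are placeholders, and depending on any reward scaling the return may be off by a constant factor):

```python
# Sanity check for num_init_dags = 1, num_stream_dags = 0:
# the negative return should match the completion time of the single job.
def check_single_job_makespan(env, agent):
    rewards = []
    obs = env.reset()
    done = False
    while not done:
        action = agent.get_action(obs)        # hypothetical policy call
        obs, reward, done = env.step(action)
        rewards.append(reward)

    # read the finish time however your setup exposes it, e.g. the env's wall clock
    completion_time = env.wall_time.curr_time
    print('negative return :', -sum(rewards))
    print('completion time :', completion_time)
```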
Thanks! May I know the hardware setup you have? I am trying the CPU version (unfortunately I don't have a qualified GPU), but it halts at a certain point.
Hi there, I have a question regarding the number of agents. What is the reason for having multiple agents, e.g., args.num_agents = 16 by default?
When the program halts, is there an error message?
Multiple agents are just for speeding up the training: parallel agents (threads on CPUs) generate experience concurrently. You can set args.num_agents based on the number of CPUs you have on your machine.
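For example, a trivial way to pick a value (just an illustration):

```python
import multiprocessing

# One experience-collecting agent per CPU core is a reasonable default;
# more than that mostly adds scheduling contention.
num_agents = multiprocessing.cpu_count()
print('--num_agents', num_agents)
```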
Hi Hongzi, thanks! There is no error message at all, only some warnings (related to some Python library functions) that don't seem too critical. Since I am running the CPU version, I suppose training just takes so long that it looks like the program has stopped. Could you share a bit about the training time you experienced?
You might find this useful regarding the training time: https://github.com/hongzimao/decima-sim/issues/21
Also, we provided a trained model if you find the training time too long: https://github.com/hongzimao/decima-sim/issues/12
@zhangsj0608 @Nannnnnn Hi, would you be willing to share the code you used to plot those figures? I need help plotting the figures the way they appear in the Decima paper; I haven't been able to generate any figures so far. Could you please share the plotting code? Thank you.