Closed: zhanggh900921 closed this issue 6 years ago
Thanks for the effort of reproducing the results!
For your first figure, "...and the ENTROPY_WEIGHT=0.5": when you load a model for testing, there is no need to set an entropy weight. The entropy term only affects exploration during training. Therefore, you want the entropy weight to be large at the beginning of training and then decay it to a small value.
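For reference, a minimal sketch of what the entropy term does in the actor loss (Pensieve implements this in TensorFlow inside multi_agent.py; the names and the NumPy formulation here are illustrative, not the repo's actual code):

```python
import numpy as np

def actor_loss(action_probs, chosen_actions, advantages, entropy_weight):
    """Entropy-regularized policy-gradient loss (illustrative sketch).

    A large entropy_weight pushes the policy toward uniform action
    probabilities (more exploration); decaying it over training lets
    the policy commit to the actions it has learned are good.
    """
    eps = 1e-8
    # log-probability of the actions that were actually taken
    log_p = np.log(action_probs[np.arange(len(chosen_actions)), chosen_actions] + eps)
    # standard policy-gradient term, weighted by the advantage estimates
    pg_term = -np.mean(log_p * advantages)
    # mean policy entropy over the batch (higher = more exploratory)
    entropy = -np.mean(np.sum(action_probs * np.log(action_probs + eps), axis=1))
    # subtracting the entropy bonus rewards exploratory policies
    return pg_term - entropy_weight * entropy
```

Note that with a uniform policy the entropy bonus is maximal, so a large weight early on keeps the agent from collapsing onto one bitrate too soon.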
As for your optimization beyond our pre-trained model, I'm surprised that it doesn't beat the pre-trained model (sim_rl ~43 in the figure). The best performance I've heard others achieve is around 47 in linear QoE (which outperforms ours; I'm trying to reproduce that too). Can you let me know the exact steps you used for training? How many iterations did you run, how did you change the entropy weight, what set of traces did you use for training, etc.?
About the results we reported, the largest performance gain we observed was with QoE_hd. To reproduce that result, a sanity check you can do after training a model is to compare against Figure 3b: the agent should almost always alternate between those bitrate levels, not pick anything in between. If your agent achieves that, you should observe a similar performance gain as in the paper.
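That sanity check is easy to automate once you log the chosen bitrate level per chunk. A small sketch (the two level sets are placeholders, not values from the paper):

```python
def fraction_intermediate(bitrate_trace, expected_levels):
    """Fraction of chunks whose bitrate level falls OUTSIDE the set the
    QoE_hd-trained agent is expected to alternate between.

    A value close to 0 matches the Figure-3b behavior: the agent jumps
    between the expected levels and rarely picks anything in between.
    `expected_levels` is illustrative -- read it off your own Figure 3b.
    """
    expected = set(expected_levels)
    outside = sum(1 for level in bitrate_trace if level not in expected)
    return outside / float(len(bitrate_trace))
```

For example, if the agent is expected to alternate between levels 0 and 5, a trace like `[0, 5, 5, 0, 2, 5]` gives 1/6, flagging the single intermediate choice.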
Hope these help.
Dear Hongzi:
Thanks for your reply.
Training set: train_sim_traces
Validation set: test_sim_traces
Epochs 0-19999: ENTROPY_WEIGHT = 5
Epochs 20000-39999: ENTROPY_WEIGHT = 1
Epochs 40000-79999: ENTROPY_WEIGHT = 0.5
Epochs 80000-99999: ENTROPY_WEIGHT = 0.3
Epochs 100000-120000: ENTROPY_WEIGHT = 0.1
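The step-decay schedule above could be expressed as a small lookup function rather than manual restarts (a sketch of my settings, not code from the repo):

```python
def entropy_weight(epoch):
    """Step-decay schedule for ENTROPY_WEIGHT matching the epochs above."""
    # (upper boundary exclusive, weight) pairs, in increasing order
    schedule = [(20000, 5.0), (40000, 1.0), (80000, 0.5), (100000, 0.3)]
    for boundary, weight in schedule:
        if epoch < boundary:
            return weight
    return 0.1  # epochs 100000 and beyond
```

Calling this once per training epoch avoids having to stop and restart the program at each boundary.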
Steps for training:
Is there anything wrong with my training? Thanks a lot.
These settings look reasonable. How do you pick the "best validation performance"? (By the way, the test set shouldn't be used as the validation set, but it is okay for debugging purposes.)
One other caveat: the data in the Dropbox link is a subset ("sample" data) of what we used for full training. Giving the agent more diverse data helps it learn a better and more robust model. To generate more data, you can use the code in trace/. You might want to do this if your goal is to get the best model. We provide a minimal dataset mainly so that others can quickly reproduce the first-order results.
Nonetheless, for better understanding and debugging (I'm not sure whether anything was missed during your training), I would try QoE_hd https://github.com/hongzimao/pensieve/blob/master/sim/multi_agent.py#L260-L263 and do the sanity check against Figure 3b (mentioned above).
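For reference, the two reward variants have roughly the following shape. This is a sketch following the form of the linked lines in sim/multi_agent.py; the constants below are assumptions from memory, so verify them against the repo before relying on them:

```python
# Constants below are ASSUMED from memory of sim/multi_agent.py --
# check the linked lines in the repo for the authoritative values.
VIDEO_BIT_RATE = [300, 750, 1200, 1850, 2850, 4300]  # Kbps per level (assumed)
BITRATE_REWARD = [1, 2, 3, 12, 15, 20]               # HD reward map (assumed)
M_IN_K = 1000.0

def qoe_linear(level, last_level, rebuf_sec,
               rebuf_penalty=4.3, smooth_penalty=1.0):
    """Linear QoE: bitrate (Mbps) minus rebuffer and smoothness penalties."""
    return (VIDEO_BIT_RATE[level] / M_IN_K
            - rebuf_penalty * rebuf_sec
            - smooth_penalty
              * abs(VIDEO_BIT_RATE[level] - VIDEO_BIT_RATE[last_level]) / M_IN_K)

def qoe_hd(level, last_level, rebuf_sec, rebuf_penalty=8.0):
    """QoE_hd: a step reward map that strongly favors the HD levels,
    which is why a well-trained agent alternates between level clusters."""
    return (BITRATE_REWARD[level]
            - rebuf_penalty * rebuf_sec
            - abs(BITRATE_REWARD[level] - BITRATE_REWARD[last_level]))
```

The large jump in the HD reward map between the low and high levels is what makes the Figure-3b alternation behavior optimal.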
Hope that helps.
Dear Hongzi:
Thanks for your suggestions; I will try QoE_hd later.
I just fixed a bug for picking the model with the best validation performance.
My current method is:
When the value of ENTROPY_WEIGHT needs to be changed, I stop the program, load the previously trained model with the highest mean reward (using test_sim_traces for validation), then .....
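For concreteness, the "highest mean reward" selection step could be sketched as follows (names are illustrative; in practice the rewards come from running each saved checkpoint over the validation traces):

```python
def best_checkpoint(validation_rewards):
    """Pick the checkpoint with the highest mean validation reward.

    `validation_rewards` maps a checkpoint path to the list of
    per-trace rewards obtained by testing that checkpoint on the
    validation traces (hypothetical structure for illustration).
    """
    return max(validation_rewards,
               key=lambda ckpt: sum(validation_rewards[ckpt])
                                / len(validation_rewards[ckpt]))
```

The selected checkpoint is then restored before resuming training with the next ENTROPY_WEIGHT value.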
The updated performance is:
which is similar to the pre-trained model provided by you (i.e., a 6-7% improvement).
Does this method make sense? And is it the same as yours (except for the choice of validation set)?
Thanks
This makes sense. Thanks.
Dear Hongzi:
I ran many experiments based on Pensieve's source code, but I cannot get the performance reported in the SIGCOMM paper (a 12%-25% improvement over Robust-MPC).
Below is the result:
First, I used the pre-trained model (i.e., pretrain_linear_reward.ckpt) provided with the source code to run tests on two sets of traces (train_sim_traces and test_sim_traces), with ENTROPY_WEIGHT=0.5:
Fig. 1 / Fig. 2
We can see that Pensieve outperformed Robust-MPC by about 6-7%.
Second, I did the training myself. I fixed the bug mentioned in #20 and followed the ENTROPY_WEIGHT tuning strategy in #11. I also selected the model based on a validation set (part of the trace data provided with the source code) to avoid the fluctuation issue in #28. The QoE function is the linear one:
Fig. 3
We can see that the final test performance is similar to that in #11, but much worse than the performance in the SIGCOMM paper.
Did I do something wrong, or miss something important, that prevented me from getting the same result as described in the paper? Could you please help me with these questions? Any answer is highly appreciated.
Thanks a lot