hongzimao / pensieve

Neural Adaptive Video Streaming with Pensieve (SIGCOMM '17)
http://web.mit.edu/pensieve/
MIT License

Some questions on critic gradient #84

Open linnaeushuang opened 5 years ago

linnaeushuang commented 5 years ago

Dear Hongzi,

sorry to bother you, but I ran into a few problems with the critic gradient while reproducing Pensieve in PyTorch.

In /sim/a3c.py you use the mean squared error of R_batch and the critic network's output (the value function, a3c.py line 243). But R_batch is the cumulative reward over a particular episode. According to the original paper, Pensieve should use the mean squared error of r + \gamma * V(s_{t+1}) and V(s_t) (Equation 3 in the paper).

Is this a typo or something else?

I wasn't sure how much difference this makes, so I implemented three models to verify it:

  1. model 0: a straight PyTorch reproduction of Pensieve with no logical changes; it uses R_batch to update the critic network
  2. model 1: follows Equation 3, treating s_batch[:-1] as the states and s_batch[1:] as the next states (is that correct?)
  3. model 2: no critic network, only the actor network

I found that even without the critic network I get similar results. Is this expected, and why?

See pensieve-pytorch for details.
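
For concreteness, here is roughly how I compute the two critic targets (a minimal sketch, not the actual code from /sim/a3c.py or my repo; `critic`, `s_batch`, `r_batch`, and the tensor shapes are simplified placeholders):

```python
import torch

GAMMA = 0.99  # discount factor, same role as in the original code

def mc_critic_loss(critic, s_batch, r_batch):
    """Model 0: Monte Carlo target, i.e. the discounted cumulative
    reward of the episode (what I call R_batch)."""
    R_batch = torch.zeros_like(r_batch)
    R = 0.0
    for t in reversed(range(len(r_batch))):  # accumulate returns backwards
        R = r_batch[t] + GAMMA * R
        R_batch[t] = R
    v = critic(s_batch).squeeze(-1)          # V(s_t) for every step
    return torch.mean((R_batch - v) ** 2)

def td0_critic_loss(critic, s_batch, r_batch):
    """Model 1: TD(0) target r_t + gamma * V(s_{t+1}), treating
    s_batch[:-1] as states and s_batch[1:] as next states."""
    v = critic(s_batch).squeeze(-1)
    target = r_batch[:-1] + GAMMA * v[1:].detach()  # bootstrap, no gradient
    return torch.mean((target - v[:-1]) ** 2)
```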

hongzimao commented 5 years ago

Thanks for the effort of reproducing Pensieve in PyTorch! What you pointed out are two different ways of training the critic network. Using r + \gamma * V(s_{t+1}) - V(s_t) as the loss signal is the TD(0) update. What we implemented in the code is the Monte Carlo update (or TD(1)). The Monte Carlo approach is unbiased, while TD(0) gives the smallest variance.

When you say you got similar results for the three approaches, how "similar" are they? You can systematically look at the prediction error of the value function and should be able to see the difference. The reason the three approaches come out close to each other might be that the horizon length is not too large (so the bias-variance trade-off is not apparent).
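
For example (a rough sketch, not the exact evaluation code; it assumes you log full trajectories of states and rewards, e.g. on a held-out set of traces), the prediction error can be measured against the empirical discounted return:

```python
import torch

GAMMA = 0.99

def value_prediction_error(critic, s_batch, r_batch):
    """Mean squared error between the critic's V(s_t) and the
    empirical discounted return from step t onward."""
    returns = torch.zeros_like(r_batch)
    acc = 0.0
    for t in reversed(range(len(r_batch))):  # discounted return from t
        acc = r_batch[t] + GAMMA * acc
        returns[t] = acc
    with torch.no_grad():
        v = critic(s_batch).squeeze(-1)      # critic's value prediction
    return torch.mean((v - returns) ** 2).item()
```

Computing this for each of your three trained models should make the difference between the critic variants visible.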

If you are interested, the variance in the value prediction is due to an external process (i.e., the variance of the network bandwidth time series). For long videos (and thus long time horizons), handling the variance properly actually matters a lot. Please see this paper for more mathematical details: https://people.csail.mit.edu/hongzi/var-website/index.html. We also evaluated in a more realistic setting in this paper: https://openreview.net/forum?id=SJlCkwN8iV

Hope these help!