linnaeushuang opened this issue 5 years ago
Thanks for the effort of reproducing Pensieve in PyTorch! What you pointed out are two different ways of training the critic network. Using r + \gamma * V(s_{t+1}) - V(s_t) as the loss signal is a TD(0) update. What we implemented in the code is the Monte Carlo update (or TD(1)). The Monte Carlo approach is unbiased, while TD(0) gives the smallest variance.
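To make the distinction concrete, here is a minimal sketch of how the two regression targets for the critic differ; the function names and the assumption of a terminal value of 0 are mine, not from either codebase:

```python
import numpy as np

def mc_returns(rewards, gamma=0.99):
    """Monte Carlo (TD(1)) targets: the full discounted return from each step.
    Unbiased, but the variance grows with the horizon length."""
    R = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        returns[t] = R
    return returns

def td0_targets(rewards, values, gamma=0.99):
    """TD(0) targets: r_t + gamma * V(s_{t+1}).
    Low variance, but biased because it bootstraps off the current
    value estimates. Assumes the episode terminates with value 0."""
    next_values = np.append(values[1:], 0.0)
    return np.asarray(rewards) + gamma * next_values
```

On a short horizon the two targets stay close, which is consistent with the three approaches giving similar results when the horizon is small.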
When you say you got similar results for the three approaches, how "similar" are they? You can systematically look at the prediction error of the value function, and you should be able to see the difference. The reason the three approaches are close to each other might be that the horizon length is not too large (so the bias-variance trade-off is not apparent).
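One way to do that comparison systematically is to measure the mean squared error between the critic's predictions and the empirical discounted returns; this helper is only a sketch of that idea, not code from either repo:

```python
import numpy as np

def value_prediction_error(values, rewards, gamma=0.99):
    """MSE between critic predictions V(s_t) and the empirical
    discounted returns computed from the observed rewards."""
    R = 0.0
    returns = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        returns[t] = R
    return float(np.mean((np.asarray(values) - returns) ** 2))
```

Plotting this error over training for each of the three variants (and for different horizon lengths) should make the bias-variance trade-off visible.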
If you are interested: the variance in the value prediction is due to an external process (i.e., the variance of the network-bandwidth time series). For long videos (and thus long time horizons), dealing with this variance properly actually matters a lot. Please see this paper for the mathematical details: https://people.csail.mit.edu/hongzi/var-website/index.html. We also evaluated in a more realistic setting in this paper: https://openreview.net/forum?id=SJlCkwN8iV
Hope this helps!
Dear Hongzi,
sorry to bother you, but I ran into a few problems with the critic gradient while reproducing Pensieve in PyTorch.
In /sim/a3c.py, you use the mean squared error between R_batch and the critic network's output (the value function, a3c.py line 243). But R_batch is the cumulative reward of a particular episode. According to the original paper, Pensieve should use the mean squared error between r + \gamma * V(s_{t+1}) and V(s_t) (paper, Equation 3). Is this a typo, or something else?
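For reference, the two candidate critic losses can be written in PyTorch roughly as follows; this is my own sketch of the question being asked, not code from a3c.py or pensieve-pytorch:

```python
import torch
import torch.nn.functional as F

def critic_loss_mc(values, R_batch):
    # Monte Carlo: regress V(s_t) directly onto the empirical
    # cumulative return R_t (what a3c.py appears to do).
    return F.mse_loss(values, R_batch)

def critic_loss_td0(values, rewards, gamma=0.99):
    # TD(0): regress V(s_t) onto r_t + gamma * V(s_{t+1})
    # (the paper's Equation 3). The bootstrap target is detached
    # so gradients flow only through V(s_t), not the target.
    targets = rewards[:-1] + gamma * values[1:].detach()
    return F.mse_loss(values[:-1], targets)
```

The two losses coincide in expectation only when the value estimates are already accurate, which is part of why the question of which one the code uses matters.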
I had no idea how much difference this makes, so I implemented 3 models to verify it:
I found that even without the critic network, I get similar results. Is this expected behavior? Why?
See pensieve-pytorch for details.