Closed: hashbangCoder closed this issue 8 years ago
Hi @hashbangCoder, in the DDPG implementation we compute the on-policy Q value, which provides the gradient signal to improve the policy, but the sampling of trajectories is done off-policy. The DDPG class accepts a parameter called `es`, which is an abbreviation for exploration strategy. It is then used to generate an off-policy action on this line: https://github.com/rllab/rllab/blob/master/rllab/algos/ddpg.py#L224
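For concreteness, here is a minimal sketch of what that call does during sampling. This is an illustration, not the rllab source: the `get_action` signature and the gym-style `env.step` return value are assumptions.

```python
def rollout_step(env, policy, es, t, observation):
    """One environment step with off-policy exploration.

    `es.get_action` is assumed to query the deterministic policy for its
    action and then perturb it with exploration noise, mirroring the call
    at ddpg.py#L224 (exact interface approximated here).
    """
    # Behaviour action = deterministic policy output + exploration noise,
    # so the collected trajectories are off-policy.
    action = es.get_action(t, observation, policy=policy)
    next_observation, reward, terminal, _ = env.step(action)
    return action, next_observation, reward, terminal
```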
Currently we implement two exploration strategies: one is the Brownian-motion (Ornstein-Uhlenbeck) noise mentioned in the paper, and the other is Gaussian noise. You can find them here: https://github.com/rllab/rllab/tree/master/rllab/exploration_strategies
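As a rough, self-contained sketch of what those two strategies add to the deterministic action (a simplified stand-in, not the rllab implementation): the Ornstein-Uhlenbeck noise is temporally correlated across steps, while the Gaussian noise is drawn independently at every step.

```python
import numpy as np

class OUNoise:
    """Temporally correlated (Ornstein-Uhlenbeck / Brownian-motion-like) noise."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.3):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.ones(action_dim) * mu

    def reset(self):
        self.state = np.ones_like(self.state) * self.mu

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1); state drifts back toward mu.
        dx = self.theta * (self.mu - self.state) \
            + self.sigma * np.random.randn(*self.state.shape)
        self.state += dx
        return self.state


class GaussianNoise:
    """Independent Gaussian noise at every step."""
    def __init__(self, action_dim, sigma=0.1):
        self.action_dim, self.sigma = action_dim, sigma

    def sample(self):
        return self.sigma * np.random.randn(self.action_dim)


def noisy_action(policy_action, noise, low=-1.0, high=1.0):
    """Deterministic action + exploration noise, clipped to the action bounds."""
    return np.clip(policy_action + noise.sample(), low, high)
```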
Sheesh, can't believe I missed that. My bad. Sorry.
Hi,
In ddpg.py, I assume you're following this paper by Silver et al. If so, your algorithm doesn't seem to mirror theirs. Here, you are going on-policy for the actor, but DDPG is an off-policy approach because exploration noise is added to the actions (which I can't seem to find in your code). If the policy is completely deterministic, there is no scope for exploration; and if you're following a stochastic policy instead, doesn't that defeat the purpose of DDPG?
Am I missing something?