Closed: hashbangCoder closed this issue 8 years ago
Hi @hashbangCoder, in the DDPG implementation we compute the on-policy Q value, which provides the gradient signal to improve the policy, but the sampling of trajectories is done off-policy. The DDPG class accepts a parameter called `es`, which is an abbreviation for exploration strategy. It is then used to generate an off-policy action on this line: https://github.com/rllab/rllab/blob/master/rllab/algos/ddpg.py#L224
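For concreteness, here is a minimal sketch of what that call does during sampling. This is an illustration, not the rllab source: the `get_action` signature and the gym-style `env.step` return value are assumptions.

```python
def rollout_step(env, policy, es, t, observation):
    """One environment step with off-policy exploration.

    `es.get_action` is assumed to query the deterministic policy for its
    action and then perturb it with exploration noise, mirroring the call
    at ddpg.py#L224 (exact interface approximated here).
    """
    # Behaviour action = deterministic policy output + exploration noise,
    # so the collected trajectories are off-policy.
    action = es.get_action(t, observation, policy=policy)
    next_observation, reward, terminal, _ = env.step(action)
    return action, next_observation, reward, terminal
```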
Currently we implement two exploration strategies: one is the Brownian-motion (Ornstein-Uhlenbeck) noise mentioned in the paper, and the other is Gaussian noise. You can find them here: https://github.com/rllab/rllab/tree/master/rllab/exploration_strategies
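As a rough, self-contained sketch of what those two strategies add to the deterministic action (a simplified stand-in, not the rllab implementation): the Ornstein-Uhlenbeck noise is temporally correlated across steps, while the Gaussian noise is drawn independently at every step.

```python
import numpy as np

class OUNoise:
    """Temporally correlated (Ornstein-Uhlenbeck / Brownian-motion-like) noise."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.3):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.ones(action_dim) * mu

    def reset(self):
        self.state = np.ones_like(self.state) * self.mu

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1); state drifts back toward mu.
        dx = self.theta * (self.mu - self.state) \
            + self.sigma * np.random.randn(*self.state.shape)
        self.state += dx
        return self.state


class GaussianNoise:
    """Independent Gaussian noise at every step."""
    def __init__(self, action_dim, sigma=0.1):
        self.action_dim, self.sigma = action_dim, sigma

    def sample(self):
        return self.sigma * np.random.randn(self.action_dim)


def noisy_action(policy_action, noise, low=-1.0, high=1.0):
    """Deterministic action + exploration noise, clipped to the action bounds."""
    return np.clip(policy_action + noise.sample(), low, high)
```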
Sheesh, can't believe I missed that. My bad. Sorry.
Hi,
In ddpg.py, I assume you're following this paper by Silver et al. If so, your algorithm doesn't seem to mirror theirs. Here, you are going on-policy for the actor, but DDPG is an off-policy approach because exploration noise is added to the actions (which I can't seem to find in your code). If the policy is completely deterministic, there is no scope for exploration; and if you're following a stochastic policy instead, doesn't that defeat the purpose of DDPG?
Am I missing something?