openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

DDPG seed replicability issue #277

Open · vvanirudh opened this issue 6 years ago

vvanirudh commented 6 years ago

I have noticed that if I run the DDPG code (with the default parameters and environment) twice with the same seed, I get different actor and critic losses across epochs. In fact, almost all of the logged metrics differ, even though the seed and parameters are exactly the same.

How do you ensure that runs are replicable with the same seed?
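For context, here is a minimal sketch (not the baselines code itself; the environment id is just an example) of seeding every RNG source such a run touches. Even with all of these set, some GPU kernels are nondeterministic, which is one likely culprit.

```python
import random

import gym
import numpy as np
import tensorflow as tf

seed = 0
random.seed(seed)         # Python's built-in RNG
np.random.seed(seed)      # numpy RNG (replay-buffer sampling, exploration noise)
tf.set_random_seed(seed)  # TensorFlow graph-level seed (TF 1.x API)

env = gym.make('HalfCheetah-v1')  # example environment, not from the thread
env.seed(seed)                    # seeds the environment's own RNG
```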

watts4speed commented 6 years ago

I remember reading a discussion about Caffe having a similar issue when using CUDA.

olegklimov commented 6 years ago

> How do you ensure that runs are replicable with the same seed?

Hi @vvanirudh !

Can you please explain first why you need runs to be exactly reproducible from a seed for DDPG?

Assuming you have reasons, here is what you can do:

vvanirudh commented 6 years ago

I am comparing two algorithms on the same environment (one of them being DDPG). To check that they initially encounter the same environment states, I set the seed to be exactly the same. I noticed that several runs of the same algorithm (say, DDPG) on the same environment with the same seed produced starkly different training curves.

I have tried what you have suggested:

  1. The environment is deterministic: I always get the same observations when I feed in a fixed random sequence of actions under the same seed (see the sketch after this list).
  2. No, the actions predicted by two DDPG runs stay the same for only the first 1 or 2 epochs (800 episodes of 50 time-steps each), and then they slowly diverge (hence future observations differ too, and the divergence snowballs).
  3. I haven't tested this out yet.
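A sketch of the check in point 1, assuming a gym-style environment of that era (the environment id and the 50-step action sequence are placeholders, not details from the thread): replay the same fixed action sequence twice from the same seed and compare the observation streams.

```python
import gym
import numpy as np

ENV_ID = 'HalfCheetah-v1'  # placeholder; any gym environment id works


def rollout(seed, actions):
    """Replay a fixed action sequence and record the observations."""
    env = gym.make(ENV_ID)
    env.seed(seed)
    observations = [env.reset()]
    for action in actions:
        # old gym API: step returns (obs, reward, done, info)
        obs, _, done, _ = env.step(action)
        observations.append(obs)
        if done:
            break
    return np.array(observations)


# Fixed random action sequence, drawn once with its own seed.
space = gym.make(ENV_ID).action_space
rng = np.random.RandomState(0)
actions = [rng.uniform(space.low, space.high) for _ in range(50)]

# If the environment is deterministic, both rollouts match exactly.
run_a = rollout(seed=0, actions=actions)
run_b = rollout(seed=0, actions=actions)
print('environment deterministic:', np.allclose(run_a, run_b))
```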