openai / roboschool

DEPRECATED: Open-source software for robot simulation, integrated with OpenAI Gym.

Reproducibility problem with Walker, Ant, Humanoid, HalfCheetah #105

Open acohen13 opened 6 years ago

acohen13 commented 6 years ago

I am having a problem reproducing the same run on the Walker, Ant, Humanoid and HalfCheetah environments when using the same seed for np.random.seed and env.seed. I do not have the issue with Hopper or Reacher.

I have been able to track the inconsistency to the 9th or 10th decimal of the floats returned by j.current_relative_position() for each j in the self.ordered_joints array in gym_forward_walker.py, and the problem starts a few dozen to a few hundred steps into the run. To be specific, identical action vectors in identical states produce slightly different current_relative_position() arrays. The differences accumulate and become significant in later iterations.

I do not think it is a numpy/gym seeding issue, since the runs are identical for the first hundred steps or so and there is no problem with Hopper/Reacher. I see the problem both on Ubuntu 16.04 and on a cluster running Red Hat.
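For anyone trying to reproduce this, the check described above can be sketched as follows: run the same action sequence twice from the same seed and report the first step at which the observations diverge. `first_divergence` is a hypothetical helper (not part of roboschool or gym); it assumes a gym-style env exposing `seed`/`reset`/`step` with the old 4-tuple `step` return:

```python
import numpy as np

def first_divergence(make_env, actions, seed=0, atol=1e-12):
    """Run the same action sequence twice from the same seed and
    return the index of the first observation that differs by more
    than `atol`, or None if the two runs match exactly.

    Hypothetical helper for illustration; `make_env` is any callable
    returning a gym-style env with seed/reset/step."""
    def rollout():
        env = make_env()
        env.seed(seed)
        np.random.seed(seed)
        obs = [env.reset()]
        for a in actions:
            o, _, done, _ = env.step(a)
            obs.append(o)
            if done:
                break
        return obs

    run_a, run_b = rollout(), rollout()
    for i, (oa, ob) in enumerate(zip(run_a, run_b)):
        if not np.allclose(oa, ob, rtol=0.0, atol=atol):
            return i
    return None
```

With `atol` tightened toward 1e-12, this should flag the 9th-to-10th-decimal discrepancies described above long before they grow large enough to change rewards.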

olegklimov commented 6 years ago

Wow. You must have spent quite a lot of time tracking that down. I think multithreading and timing alone could have a similar effect, which may explain the difference between environments. Can you share what you need exact determinism for? (For learned policies to be better, you normally add more nondeterminism.)

acohen13 commented 6 years ago

Thank you very much for your response.

We are using nondeterministic policies and we also don't expect the environment to necessarily be deterministic within a single experiment. The issue is that I ran our algorithm with a particular set of parameters for 300 learning iterations on Walker with 16 different starting seeds. I saw good results and wanted to see the performance after 1000 iterations with the same parameters and starting seeds. When I ran these, the first 300 of the 1000 iterations were actually significantly worse than the first experiment with only 300 iterations. In the future, when actually making adjustments to the algorithm, I'd like to be sure that a change in results is due to the adjustment and not to unlucky/lucky randomness.

Do you think there is anything I can do to address this? Do you see the same behavior on Walker or is it possible it is just on my system?

woonsangcho commented 6 years ago

I have this issue when comparing my algorithm against benchmarks. I've observed that it's not a numpy/random/env seeding issue. Since it's costly to run even one instance, the uncontrollable randomness hurts for such comparisons.

acohen13 commented 6 years ago

Do you have the problem on all domains? I've noticed that Reacher and Hopper do not have this issue but the rest of the walker-based domains do.

olegklimov commented 6 years ago

I definitely haven't seen anything like that in Humanoid. (Walker is too easy.)

All seeds are very close together for Humanoid, and I found it suitable for tuning hyperparameters.

nrontsis commented 6 years ago

I'm running Bayesian Optimisation to tune the OpenAI Baselines PPO on CPU with Roboschool as the physics backend, and I am observing non-determinism in all the environments I have tried. Is there any hint about where the non-determinism is coming from?

olegklimov commented 6 years ago

@nrontsis try inserting a sleep after initialization. If that helps, it would mean initialization in the Bullet thread is to blame (after initialization it runs synchronously). I think it's worth trying because I've observed strange things rendered on the first frame.
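A minimal sketch of that suggestion, assuming the background-initialization hypothesis is correct; `make_settled_env` is a hypothetical wrapper, not a roboschool or gym API:

```python
import time

def make_settled_env(make_fn, settle_seconds=1.0):
    """Construct an env, then pause so any background initialization
    (e.g. a Bullet physics thread) can finish before the first reset.

    Hypothetical wrapper for illustration; `make_fn` might be, say,
    lambda: gym.make("RoboschoolWalker2d-v1")."""
    env = make_fn()
    time.sleep(settle_seconds)
    return env
```

If the runs become deterministic with the sleep in place, that would point at initialization-time racing rather than per-step nondeterminism.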

nrontsis commented 6 years ago

Thanks @olegklimov Do you mean right after gym.make(...) or somewhere e.g. here?

olegklimov commented 6 years ago

Yeah, maybe; try a few things. Also see whether you get a variety of outcomes from your rollouts, or one dominant outcome with the others as outliers.
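That outcome check could be sketched like this; `outcome_summary` is a hypothetical helper (not from any library here) that flags outliers among rollout returns with a modified z-score based on the median absolute deviation:

```python
import numpy as np

def outcome_summary(returns, z=3.5):
    """Return (median_return, n_outliers) for a list of rollout
    returns, flagging outliers via a modified z-score on the median.

    Hypothetical helper: many outliers around one dominant median
    suggests a single dominant outcome plus stragglers, rather than
    a genuine spread of outcomes."""
    r = np.asarray(returns, dtype=float)
    med = np.median(r)
    mad = np.median(np.abs(r - med))
    if mad == 0.0:
        # Degenerate case: most returns identical; anything else is an outlier.
        outliers = r != med
    else:
        outliers = np.abs(r - med) / (1.4826 * mad) > z
    return med, int(outliers.sum())
```

A high outlier count with a tight median would match the "one dominant outcome, others are outliers" pattern described above.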

nrontsis commented 6 years ago

Adding a delay didn't help. After correcting some errors in my code I reached a similar conclusion to @acohen13: Hopper, Reacher and all of the Pendulums are deterministic; the other environments are not. I did not check the distribution of the rollout outcomes.

Maybe they should be flagged as non-deterministic, similarly to EnvSpec.nondeterministic?