rll / rllab

rllab is a framework for developing and evaluating reinforcement learning algorithms, fully compatible with OpenAI Gym.

Tensorflow TRPO with MountainCar Doesn't consistently converge #116

Open Breakend opened 7 years ago

Breakend commented 7 years ago

The TensorFlow TRPO implementation does not converge on every run for the OpenAI Gym MountainCar-v0 environment. Some runs converge to a good policy; others stay at -200 reward indefinitely.

Gist of code attempted: https://gist.github.com/Breakend/971fb46ce5418280cc35ff96adf95464

This seems to be an issue with this TRPO implementation, since the original TRPO code does just fine on this task. This issue is to track possible solutions/bugs found in this investigation.

hapticloud commented 7 years ago

I've also found general issues with nondeterminism. I cannot achieve the same results every time. There are several sources of nondeterminism, but I haven't found them all yet.

The seed argument to run_experiment_lite is important. This is used to seed several random number generators.

I believe n_parallel=1 is important, because a fixed batch size is collected from n_parallel workers, which may be at different spots in trajectories and episodes for the same run. However, even with n_parallel=1 in run_experiment_lite, I cannot get repeatable results. With n_parallel>1, the results are always guaranteed to be nondeterministic in general, because of the asynchronous manner in which episodes are collected.
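
For reference, a minimal sketch of pinning both of these down through run_experiment_lite, modeled on the rllab examples; the run_task body and the exact argument values here are placeholders, not a tested setup:

```python
# Hedged sketch: fix the experiment seed and use a single sampler worker.
from rllab.misc.instrument import run_experiment_lite

def run_task(*_):
    pass  # build env/policy/baseline and call algo.train() here

run_experiment_lite(
    run_task,
    n_parallel=1,   # single worker: avoids asynchronous episode collection
    seed=1,         # seeds the RNGs that run_experiment_lite controls
    snapshot_mode="last",
)
```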

Another issue may be the environment seeding (which I also tried). InvertedPendulum-v1 is a deterministic environment, and calling the standard env.seed(args.seed) (the .seed() method for the OpenAI Gym Env class) for each worker process in _worker_populate_task, along with using the above two methods, still has not yielded deterministic results.

This is an important issue ... I've observed very high variance on InvertedPendulum-v1, and the benchmarking results in the paper were computed with only five different random seeds, which seems small. The problem is made worse by the fact that even with the same seed, the results are nondeterministic.

Breakend commented 7 years ago

The seeds may be part of the issue, but I suspect there is more going on. In the original TRPO code, the benchmark results seem to occur with a seed of 0.0. I suspect the main problem is further down the stack, though, because I'm not using run_experiment_lite and I still see the same nondeterminism issues.

hapticloud commented 7 years ago

run_experiment_lite ends up calling BatchPolopt.train (like your gist), but first it sets the seed argument in each worker.

I recently achieved repeatable results with InvertedPendulum-v1 by setting both the Gym environment seed and calling ext.set_seed() at the beginning of sample_paths(), as well as using n_parallel=1. Doing either of these in isolation was not sufficient. Perhaps you could also try setting the seeds at the beginning of sample_paths().

Thus, doing the above should be sufficient for achieving repeatable results in deterministic environments with a single worker. I don't currently have an explanation for why this is sufficient (or why doing it only in _worker_populate_task was not).
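
A rough sketch of the seeding described above; seed_for_sampling is a hypothetical helper (not rllab's API), and per the description it would be called at the top of sample_paths():

```python
# Hedged sketch: seed rllab's global RNGs and the wrapped Gym environment's
# own RNG together before collecting samples.
from rllab.misc import ext

def seed_for_sampling(env, seed):
    ext.set_seed(seed)                 # numpy / random (and backend) seeds
    inner = getattr(env, "env", None)  # GymEnv keeps the raw gym.Env as .env
    if inner is not None and hasattr(inner, "seed"):
        inner.seed(seed)               # gym's RNG is separate and needs this
```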

dementrock commented 7 years ago

What do you mean by "the original TRPO code"?

Also, what version of TF are you using?

A larger batch size might help; try, say, 50000. This task may require more samples to have sufficient exploration.

dementrock commented 7 years ago

@hapticloud

With gym environments you may need to explicitly set the seed on the Gym environment instance, since it uses its own random number generator. If env is an instance of GymEnv, do env.env.seed(0), etc.
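
A minimal example of what this looks like (the environment name is just for illustration):

```python
# Hedged sketch: GymEnv wraps the raw gym.Env as .env, which has its own RNG,
# so it must be seeded explicitly in addition to rllab's own seed handling.
from rllab.envs.gym_env import GymEnv

env = GymEnv("InvertedPendulum-v1")
env.env.seed(0)  # seed the underlying gym environment directly
```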

You're right that n_parallel > 1 can lead to nondeterminism. Another source is using GPUs which sometimes gives slightly nondeterministic results.

Breakend commented 7 years ago

@dementrock I meant the modular_rl code released alongside the TRPO paper: https://github.com/joschu/modular_rl

It seems like that code more consistently converges quickly, but this may be a result of the random seed. I'm using TF 1.2. I found that the rllab version will sometimes converge quickly (<100 iterations with any batch_size, roughly once every ~50 runs); other than that, it generally won't converge even with a large batch size and many iterations (>300). This may just be a factor of randomness. One thought I had, however, is that in the TRPO paper they talk about reducing the "variance of the Q-value differences between [sample] rollouts by using the same random number sequence for the noise in each of the K rollouts (i.e. common random numbers)". It seems like this isn't happening here due to the parallelism, but I could be wrong. Do you know if it's implemented somewhere in the code here?

dementrock commented 7 years ago

One thing to look into is the magnitude of observations. The modular-rl code does some running normalization of inputs, which helps a lot in many Gym tasks.
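
For reference, a toy sketch of the kind of running mean/std input filter modular_rl applies; this is an illustration, not modular_rl's actual code:

```python
# Hedged illustration of running observation normalization: keep a running
# mean/std over everything seen and whiten each incoming observation.
import numpy as np

class RunningObsFilter(object):
    def __init__(self, shape, clip=5.0):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # sum of squared deviations (Welford update)
        self.clip = clip

    def __call__(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        self.n += 1
        delta = obs - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (obs - self.mean)
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + 1e-8
        return np.clip((obs - self.mean) / std, -self.clip, self.clip)
```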

How come you are using 1.2? I think the latest version is 1.1.0rc2?

The common random numbers should be referring to a variant of TRPO called TRPO-vine, which isn't implemented in rllab.

Breakend commented 7 years ago

@dementrock Sorry, that's a typo; I meant to type 1.1 (specifically '1.1.0-rc1').

"The modular-rl code does some running normalization of inputs, which helps a lot in many Gym tasks." Ah, that may be the case, I'll look into that. Thanks. I originally thought that's what the normalize call was doing when you wrap an env, but it seems like that just normalizes the action space? correct?

EDIT: Ah, never mind, I see there's the normalize_obs flag. But I guess this isn't quite the same as the running filter in modular_rl.
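
Turning that flag on looks roughly like this (a sketch; for the TensorFlow TRPO the result would additionally be wrapped in TfEnv):

```python
# Hedged sketch: rllab's normalize() wrapper with observation normalization
# enabled (internally it uses an exponential moving average of mean/std).
from rllab.envs.gym_env import GymEnv
from rllab.envs.normalized_env import normalize

env = normalize(GymEnv("MountainCar-v0"), normalize_obs=True)
```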

"The common random numbers should be referring to a variant of TRPO called TRPO-vine, which isn't implemented in rllab." Ahh, ok thanks.

dementrock commented 7 years ago

The implementation of normalization differs: modular_rl keeps an actual running mean/std, while rllab uses an exponential decay. Not sure if it matters.
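
A toy illustration of the difference between the two update rules; the 0.001 decay rate is an assumption, matching what I believe is rllab's default obs_alpha:

```python
import numpy as np

alpha, n = 0.001, 0
ema_mean = np.zeros(2)   # rllab-style estimate
run_mean = np.zeros(2)   # modular_rl-style estimate

def update(obs):
    global ema_mean, run_mean, n
    n += 1
    # exponential decay: recent observations dominate, old ones are forgotten
    ema_mean = (1 - alpha) * ema_mean + alpha * obs
    # exact running mean over every observation seen so far
    run_mean = run_mean + (obs - run_mean) / n
```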

Breakend commented 7 years ago

Initial results suggest that normalizing the observations (simply using the exponential decay in rllab) and forcing FiniteDifferenceHvp instead of the Perlmutter Hvp (I believe that's the default?) leads to consistent convergence (with batch size 5000). It seems the running mean/std filter is not needed. I'll try to reproduce this a few times and maybe add an example if you think that might be a good idea.
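
For reference, a sketch of the setup described above; the constructor arguments are assumptions modeled on the rllab examples, not a tested repro:

```python
# Hedged sketch: normalized observations + finite-difference HVP for the
# conjugate gradient optimizer, batch size 5000 on MountainCar-v0.
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from rllab.envs.gym_env import GymEnv
from rllab.envs.normalized_env import normalize
from sandbox.rocky.tf.algos.trpo import TRPO
from sandbox.rocky.tf.envs.base import TfEnv
from sandbox.rocky.tf.optimizers.conjugate_gradient_optimizer import (
    ConjugateGradientOptimizer, FiniteDifferenceHvp)
from sandbox.rocky.tf.policies.categorical_mlp_policy import CategoricalMLPPolicy

env = TfEnv(normalize(GymEnv("MountainCar-v0"), normalize_obs=True))
policy = CategoricalMLPPolicy("policy", env_spec=env.spec, hidden_sizes=(32, 32))
baseline = LinearFeatureBaseline(env_spec=env.spec)

algo = TRPO(
    env=env,
    policy=policy,
    baseline=baseline,
    batch_size=5000,
    max_path_length=200,   # MountainCar-v0 episode cap
    n_itr=300,
    discount=0.99,
    optimizer=ConjugateGradientOptimizer(
        hvp_approach=FiniteDifferenceHvp(base_eps=1e-5)),
)
algo.train()
```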