openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

Reproducibility: default random number generator #91

Closed pierrelux closed 8 years ago

pierrelux commented 8 years ago

Numpy's default random number generator seems to be used throughout the code (grep for np.random), for example in mountain_car.py, line 51.

It would be preferable to give the user the ability to explicitly pass a random number generator around. This is the approach adopted in scikit-learn for example (see random_state in sklearn.linear_model.RandomizedLogisticRegression).
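
As a minimal sketch of the kind of pattern being suggested (illustrative only; this simplified MountainCarEnv and its random_state parameter are not current gym code):

import numpy as np

class MountainCarEnv:
    # Hypothetical constructor: take an explicit RandomState rather than
    # relying on the module-level numpy generator.
    def __init__(self, random_state=None):
        if random_state is None:
            random_state = np.random.RandomState()
        self.random_state = random_state

    def reset(self):
        # All stochasticity draws from the injected generator, so two envs
        # constructed with identically seeded RandomStates behave the same.
        return np.array([self.random_state.uniform(low=-0.6, high=-0.4), 0.0])

env = MountainCarEnv(random_state=np.random.RandomState(0))
print(env.reset())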

JKCooper2 commented 8 years ago

In supervised learning algorithms being able to set the seed is useful so you can mess with the other parameters and see a direct comparison, but your data is also static.

In reinforcement learning your data is not static: what you've seen so far influences the actions/exploration you take in the future, and therefore the future data you see. Because the stochasticity is in the environment rather than in the algorithm's improvement process, fixing the random state to a particular value and optimising the algorithm's parameters against it (particularly those to do with exploration/learning rates) will likely result in a worse general algorithm. It builds a reliance on something the algorithm can't actually control, because in real usage the data set will be different each time, even in the same environment.

It would make the results more reproducible (particularly for more stochastic environments like Roulette), but I think merging results from multiple reproductions is a more effective way to determine an algorithm's true ability on an environment.

pierrelux commented 8 years ago

I do very much agree with you that RL algorithms should not be seed-dependent and should function robustly in the face of noise. Multiple runs (10, 100, 1000...) of the same algorithm/RL agent should be averaged to obtain a good characterization of its behavior.

However, the ability to keep the seed fixed can be very useful during the implementation phase of a new algorithm. When I refactor my code, I want to be able to check that the previous behavior hasn't been altered. Without the ability to set the seed, it would be rather difficult to write tests. Also, passing around a RandomState object doesn't entail "fixing the random state to a particular value", since the random state changes every time a new pseudo-random number is generated.
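
A rough sketch of the kind of regression test this enables (make_env, my_policy, and the env interface shown here are placeholders, and the environment is assumed to be constructed with a fixed RandomState as suggested above):

import numpy as np

def rollout(env, policy, random_state):
    # Run one episode, drawing every random decision from the supplied
    # RandomState rather than the global generator.
    rewards = []
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs, random_state)
        obs, reward, done, _ = env.step(action)
        rewards.append(reward)
    return rewards

def test_refactor_preserves_behavior():
    # Two rollouts with identically seeded generators should match exactly,
    # which is what makes this kind of check practical while refactoring.
    assert rollout(make_env(), my_policy, np.random.RandomState(42)) == \
           rollout(make_env(), my_policy, np.random.RandomState(42))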

gdb commented 8 years ago

Yes, I agree, it's important for us to add support for seeds. This is useful for debugging, and also useful for our evaluations: we should capture which seed was used, so that for multiple runs of a single algorithm we can tell whether they used the same seed or different seeds.

I'd like to do this by adding seed as an argument to monitor.start, and probably adding a seed method to env.

NeilGirdhar commented 8 years ago

Please don't pass seeds around. If you already depend on numpy, you should pass a numpy RandomState object around. This has many advantages:

Note that if a method accepts a random state object and you just want to seed it, do something like:

from scipy.stats import poisson
from numpy.random import RandomState

# A frozen scipy distribution accepts an explicitly seeded RandomState
poisson_distribution = poisson(mu=4.2)
poisson_distribution.rvs(4, random_state=RandomState(123))

This always returns array([5, 6, 3, 6]).

joschu commented 8 years ago

I agree that we should use a numpy RandomState. One question is whether to have it global for the library, or per-environment. Right now I'm leaning towards a per-environment RandomState, in case someone's doing multi-threading. Any thoughts?

NeilGirdhar commented 8 years ago

I think it's better design to have it as a per-environment member. If you had a global, you might as well use the global numpy one, which is what you're already doing.
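
For illustration, a per-environment member might look something like this (SomeEnv and its details are placeholders, not gym code):

import numpy as np

class SomeEnv:
    def __init__(self, seed=None):
        # Each instance owns its own generator, so environments running in
        # separate threads never share or perturb a global random state.
        self.random_state = np.random.RandomState(seed)

    def step(self, action):
        # All environment stochasticity draws from the per-instance generator.
        noise = self.random_state.normal(scale=0.01)
        ...

env_a = SomeEnv(seed=0)  # independently seeded...
env_b = SomeEnv(seed=1)  # ...and safe to drive from different threads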

gdb commented 8 years ago

It'd definitely be cleaner generally to pass around a RandomState. However, I think Gym has some special requirements which make me think seeds are the right way to go:

I might be missing something, but my preferred solution right now is to have a per-env seed. This would work as follows:

seed = 387387  # chosen arbitrarily
# <user sets seeds for algorithm code: might be tensorflow, numpy, whatever>
env.monitor.start(…, seed=seed)  # calls env.seed(seed) under the hood
# <then algorithm stuff>

JKCooper2 commented 8 years ago

Would it be better to create a params object that can be passed into the monitor that holds the seeds used as well as other algorithm variables the user wants to indicate are modifiable?

So you would have

params = { "env_seed": 387387, "alg_seed": 152, "alpha": 0.05 }
env.monitor.start(params=params)

If env_seed is contained within the params then the environment uses that; otherwise the environment randomly selects the env_seed and adds it to the params. The params then get uploaded to OpenAI along with the results. The env_seed and alg_seed are used to more accurately determine the variance of the model over multiple runs (if one seed is the same across multiple uploads), and the other params are either used to set the algorithm's parameters (if the author has created a new algorithm page), or checked to see that a reproduction matches the original author's required parameters (so that the variance in the model is only due to the random seeds).

For a reproduction to be considered valid it must use the same parameters (which OpenAI can now check) with the exception of the random seeds. People can still copy code and submit runs as they do now in order to see evaluations, but reproductions of algorithms just have stricter requirements.

NeilGirdhar commented 8 years ago

@gdb I'll address your concerns:

Have I missed anything?

JKCooper2 commented 8 years ago

Are we coming at this from two different sides? I thought the goal was to have the ability to make both the algorithm and the environment deterministic, even though you are calling random functions, so you can easily compare algorithm parameter changes.

And that the issue was: if you were combining uploaded results from multiple runs, and one user submitted 100 evaluations with the algorithm seed set to 999 and the environment seed to 137137, then the performance distribution for the algorithm would be way off its true ability on that environment.

There are no issues as far as I can tell if the user does use actual random values; however they choose to do it, that's actually a good result (although knowing the seeds used would be very slightly better).

NeilGirdhar commented 8 years ago

Is there any reason not to use keyword parameters as your "parameters object"?

Anyway, this is what I suggest doing in your constructor:

import numbers
import numpy as np

def __init__(self, random_state=None, …):
    if random_state is None:
        # Default to numpy's global generator
        random_state = np.random
    elif isinstance(random_state, numbers.Integral):
        # Treat an integer as a seed and wrap it in a RandomState
        random_state = np.random.RandomState(random_state)
    self.random_state = random_state

If you want per-call random-state, you can use the same pattern in each function, but using self.random_state instead of np.random as your default in the first case.

Then when you need random numbers, just use self.random_state.rand, etc.
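
For concreteness, the per-call variant described above might look like this (SomeEnv and sample_noise are illustrative names only):

import numbers
import numpy as np

class SomeEnv:
    def __init__(self, random_state=None):
        if random_state is None:
            random_state = np.random
        elif isinstance(random_state, numbers.Integral):
            random_state = np.random.RandomState(random_state)
        self.random_state = random_state

    def sample_noise(self, random_state=None):
        # Per-call variant: default to the instance's generator instead of
        # np.random, but allow a one-off override for this call.
        if random_state is None:
            random_state = self.random_state
        elif isinstance(random_state, numbers.Integral):
            random_state = np.random.RandomState(random_state)
        return random_state.rand()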

JKCooper2 commented 8 years ago

I'm saying that you could pass the algorithm parameters into the monitor through a params argument. This allows a user to specify however many custom parameters and values they want that are applicable to their algorithm, without having to set them up manually. The only use for those parameters is for comparing reproductions.

If you required specific arguments to be set in monitor.start(), then if I wrote an algorithm that had 5 algorithm-specific parameters, and you wrote one that used 5 different parameters because we are writing a different style of algorithm, the monitor.start call would have to accept 10 different parameter types, and be updated each time someone wants to use a new param type or call it something different.

The params dictionary I'm referring to would just be a container for everything specified in the algorithm, with the only value that actually influences the environment being whether they set params['env_seed']. That field would use a similar structure to what you stated, except it would look like:

import numpy as np

def __init__(self, alg_params={}, …):
    if 'env_seed' not in alg_params:
        # Randomly select a fixed seed to seed the environment with (one possible choice)
        alg_params['env_seed'] = np.random.randint(2 ** 31)
    if 'alg_seed' not in alg_params:
        alg_params['alg_seed'] = None  # Just so OpenAI knows it wasn't set

Then on upload, the alg_params dictionary is just converted to JSON and uploaded. All arguments except the seeds are used to determine whether the reproduction parameters match those specified by the original author (alpha is 0.5, hidden layers = 2, bins used = 10, etc.). If the reproduction matches, the result can be merged with the original author's results, and then the env_seed and alg_seed are used to make sure the scores aren't biased towards someone who submitted 50 results using the same alg_seed or env_seed.
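
A rough sketch of the server-side check being described (entirely illustrative; no such OpenAI-side function exists):

SEED_KEYS = {'env_seed', 'alg_seed'}

def is_valid_reproduction(original_params, reproduction_params):
    # A reproduction must match every parameter the original author set,
    # except the random seeds, which are expected to differ between runs.
    def strip_seeds(params):
        return {k: v for k, v in params.items() if k not in SEED_KEYS}
    return strip_seeds(original_params) == strip_seeds(reproduction_params)

# Same alpha, different seeds -> counts as a valid reproduction.
original = {'env_seed': 387387, 'alg_seed': 152, 'alpha': 0.05}
repro = {'env_seed': 11, 'alg_seed': 22, 'alpha': 0.05}
assert is_valid_reproduction(original, repro)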

NeilGirdhar commented 8 years ago

I agree with everything there, except why would you want your algorithm to use a different RNG than your environment? Also, I still think using seeds is a bad idea for all the reasons I suggested above. You can still serialize RNG states, which is more flexible.

gdb commented 8 years ago

@NeilGirdhar: Thanks for the thoughtful comments.

I think some environments will have their own randomness sources which we can only access by setting a seed (for example, I've only skimmed a bit, but I think this is true for ALE: https://github.com/mgbellemare/Arcade-Learning-Environment#list-of-command-line-parameters). So, I am not sure that RandomState is a viable option. (Of course, you could use the RandomState to draw a seed, but that sort of defeats the purpose.)

I think one major difference between scipy and gym is that scipy functions are morally "your code" (i.e. code that you'll want to customize and integrate directly into your algorithm) while gym code is morally "benchmark code" (i.e. blackbox code which should be used in one standard, correct way). Thus, we need a much narrower and less flexible interface than for scipy.

As well, environments currently have no interesting mutable state to pickle (i.e. they can be pickled and unpickled, but any partial progress on an episode is lost). I would actively prefer to keep it that way for now, since again it keeps the surface area contained.

@JKCooper2 I think it's better not to impose partial structure on the algorithm params dictionary. It's always a bit confusing to document partially structured dictionaries, you run a vague risk of colliding with a name someone was hoping to use for their own purposes, and most importantly it's a less clean interface.

From this conversation, I'd like to:

  1. Add a seed method to environment, which sets up whatever per-env environment randomness source there is.
  2. Add a seed argument to monitor.start. This both records the seed used in the monitor's env_info, and calls seed on the environment.
  3. Add an algorithm_params dictionary to monitor.start, which again records the params into the monitor's env_info.

There's also a bit of cleanup work to actually make part 1 work — i.e. move everything to not use the global random number generator.
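
To make the plan above concrete, here is a very rough sketch of how points 1 to 3 could fit together (method names, attributes like np_random and env_info, and signatures are guesses, not the final interface):

import numpy as np

class Env:
    def seed(self, seed=None):
        # Point 1: give each env its own randomness source, replacing any
        # use of the global numpy generator inside the environment code.
        self.np_random = np.random.RandomState(seed)

class Monitor:
    def start(self, seed=None, algorithm_params=None):
        # Points 2 and 3: record the seed and algorithm params alongside the
        # results, and forward the seed to the wrapped environment.
        self.env_info = {'seed': seed, 'algorithm_params': algorithm_params or {}}
        self.env.seed(seed)  # self.env is assumed to be set elsewhere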

@NeilGirdhar I know you are probably not sold on this approach, and I'll admit there's a chance we'll need to change it later. However, I believe this is the best approach for now given the constraints of the project. I really appreciate you bringing this up, and your arguments have helped me get a lot better perspective on the tradeoffs here.

gdb commented 8 years ago

(Also @JKCooper2, happy to review if you're interested in implementing! Also, I think we should just do a _seed method on env, with no corresponding seed public method for now. That'll leave us more flexibility to change for the future.)

JKCooper2 commented 8 years ago

That sounds like a good place to start. The only open question will be, when running multiple environments, if you choose to set up different seeds for each environment, how these should be combined together.

@gdb I'm heading away to Japan for 2 weeks from Sunday but if it still needs doing when I get back I'm happy to give it a go

gdb commented 8 years ago

I think this works fine for multiple envs: each is seeded independently.
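
For example (sketch only; make_env is a placeholder):

seeds = [387387, 387388, 387389]
envs = [make_env() for _ in seeds]
for env, seed in zip(envs, seeds):
    env.monitor.start(…, seed=seed)  # each env seeded independently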

Enjoy the trip! This shouldn't take long so we'll likely have it done by the time you're back :).

JKCooper2 commented 8 years ago

What I meant more was that if the seeds get merged together into a list when they are uploaded to OpenAI, I'm not sure the server will be able to determine very much, as there isn't any relation between the list of seeds used and the list of rewards (because the environments may have clumps of timestamps together). I don't think this is the worst situation; it just means, to start off with, that you will only be able to reduce the bias of submitted env seeds when the seed used is the same for all environments in that submission.

Because most of the bias issue appears to be with the algorithm seed, I can't see this causing any problems, but it's something I think we should be aware of.