openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

Write more documentation about environments #106

Closed joschu closed 3 years ago

joschu commented 8 years ago

We should write a more detailed explanation of every environment, in particular, how the reward function is computed.

JKCooper2 commented 8 years ago

Here's how I imagined a basic environment documentation page looking: link. Let me know if you have any suggestions, and then I'll transfer it to Markdown and see how it looks.

nealmcb commented 8 years ago

@JKCooper2, thanks for the link. That page seems pretty complete.

I'm not sure how you see the Challenge section working. Can you add or point to some examples?

But I wonder where we want to document these things. To avoid duplication, I'd think the code should contain the primary documentation, including what is documented in your sections 1 and 2: Overview and Environment. I suggest we just document each aspect in the appropriate part of the code. There could ideally be a standard build tool to pull out the appropriate documentation (via tags of some sort?) and update a main landing page on the site for each environment, presumably https://gym.openai.com/envs/CartPole-v0, which also documents how various algorithms have worked on it.
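As a rough illustration, such a build step might look something like the sketch below (a minimal sketch, assuming each environment class keeps its primary documentation in its class docstring; the markdown layout is just an example):

import gym

def docstring_to_markdown(env_id):
    """Render an environment's class docstring as a simple markdown page."""
    env = gym.make(env_id)
    doc = env.unwrapped.__class__.__doc__ or "(no docstring found)"
    return "# {}\n\n{}\n".format(env_id, doc.strip())

# A real build tool would loop over the registry; here just one page as an example.
with open("CartPole-v0.md", "w") as f:
    f.write(docstring_to_markdown("CartPole-v0"))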

For the "Research" section, a link to some pages on the site that describe relevant research would also be fine in the code. Or we might just want to just link from the landing page to an associated wiki page on the site, that could discuss research, proposed algorithms, etc.

JKCooper2 commented 8 years ago

For the Challenge section I was thinking it would list criteria that let you tell whether your algorithm could theoretically solve the environment (ignoring hyper-parameters / computational limitations).

Examples would include:

CartPole: 1B, 2A, 3A, 4A
Acrobot: 1B, 2A, 3A, 4B, 9
MountainCar: 1B, 2A, 3A, 4A, 9
Pendulum: 1B, 2B, 3A
Taxi: 1A, 2A, 3B, 4B, 7
Pacman: 1B, 2A, 3B, 4B, 5, 7
Pacman-Ram: 1B, 2A, 3B, 4B, 5, 7, 8 (maybe)

The idea then being that if I create an algorithm that can handle 1B, 2A, 3A, 3B, 4A, 4B, 9, it should be capable of solving (with only hyper-parameter changes) CartPole and Acrobot, but won't be capable of solving any of the others.

That's obviously an incomplete set, and some of the definitions may not make sense or may need to be altered, but it would have a lot of benefits:

  1. Easy to tell what your algorithm will/won't work on and why
  2. Focus for creating environments that have challenge sets that don't already exist
  3. Good comparability on algorithms with matching 'abilities'
  4. Focus for creating algorithms with specific abilities
  5. Much simpler to take an RL agent and apply it to a real-world problem (just define the problem and select an algorithm that meets the challenge criteria)
  6. Test task recognition and generalisation over similar environments (based on matching challenge sets)
JKCooper2 commented 8 years ago

For the environment documentation I was imagining something like a project/assignment description. I don't think people should need to look in the code for information about how the environment works, and I would prefer it to be listed independently even if that means some duplication (although not a lot, because it would only be updated if the environment version changes).

The Research section was partly to make it easier for researchers to identify what relevant papers exist for what they are working on, but also to encourage people to replicate existing research algorithms in order to improve the benchmark quality. I think replicated published research algorithms should be given special treatment, or at least marked differently, so people can easily see "This algorithm is a working copy of this paper". Having wiki-style info regarding the papers could be useful, but I think it would work better to have links from the environment documentation's research section to a summary page for the paper that has that information.

I was thinking it would sit on something like https://readthedocs.org/, where the documentation would be updated via git and you can have sub-menus on the side to choose which section you're viewing.

Discussion and comments should be separate from the documentation, maybe in a forum. The goals as I see them are to make it simple for people to understand the task (docs), share relevant information (research), come up with new ideas (forum/gitter), and focus effort (challenges/requests for research).

nealmcb commented 8 years ago

Thanks. I agree that the documentation should be clear about the research it was based on, etc., and the forum/wiki would just be to make it easier for folks to comment and add information.

Re: "Challenge", that was the sense I had, and your details help a lot.

The code already defines a lot of this with great precision via the Env.{action_space, observation_space, reward_range} variables. I'm hoping it would be easier to just capture that information via introspection, hopefully as part of the build process, and automatically generate a concise and easy-to-use representation of it for inclusion in the documentation. Otherwise, we run the risk of the documentation lagging behind the code or disagreeing with it.

I haven't yet looked at enough environments here to be sure what you mean by 7, 8, 9, but more generally, a useful categorization scheme for AI environments, based on Russell and Norvig (2009), is at https://en.wikibooks.org/wiki/Artificial_Intelligence/AI_Agents_and_their_Environments:

"[they] can be remembered with the mnemonic "D-SOAKED." They are:

  • Deterministicness (deterministic, stochastic, or non-deterministic): An environment is deterministic if the next state is perfectly predictable given knowledge of the previous state and the agent's action.
  • Staticness (static or dynamic): Static environments do not change while the agent deliberates.
  • Observability (full or partial): A fully observable environment is one in which the agent has access to all information in the environment relevant to its task.
  • Agency (single or multiple): If there is at least one other agent in the environment, it is a multi-agent environment. Other agents might be apathetic, cooperative, or competitive.
  • Knowledge (known or unknown): An environment is considered to be "known" if the agent understands the laws that govern the environment's behavior. For example, in chess, the agent would know that when a piece is "taken" it is removed from the game. On a street, the agent might know that when it rains, the streets get slippery.
  • Episodicness (episodic or sequential): Sequential environments require memory of past actions to determine the next best action. Episodic environments are a series of one-shot actions, and only the current (or recent) percept is relevant. An AI that looks at radiology images to determine if there is a sickness is an example of an episodic environment. One image has nothing to do with the next.
  • Discreteness (discrete or continuous): A discrete environment has fixed locations or time intervals. A continuous environment could be measured quantitatively to any level of precision."

As far as I've seen, the gym might not yet support some of these options. But adding a way to encode, document, and perhaps declare (in the code) the rest of those would indeed be helpful. I imagine this has come up before - does anyone know if we can leverage other existing work on a typology of environments?
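In the meantime, one lightweight way to declare such properties could look like the following (a sketch only; the field names and the idea of keeping them in a side table are my own guesses, not anything gym supports today):

# Hypothetical per-environment metadata along the D-SOAKED axes.
ENV_PROPERTIES = {
    "CartPole-v0": {
        "deterministic": True,
        "static": True,
        "fully_observable": True,
        "single_agent": True,
        "known_dynamics": True,
        "episodic": False,          # sequential: past actions matter
        "discrete_actions": True,
        "continuous_observations": True,
    },
}

def describe(env_id):
    """Print a one-line summary of the declared properties for an environment."""
    props = ENV_PROPERTIES.get(env_id, {})
    flags = ", ".join("{}={}".format(k, v) for k, v in sorted(props.items()))
    print("{}: {}".format(env_id, flags or "no properties declared"))

describe("CartPole-v0")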

nealmcb commented 8 years ago

To play around with extracting useful documentation automatically from the code, I wrote a little program to query a bunch of things about the environments for display in a markdown table. Here is the code and the output, on the subset for which make() works for me at the moment. A few variables which don't vary in this dataset are commented out.

from gym import envs

class NullE:
    def __init__(self):
        self.observation_space = self.action_space = self.reward_range = "N/A"

envall = envs.registry.all()

table = "|Environment Id|Observation Space|Action Space|Reward Range|tStepL|Trials|rThresh|\n" # |Local|nonDet|kwargs|
table += "|---|---|---|---|---|---|---|\n"

for e in envall:
    try:
        env = e.make()
    except Exception:
        env = NullE()
        continue  #  Skip these for now
    table += '| {}|{}|{}|{}|{}|{}|{}|\n'.format(e.id,   #|{}|{}|{}
       env.observation_space, env.action_space, env.reward_range,
       e.timestep_limit, e.trials, e.reward_threshold) # ,
       # getattr(e, 'local_only', -1), e.nondeterministic, getattr(e, 'kwargs', ""))

print(table)
| Environment Id | Observation Space | Action Space | Reward Range | tStepL | Trials | rThresh |
|---|---|---|---|---|---|---|
| CartPole-v0 | Box(4,) | Discrete(2) | (-inf, inf) | 200 | 100 | 195.0 |
| NChain-v0 | Discrete(5) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| RepeatCopy-v0 | Discrete(6) | Tuple(Discrete(2), Discrete(2), Discrete(5)) | (-inf, inf) | 200 | 100 | 75.0 |
| Reverse-v0 | Discrete(3) | Tuple(Discrete(2), Discrete(2), Discrete(2)) | (-inf, inf) | 200 | 100 | 25.0 |
| ReversedAddition-v0 | Discrete(4) | Tuple(Discrete(4), Discrete(2), Discrete(3)) | (-inf, inf) | 200 | 100 | 25.0 |
| Acrobot-v0 | Box(4,) | Discrete(3) | (-inf, inf) | 200 | 100 | -100 |
| FrozenLake-v0 | Discrete(16) | Discrete(4) | (-inf, inf) | 100 | 100 | 0.78 |
| Taxi-v1 | Discrete(500) | Discrete(6) | (-inf, inf) | 200 | 100 | 9.7 |
| Pendulum-v0 | Box(3,) | Box(1,) | (-inf, inf) | 200 | 100 | None |
| OneRoundNondeterministicReward-v0 | Discrete(1) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| ReversedAddition3-v0 | Discrete(4) | Tuple(Discrete(4), Discrete(2), Discrete(3)) | (-inf, inf) | 200 | 100 | 25.0 |
| Roulette-v0 | Discrete(1) | Discrete(38) | (-inf, inf) | 100 | 100 | None |
| MountainCar-v0 | Box(2,) | Discrete(3) | (-inf, inf) | 200 | 100 | -110.0 |
| FrozenLake8x8-v0 | Discrete(64) | Discrete(4) | (-inf, inf) | 200 | 100 | 0.99 |
| DuplicatedInput-v0 | Discrete(6) | Tuple(Discrete(2), Discrete(2), Discrete(5)) | (-inf, inf) | 200 | 100 | 9.0 |
| Blackjack-v0 | Tuple(Discrete(32), Discrete(11), Discrete(2)) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| Copy-v0 | Discrete(6) | Tuple(Discrete(2), Discrete(2), Discrete(5)) | (-inf, inf) | 200 | 100 | 25.0 |
| TwoRoundDeterministicReward-v0 | Discrete(3) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| TwoRoundNondeterministicReward-v0 | Discrete(3) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| OneRoundDeterministicReward-v0 | Discrete(1) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
tlbtlbtlb commented 8 years ago

That table is extremely useful.

JKCooper2 commented 8 years ago

I like the table. It should be possible to export the bounds of the Box spaces as well with some minor adjustments. The environments will only change very rarely, so I wouldn't get too hung up on having it exported. The D-SOAKED listing is good, but I don't think it covers enough of the agent's required abilities; e.g. all of the environments in the classic control section fall under the same D-SOAKED criteria, yet you can't take all of the algorithms that solved one and have them solve the rest.
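For the bounds, a small formatter like this could feed into the table script (a rough sketch that only special-cases Box spaces; other space types fall back to their normal string form):

import numpy as np
from gym import spaces

def describe_space(spc):
    """Format a space for the table, including the bounds when it is a Box."""
    if isinstance(spc, spaces.Box):
        return "Box{} [{}, {}]".format(spc.shape, np.min(spc.low), np.max(spc.high))
    return str(spc)

# e.g. describe_space(gym.make("Pendulum-v0").action_space) might give "Box(1,) [-2.0, 2.0]"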

For '7. Goal changes over time', an example could be Reacher, which has a randomly located target, as opposed to Acrobot, where the target is always the same. This can also mean that a straight decaying exploration rate mightn't be effective.

For '8. Observation state incomplete', I mean that the agent could be given information that it needs to use in future states, e.g. Reacher where the agent is told the position and then has 50 time steps (without being retold the position; it just gets information about its own location) to reach towards the goal, being scored on the 50th step.

'9. Fixed reward surrounding initial state' is to cover exploration scenarios like MountainCar, Acrobot, and FrozenLake, where the agent needs to perform a long chain of actions in order to see a different reward.

nealmcb commented 8 years ago

Yes - adding the bounds is on my list. I actually think that the repr function of a space should conform to the norm and return a full, eval-able string that includes the bounds. Perhaps the str function could return what repr does now, for simplicity and better upward compatibility. And for convenience, the constructor should accept a list of low or high bounds, not just an array of them.
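Concretely, the kind of repr I have in mind would look roughly like this (a sketch only; the subclass is just for illustration, the real change would be to gym.spaces.Box itself):

import numpy as np
from gym import spaces

class VerboseBox(spaces.Box):
    """Illustration: a Box whose repr includes the bounds, so it could be eval-able
    once the constructor also accepts plain lists for low/high."""
    def __repr__(self):
        return "Box(low={}, high={})".format(self.low.tolist(), self.high.tolist())

b = VerboseBox(low=np.array([-1.0, -1.0]), high=np.array([1.0, 1.0]))
print(repr(b))  # Box(low=[-1.0, -1.0], high=[1.0, 1.0])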

In the meantime, here is a version of the table sorted by parameterization, with hot links, for your viewing pleasure. One of the values of generating the documentation from the source, or at least from a nice clean machine-readable format, is the ease of sorting, comparing, searching etc.

Note that it seems that some of the environments don't have a page on the gym site yet, and generate "Something went wrong! We've been notified and are fixing it.". E.g. https://gym.openai.com/envs/OneRoundNondeterministicReward-v0

| Environment Id | Observation Space | Action Space | Reward Range | tStepL | Trials | rThresh |
|---|---|---|---|---|---|---|
| MountainCar-v0 | Box(2,) | Discrete(3) | (-inf, inf) | 200 | 100 | -110.0 |
| SemiSupervisedPendulumRandom-v0 | Box(3,) | Box(1,) | (-inf, inf) | 1000 | 100 | None |
| SemiSupervisedPendulumDecay-v0 | Box(3,) | Box(1,) | (-inf, inf) | 1000 | 100 | None |
| SemiSupervisedPendulumNoise-v0 | Box(3,) | Box(1,) | (-inf, inf) | 1000 | 100 | None |
| Pendulum-v0 | Box(3,) | Box(1,) | (-inf, inf) | 200 | 100 | None |
| CartPole-v0 | Box(4,) | Discrete(2) | (-inf, inf) | 200 | 100 | 195.0 |
| Acrobot-v0 | Box(4,) | Discrete(3) | (-inf, inf) | 200 | 100 | -100 |
| InterpretabilityCartpoleObservations-v0 | Box(4,) | Tuple(Discrete(2), Box(4,), Box(4,), Box(4,), Box(4,), Box(4,)) | (-inf, inf) | 1000 | 100 | None |
| InterpretabilityCartpoleActions-v0 | Box(4,) | Tuple(Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2), Discrete(2)) | (-inf, inf) | 1000 | 100 | None |
| OneRoundNondeterministicReward-v0 | Discrete(1) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| OneRoundDeterministicReward-v0 | Discrete(1) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| Roulette-v0 | Discrete(1) | Discrete(38) | (-inf, inf) | 100 | 100 | None |
| FrozenLake-v0 | Discrete(16) | Discrete(4) | (-inf, inf) | 100 | 100 | 0.78 |
| TwoRoundDeterministicReward-v0 | Discrete(3) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| TwoRoundNondeterministicReward-v0 | Discrete(3) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| Reverse-v0 | Discrete(3) | Tuple(Discrete(2), Discrete(2), Discrete(2)) | (-inf, inf) | 200 | 100 | 25.0 |
| ReversedAddition-v0 | Discrete(4) | Tuple(Discrete(4), Discrete(2), Discrete(3)) | (-inf, inf) | 200 | 100 | 25.0 |
| ReversedAddition3-v0 | Discrete(4) | Tuple(Discrete(4), Discrete(2), Discrete(3)) | (-inf, inf) | 200 | 100 | 25.0 |
| NChain-v0 | Discrete(5) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| Taxi-v1 | Discrete(500) | Discrete(6) | (-inf, inf) | 200 | 100 | 9.7 |
| Copy-v0 | Discrete(6) | Tuple(Discrete(2), Discrete(2), Discrete(5)) | (-inf, inf) | 200 | 100 | 25.0 |
| RepeatCopy-v0 | Discrete(6) | Tuple(Discrete(2), Discrete(2), Discrete(5)) | (-inf, inf) | 200 | 100 | 75.0 |
| DuplicatedInput-v0 | Discrete(6) | Tuple(Discrete(2), Discrete(2), Discrete(5)) | (-inf, inf) | 200 | 100 | 9.0 |
| FrozenLake8x8-v0 | Discrete(64) | Discrete(4) | (-inf, inf) | 200 | 100 | 0.99 |
| OffSwitchCartpole-v0 | Tuple(Discrete(2), Box(4,)) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
| Blackjack-v0 | Tuple(Discrete(32), Discrete(11), Discrete(2)) | Discrete(2) | (-inf, inf) | 1000 | 100 | None |
Timopheym commented 8 years ago

@gdb why not open a wiki here, so we can move this awesome table there and have community-driven documentation?

nealmcb commented 8 years ago

I like that idea, @Timopheym. I don't know if we want to use the wiki feature here, but I decided to "Be Bold" as we say on Wikipedia, and went ahead and put up an example of using it for this. I also expanded the table to 158 environments: all the ones I could "make" with a standard pip install gym[all]: https://github.com/openai/gym/wiki/Table-of-environments

gdb commented 8 years ago

(Enabled the wiki! Please make edits!)

kovacspeter commented 8 years ago

It would be great if there were also bounds for Box actions, e.g. [-1, 1]. Also, the MuJoCo environments are currently missing.

nkming2 commented 7 years ago

The table is currently not shown correctly on the wiki page. This patch should fix that. Cheers https://gist.github.com/nkming2/f04d7a350d1e497014b23258ea9f4304

abhigenie92 commented 7 years ago

Is there a way to define an environment where I can change the action space at each step?

aurelien-clu commented 6 years ago

@abhigenie92 How is the action space changing at each step?

If it changes across a few expected configurations, you could predefine those spaces and switch between them at each step.
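For example (a rough sketch; the two Discrete spaces and the pattern of reassigning self.action_space in step() are just illustrative, not an established gym idiom):

import gym
from gym import spaces

class SwitchingActionEnv(gym.Env):
    """Toy env that alternates between two predefined action spaces each step."""
    def __init__(self):
        self._spaces = [spaces.Discrete(2), spaces.Discrete(4)]
        self.observation_space = spaces.Discrete(1)
        self.action_space = self._spaces[0]
        self._t = 0

    def reset(self):
        self._t = 0
        self.action_space = self._spaces[0]
        return 0

    def step(self, action):
        assert self.action_space.contains(action)
        self._t += 1
        # Expose the action space expected on the next step.
        self.action_space = self._spaces[self._t % len(self._spaces)]
        return 0, 0.0, self._t >= 10, {}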

Otherwise I think you need to define your own Space class by extending gym.Space: https://github.com/openai/gym/blob/master/gym/core.py

madvn commented 6 years ago

Does something like this exist for the MuJoCo environments? I am especially interested in finding the values of simulation-specific parameters in MuJoCo, such as 'dt', and also the termination conditions.

rkaplan commented 6 years ago

Also wondering if there are more details about the MuJoCo environments. It would be nice to have more information about them on the website. Specifically I'm trying to check which MuJoCo environments are deterministic / stochastic.

ling-pan commented 6 years ago

I am wondering what each byte in the RAM means. Could anyone explain each field in the RAM, please?

nikonikolov commented 6 years ago

Hey, I fully agree there should be more documentation about environments. In my personal experience the most commonly needed information is:

  1. Observation space: type, shape, limits, and interpretation of the components (if any, e.g. position, speed, etc.)
  2. Action space: type, shape, limits, and interpretation of the components
  3. Deterministic or stochastic. If stochastic, in what way exactly

It is not that this information cannot be found, but it usually takes much more time than it would if the information were properly summarized. For example, there is currently no documentation about stochasticity in the MuJoCo environments, and only a couple have information about the interpretation of the components in the observation/action space. For the Atari environments, there is no clear documentation about all the different versions, and one has to dig through the code.
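Even a small helper for items 1-3 would go a long way (a rough sketch; the nondeterministic flag from the registry is only a partial answer to point 3):

import gym

def summarize(env_id):
    """Print the space types, shapes, limits, and the registry's nondeterministic flag."""
    env = gym.make(env_id)
    for name in ("observation_space", "action_space"):
        spc = getattr(env, name)
        print("{}: {}".format(name, spc))
        if hasattr(spc, "low"):  # Box spaces expose per-component limits
            print("  low:  {}".format(spc.low))
            print("  high: {}".format(spc.high))
    print("nondeterministic: {}".format(env.spec.nondeterministic))

summarize("MountainCar-v0")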

I have already collected some info which is currently not in the wiki (mainly about the Atari environments, but it is very likely that I will also have to do the same for the MuJoCo ones). I really want to share this info on the wiki. Is there a required/recommended way to do this, or can I just follow the current examples such as https://github.com/openai/gym/wiki/BipedalWalker-v2?

bionicles commented 5 years ago
import random
import string

import gym

class String(gym.Space):
    """A space of variable-length strings drawn from a fixed alphabet."""
    def __init__(self, length=None, min_length=1, max_length=280):
        self.length = length
        self.min_length = min_length
        self.max_length = max_length
        self.letters = string.ascii_letters + " .,!-"

    def sample(self):
        length = self.length or random.randint(self.min_length, self.max_length)
        return "".join(random.choice(self.letters) for _ in range(length))

    def contains(self, x):
        return isinstance(x, str) and self.min_length <= len(x) <= self.max_length
ghost commented 5 years ago

@nikonikolov could you please share your info on the Atari environments? I'm finding it very hard to figure them out.

nikonikolov commented 5 years ago

Below is the info I have from my logs. This is from a few months ago, and I have not checked if there have been any changes since then.

AtariEnvNoFrameskip-v4: frameskip of 1 (the agent sees and acts on every frame); no action-repeat stochasticity.

AtariEnvNoFrameskip-v0: frameskip of 1; repeat_action_probability = 0.25 ("sticky actions").

AtariEnvDeterministic-v4: fixed frameskip of 4 (3 for SpaceInvaders); no action-repeat stochasticity.

AtariEnvDeterministic-v0: fixed frameskip of 4 (3 for SpaceInvaders); repeat_action_probability = 0.25.

AtariEnv-v4: frameskip sampled uniformly from {2, 3, 4} at each step; no action-repeat stochasticity.

AtariEnv-v0: frameskip sampled uniformly from {2, 3, 4} at each step; repeat_action_probability = 0.25.

Additional points to bear in mind:

Please someone correct me if I got anything wrong.

nealmcb commented 5 years ago

I dare say this should be part of the Gym codebase and integrated into updates to the algorithms. But for now, here is the latest version of the code, with more sensible natural sorting integrated into it; I used it just now to update the table in the wiki with the wealth of new environments in Gym.

import re
from operator import attrgetter
from gym import envs

class NullE:
    def __init__(self):
        self.observation_space = self.action_space = self.reward_range = "N/A"

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)

    >>> alist = [
    ...          'Orange County--1-3-288-117',
    ...          'Orange County--48256-242',
    ...          'Orange County--1-3-388-203',
    ...          'Orange County--1-19-19-150',
    ...          'Orange County--1-1-64-290',
    ...          'Orange County--1-1-55-256']
    >>> alist.sort(key=natural_keys)
    >>> from pprint import pprint
    >>> pprint(alist)
    [u'Orange County--1-1-55-256',
     u'Orange County--1-1-64-290',
     u'Orange County--1-3-288-117',
     u'Orange County--1-3-388-203',
     u'Orange County--1-19-19-150',
     u'Orange County--48256-242']
    '''

    return [atoi(c) for c in re.split(r'(\d+)', text.split('|')[2])]

def atoi(text):
    "Convert text to integer, or return it unmodified if it isn't numeric"

    return int(text) if text.isdigit() else text

# TODO: Make first column a link, e.g. to [WizardOfWor-ram-v0](https://gym.openai.com/envs/WizardOfWor-ram-v0)
envall = envs.registry.all()

URL_PREFIX = 'https://gym.openai.com/envs'

table = []
for e in envall:
    try:
        env = e.make()
    except Exception:
        env = NullE()
        continue  #  Skip these for now
    table.append('| {}|{}|{}|{}|{}|{}|{}|'.format(
       '[%s](%s/%s)' % (e.id, URL_PREFIX, e.id),
       env.observation_space, env.action_space, env.reward_range,
       e.timestep_limit, e.trials, e.reward_threshold)) # ,
       # getattr(e, 'local_only', -1), e.nondeterministic, getattr(e, 'kwargs', ""))

    # if len(table) > 30:  # For quicker testing
    #   break

# Sort by 2nd column: Observation Space name
table = sorted(table, key=natural_keys)

# Add headings
table = ["|Environment Id|Observation Space|Action Space|Reward Range|tStepL|Trials|rThresh", # |Local|nonDet|kwargs|
         "|---|---|---|---|---|---|---|"] + table

print('\n'.join(table))
KiaraGrouwstra commented 4 years ago

env tables inspired by @nealmcb but using Pandas in a notebook:

from collections import OrderedDict
from operator import mul
from functools import reduce
import numpy as np
import pandas as pd
from gym import envs

def space_size(spc):
  '''number of bytes in a space'''
  return 1 if not spc.shape else spc.dtype.itemsize * reduce(mul, spc.shape, 1)

def space_cont(spc):
  '''whether a space is continuous'''
  return np.issubdtype(spc.dtype, np.floating)

def env_props(env):
  obs = env.observation_space
  act = env.action_space
  return OrderedDict([
    ('name', env.spec.id),
    ('obs_cont', space_cont(obs)),
    ('obs_size', space_size(obs)),
    ('stochastic', env.spec.nondeterministic),  # - deterministic vs stochastic (~= ^?)
    ('act_cont', space_cont(act)),
    ('act_size', space_size(act)),
#     ('reward_range', env.reward_range),
#     ('timestep_limit', env.timestep_limit),
#     ('trials', env.trials),
#     ('reward_threshold', env.reward_threshold),
  ])

def make_env(env_):
    try:
        env = env_.make()
    except Exception:
        env = None
    return env

envall = envs.registry.all()
env_list = [make_env(spec) for spec in envall]  # avoid shadowing the imported `envs` module
env_list = [env for env in env_list if env is not None]

rows = [env_props(env) for env in env_list]

# our env dataframe, show in a notebook cell
df = pd.DataFrame(rows)
df

# and a pivot!
mean = lambda x: round(np.mean(x), 2)
idx = ['obs_cont', 'act_cont', 'stochastic']
aggs = {
    'name': len,
    'obs_size': mean,
    'act_size': mean,
}
pd.pivot_table(df, index=idx, aggfunc=aggs)
Amritpal-001 commented 4 years ago

@joschu @JKCooper2 @nealmcb I compiled documentation on the Fetch environments: https://link.medium.com/CV6la7YfV7. I have tried to cover the observation and action variables, the reward function, and a comparison among all 4 Fetch environments.

I hope it's helpful. Please add more info if you have any.

jkterry1 commented 3 years ago

Closing in favor of #2276