openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

Random actions in Breakout? #559

Closed: Hvass-Labs closed this issue 7 years ago

Hvass-Labs commented 7 years ago

OpenAI Gym version 0.8.1 running on Ubuntu Linux with TensorFlow 1.0.1

I have observed some strange behaviour when playing Breakout-v0. I already know that Gym repeats each action a random number of times, between 2 and 4. But quite frequently the action itself also appears to be chosen randomly. I have debugged both my own code and Gym, and it does not appear to be a problem in either. I have also gone over the data structure in which I store the states and actions my agent has played, and it appears to be properly aligned, so that states and actions match.

If neither my code nor Gym has a bug, then perhaps there is some randomness in the ALE Atari env? Perhaps this is intended?

When debugging I end up in ale_python_interface.py where the call to ale_lib.act(self.obj, int(action)) cannot be stepped into.

Has anyone had similar experiences? Does anyone have any information?

Example:

In the first state, the action taken is FIRE, which I suppose should translate to NOOP in Breakout? But in the state that follows immediately afterwards, it can be seen that the movement was actually RIGHT. I have numerous examples like this.

[Screenshot 1: state 1, action taken is FIRE] [Screenshot 2: the following state, movement was actually RIGHT]
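
(For reference, one way to check what each integer action actually maps to is to ask the ALE interface for its action meanings. A minimal sketch, assuming the standard Gym Atari wrapper and its get_action_meanings() method:)

```python
import gym

env = gym.make("Breakout-v0")

# For Breakout's minimal action set this typically prints something like
# ['NOOP', 'FIRE', 'RIGHT', 'LEFT'], i.e. FIRE and the movement actions
# are distinct integer actions.
print(env.unwrapped.get_action_meanings())
```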

danijar commented 7 years ago

Did you figure this out? Might be interesting to see if directly connecting to ALE shows the same behavior. Either way, people working on Atari envs would probably be interested in this.

tlbtlbtlb commented 7 years ago

The -v0 versions of the Atari games repeat the previous action with probability 0.25. The -v3 versions of the games don't do this. Both are available so that fair comparisons can be made against previous research.

danijar commented 7 years ago

Thanks a lot, this is very helpful.

Hvass-Labs commented 7 years ago

@tlbtlbtlb Please clarify. I know that Gym repeats the current action between 2 and 4 times, with the number selected randomly. But that is not the same as what you just described, repeating the previous action with probability 0.25 (they differ both in whether the current or the previous action is repeated, and in the probabilities involved). I don't recall seeing this 0.25 probability implemented in Gym's Python wrapper of ALE. Is it done inside ALE itself?

Are these things documented somewhere? They really should be.

tlbtlbtlb commented 7 years ago

@Hvass-Labs The -v0 versions, such as Breakout-v0, set repeat_action_probability=0.25. See https://github.com/openai/gym/blob/master/gym/envs/__init__.py#L319. The -v3 versions, such as Breakout-v3, set repeat_action_probability=0. See https://github.com/openai/gym/blob/master/gym/envs/__init__.py#L327. The effect of repeat_action_probability is documented at https://github.com/openai/atari-py/blob/master/doc/manual/manual.pdf section 7.5.
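
(For reference, the same setting can be read back from the registry at runtime. A minimal sketch; the exact attribute holding the registered kwargs changed between Gym releases, so treat the attribute lookup below as an assumption:)

```python
import gym

# Print the sticky-action probability each variant was registered with.
# Older Gym releases stored the registration kwargs on spec._kwargs,
# newer ones on spec.kwargs; fall back between the two.
for env_id in ["Breakout-v0", "Breakout-v3"]:
    spec = gym.spec(env_id)
    kwargs = getattr(spec, "_kwargs", None) or getattr(spec, "kwargs", {})
    print(env_id, kwargs.get("repeat_action_probability", 0.0))
```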

Hvass-Labs commented 7 years ago

Thanks for the clarification.

Unfortunately, the problem with repeated actions still appears to be there when I use the environment Breakout-v3 instead of Breakout-v0. Here is what I have done to test it using this Python Notebook:

https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/16_Reinforcement_Learning.ipynb

You can also download the checkpoint files for Breakout by uncommenting the line with rl.maybe_download_checkpoint(env_name=env_name). I have copied the checkpoint files from the folder Breakout-v0 to Breakout-v3 and changed the Notebook to use the Breakout-v3 environment. This is just a simple hack that lets us run the TensorFlow weights and variables trained on Breakout-v0 against the Breakout-v3 environment.

When I run the Notebook and look at the images of the game states and the corresponding actions that were taken, I still see the problem described above: the agent does not always move according to the given action. I have previously debugged my own code and Gym and could not find a bug that would cause this behaviour. I could not step into the ALE code, so it is possible that there is a bug there which causes the wrong actions even with the Breakout-v3 environment.


About the documentation for OpenAI Gym, I would like to encourage you to improve it significantly. For example, the differences between the v0 and v3 environments do not seem to be documented anywhere; people have to dig deep into the source code to find out what the difference is, or ask here on GitHub if it has not already been answered somewhere else. This is very problematic for the end-user because the API semantics are very unclear. I have probably wasted 3-4 weeks of research and development because there either were no docs for the Gym API, or the docs were wrong or misleading.

I get the impression that Gym is mostly maintained by one person so you might already be overloaded with work. But good documentation is such an important part of an open-source library, that you really ought to hire somebody to write it, if you don't have the time yourself, or if you feel your time is better spent elsewhere. As I recall, Elon Musk and others funded you guys with 1 billion dollars, so you should be able to afford to hire people to work on the documentation. Please inform OpenAI's managers that they need to allocate more resources to this area.

As it stands now, I consider Gym to be quite unreliable, probably in large part because the API is so poorly documented. I only used Gym in my own project because it was very easy to install with pip. I hope this comes across as constructive criticism.

tlbtlbtlb commented 7 years ago

It's likely that the Breakout game itself -- the code written in 6502 assembler in 1978 -- is not deterministic. Humans don't seem to have much trouble with it, and learning agents should aim to be equally robust.

tlbtlbtlb commented 7 years ago

In fact I verified (by adding print statements inside the ALE code) that with Breakout-v3, the actions are always sent to the game engine and never repeated, but the game itself implements a kind of momentum so that the paddle movement in a given step is affected by the last 3 or 4 joystick inputs. So the behavior in question is part of the 1978 code.

danijar commented 7 years ago

@tlbtlbtlb Thanks for the clarification! I believe this issue can be closed.

Hvass-Labs commented 7 years ago

@tlbtlbtlb Thanks very much for debugging the ALE code and explaining what might cause this behaviour in Breakout! Can I ask what you used to debug ALE? And did you use your own Gym test-code for this, or did you use the Python Notebook I linked to above?

tlbtlbtlb commented 7 years ago

I added print statements throughout this function: https://github.com/openai/atari-py/blob/master/atari_py/ale_interface/src/environment/stella_environment.cpp#L151

I just ran env.step(a); env.render() by hand for various values of a.
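
(A minimal sketch of that kind of manual test, for anyone who wants to reproduce it; the env id and the action index are assumptions, so check get_action_meanings() first. On newer Gym releases the environment is named Breakout-v4:)

```python
import gym

env = gym.make("Breakout-v3")  # "Breakout-v4" on newer Gym releases
env.reset()
print(env.unwrapped.get_action_meanings())  # e.g. ['NOOP', 'FIRE', 'RIGHT', 'LEFT']

# Send the same joystick input repeatedly and watch the rendered paddle.
# With repeat_action_probability=0 every input reaches the game engine,
# yet the paddle still carries momentum from the last few inputs.
for _ in range(20):
    obs, reward, done, info = env.step(2)  # assumed index for RIGHT
    env.render()
```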

Hvass-Labs commented 7 years ago

OK, thanks again for testing and clarifying this issue.

On a side note, the ALE / Stella C++ code you linked to is actually fairly well commented. Please consider doing the same in Gym, at least going forward when you write new code or edit old code. It makes the code far easier to understand for others, and for yourself five years from now. When I was debugging Gym there were almost no comments in the code.

DanielTakeshi commented 7 years ago

@tlbtlbtlb @Hvass-Labs I don't see a version 3 anymore in the code. I see version 4. Did version 3 become version 4?

JobJob commented 6 years ago

@DanielTakeshi Yep, it seems that was done in this PR after the ALE version was updated.