mreitschuster / RLbreakout

OpenAI's gym + stable_baselines3 + ALE -> training model to play ATARI Breakout

Discussion about devil in the detail #1

Open hh0rva1h opened 2 years ago

hh0rva1h commented 2 years ago

I want to follow up on our previous discussion at the stable_baselines3 contrib repo. I hope this is the right place; if not, please feel free to close this and we can continue the discussion elsewhere. Here https://github.com/mreitschuster/RLbreakout/tree/master/learning/4_tuning you write:

But applying the model to v5, we see it performing poorly on the stochastic environment. My guess is that the model just memorizes 30 different sequences.

Just to be sure: did you make a fair comparison by setting full_action_space=False? Of course it can be debated what a fair comparison should look like, but I would also make sure the frameskip matches between your trained v4 agent and the v5 environment (according to https://www.gymlibrary.ml/environments/atari/, v5 sets frameskip=5, while the NoFrameskip version of v4 sets frameskip=1). Let me quote the following:

If frameskip is an integer, frame skipping is deterministic, and in each step the action is repeated frameskip many times

My definition of a fair comparison would be testing the NoFrameskip-v4 agent in a v5 environment with an identical action space and an identical deterministic frameskip value (this would, imo, make it a comparison between non-sticky and sticky actions, and thus test whether the agent truly memorizes something or not).
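To illustrate, a matched evaluation env could look roughly like this (just a sketch, assuming gym with ale-py registered and the kwargs documented for the ALE environments):

```python
import gym

# Agent trained on BreakoutNoFrameskip-v4: reduced action space, frameskip=1.
# For a fair test, v5 should match everything except the sticky actions.
eval_env = gym.make(
    "ALE/Breakout-v5",
    full_action_space=False,         # same 4-action space as the v4 training env
    frameskip=1,                     # match NoFrameskip-v4's deterministic frameskip
    repeat_action_probability=0.25,  # the only intended difference: sticky actions
)
```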

mreitschuster commented 2 years ago

Thank you for pointing both pieces out. I haven't adjusted the action space. I will also need to look into the deterministic frameskip. I ran my model with frameskip via the wrapper, but didn't consider the frameskip built into the environment.

hh0rva1h commented 2 years ago

I see, thanks for the quick reply. I guess the action space part is really destroying the comparison, since the full action space enumeration is the following:

| Num | Action |
| --- | --- |
| 0 | NOOP |
| 1 | FIRE |
| 2 | UP |
| 3 | RIGHT |
| 4 | LEFT |
| 5 | DOWN |
| 6 | UPRIGHT |
| 7 | UPLEFT |
| 8 | DOWNRIGHT |
| 9 | DOWNLEFT |
| 10 | UPFIRE |
| 11 | RIGHTFIRE |
| 12 | LEFTFIRE |
| 13 | DOWNFIRE |
| 14 | UPRIGHTFIRE |
| 15 | UPLEFTFIRE |
| 16 | DOWNRIGHTFIRE |
| 17 | DOWNLEFTFIRE |

while the reduced action space of Breakout is

| Num | Action |
| --- | --- |
| 0 | NOOP |
| 1 | FIRE |
| 2 | RIGHT |
| 3 | LEFT |

So given that the agent's RIGHT turns into an action that does nothing (UP) and its LEFT turns into RIGHT, I would expect the agent to not work at all in the v5 scenario.
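A quick way to check the mapping (a sketch, using the get_action_meanings() helper the ALE envs expose):

```python
import gym

full = gym.make("ALE/Breakout-v5", full_action_space=True)
reduced = gym.make("ALE/Breakout-v5", full_action_space=False)

print(full.unwrapped.get_action_meanings())
# ['NOOP', 'FIRE', 'UP', 'RIGHT', 'LEFT', 'DOWN', ...]
print(reduced.unwrapped.get_action_meanings())
# ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
```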

Looking forward to your updates :-).

mreitschuster commented 2 years ago

thank you! For now I adjusted the code (inside the objective function of 4.0_wrapperOptuna_PPO.py) to

        if env_params['env_id']=='Breakout-v4':
            # stochastic frameskip, no sticky actions
            env_kwargs={'full_action_space'         : False,
                        'repeat_action_probability' : 0.,
                        'frameskip'                 : (2,5,)
                        }

        elif env_params['env_id']=='BreakoutNoFrameskip-v4':
            # deterministic frameskip, no sticky actions
            env_kwargs={'full_action_space'         : False,
                        'repeat_action_probability' : 0.,
                        'frameskip'                 : 2
                        }

        elif env_params['env_id']=='ALE/Breakout-v5':
            # deterministic frameskip, sticky actions
            env_kwargs={'full_action_space'         : False,
                        'repeat_action_probability' : 0.25,
                        'frameskip'                 : 2
                        }

Reasoning: we want the same deterministic frameskip for comparability, so for BreakoutNoFrameskip-v4 and v5 I chose the same number. For v4 I keep it as (2,5,), as we need the stochasticity to test the hypothesis ("deterministically trained does not do well in a stochastic env"), and I argue it is better to have the minimum of (2,5,) aligned with the others, because going from fast to slow should be easier than going from slow to fast.
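For context, a sketch of how the kwargs above could then be passed on when building the vectorized envs (variable names are illustrative, not the original code):

```python
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack

venv = make_vec_env(
    env_params['env_id'],   # e.g. 'ALE/Breakout-v5'
    n_envs=8,
    env_kwargs=env_kwargs,  # the dict chosen in the if/elif above
)
venv = VecFrameStack(venv, n_stack=4)  # frame stacking as used in the trials below
```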

hh0rva1h commented 2 years ago

Tbh, I'm not really sure about the frameskip=2 part, especially with regard to the frame skipping happening in the wrapper.

Deepmind in their publications does frame skipping (not to be confused with the stochastic frameskipping approach) with k=4, repeating the same action for all skipped frames. To achieve this, I would set frameskip=1 in the environment and configure the Atari wrapper with k=4, or the other way around (see the next paragraph). See also the assertion in the upstream openai atari wrapper: https://github.com/openai/gym/blob/e9d2c41f2b233864c59cd9f2d9240f40e14da8b9/gym/wrappers/atari_preprocessing.py#L65

So having frameskip not equal to one in both the wrapper and the environment itself seems to make little sense. I guess achieving the Deepmind setup should also be possible with setting frameskip=4 in the environment and frameskip=1 in the wrapper, would we agree on that?

So if you set frameskip=1 in the wrapper, I would go for frameskip=4 in the environment, and if you choose to handle frameskip=4 in the wrapper, I would go for frameskip=1 in the environment (however, then you might run into trouble with the stochastic frameskipping, so I guess the easiest is to set frameskip=1 in the wrapper and default to 4 in the env).
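A minimal sketch of the two setups, assuming gym's AtariPreprocessing wrapper (the linked assertion is why the wrapper refuses built-in skipping when its own frame_skip > 1):

```python
import gym
from gym.wrappers import AtariPreprocessing

# Option A: skipping handled by the wrapper, env stays at frameskip=1
env_a = gym.make("BreakoutNoFrameskip-v4")
env_a = AtariPreprocessing(env_a, frame_skip=4)

# Option B: skipping handled by the environment, wrapper does no extra skipping
# (note: v5 also defaults to sticky actions; pass repeat_action_probability=0. to disable)
env_b = gym.make("ALE/Breakout-v5", frameskip=4)
env_b = AtariPreprocessing(env_b, frame_skip=1)
```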

hh0rva1h commented 2 years ago

I think I missed your "going from fast to slow" argument. Interesting thought; I think I get your intuition, and it would be interesting to see how it performs. But I guess it could also not matter much: if an action is perfectly fine when repeated 2 steps to get near the ball, followed by 2 NOOPs, it could be really counterproductive when it unexpectedly gets executed 4 times. And when the agent is used to actions having a broader impact, because they get executed 4 times during training, and then an action unexpectedly only gets executed 2 times, the situation is quite similar, I guess. I would think it works best in a (3,5) stochastic setting if trained with frameskip=4: sometimes one repetition is missing or one too many, but overall the agent should have approximately chosen the right action.

I'm wondering, however, how repeat_action_probability=0.25 works with regard to the frameskip argument. Does it decide to repeat the previous action frameskip times, or does it decide that individually per frame? That should make quite a difference. I would think the former is the case.

mreitschuster commented 2 years ago

Thank you! I adjusted the code to get rid of the redundant frameskips - I only use the one from the environment now. The reason is that I have to define a stochastic frameskip (for the run in v4), and that is only available in the environment, not the wrapper.

I defined the stochastic frameskip as (N-1,N+1), as you proposed with N being the deterministic frameskip of the alternative environment.

Just restarted the training. It should take a day. Once I confirm it is running properly, I'll upload the code.

mreitschuster commented 2 years ago

I have now committed a new version of the code incorporating your advice. It will train once for each environment and, for each training run, spit out an arbitrary number of eval environments (I had to change the on_step() of the TrialEvalCallback a bit). It is now at 700 steps with 400 fps (likely dropping to 350 when eval kicks in). I am training the 3 training envs in parallel (I like optuna for making that really easy: just execute the code 3 times and it will pick up the next waiting trial in the db). So it should be done in 8h.
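A rough sketch of that parallel-workers pattern (study and storage names are illustrative; `objective` stands for the function in 4.0_wrapperOptuna_PPO.py):

```python
import optuna

study = optuna.create_study(
    study_name="breakout_env_comparison",
    storage="sqlite:///optuna.db",  # shared database
    load_if_exists=True,            # every copy of the script attaches to the same study
    direction="maximize",
)

# Enqueue one trial per training environment (done once); any worker that calls
# study.optimize() afterwards picks up the next waiting trial from the db.
for env_id in ["Breakout-v4", "BreakoutNoFrameskip-v4", "ALE/Breakout-v5"]:
    study.enqueue_trial({"train_env_id": env_id})

study.optimize(objective, n_trials=1)
```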

mreitschuster commented 2 years ago

[image: TensorBoard graphs of the three trials, including the eval/*/mean_reward curves]

In my view, graphs 3-5 in the eval section are the relevant ones - one callback for each eval environment: eval/v4/mean_reward, eval/v4NoFS/mean_reward, eval/v5/mean_reward. I am not sure what the first two graphs are showing, as all 3 callbacks write to (and overwrite each other's) numbers under the normal eval/mean_reward keyword - so I assume the last callback to write has its number in there. A sketch of the per-environment logging idea follows.
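Not the original code, just a sketch of how each eval env could log under its own key via an EvalCallback subclass (the `name` argument is an illustrative addition):

```python
from stable_baselines3.common.callbacks import EvalCallback


class NamedEvalCallback(EvalCallback):
    def __init__(self, eval_env, name, **kwargs):
        super().__init__(eval_env, **kwargs)
        self.name = name

    def _on_step(self) -> bool:
        continue_training = super()._on_step()
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # re-log the latest evaluation under a per-environment key
            self.logger.record(f"eval/{self.name}/mean_reward", self.last_mean_reward)
        return continue_training
```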

Orange FrozenTrial(number=0, ... , params={'train_env_id': 'Breakout-v4', 'frame_stack': 4, 'n_envs': 8, 'frameskip_env': 4} ...)

Blue FrozenTrial(number=1, ... , params={'train_env_id': 'BreakoutNoFrameskip-v4', 'frame_stack': 4, 'n_envs': 8, 'frameskip_env': 4} ...)

Red FrozenTrial(number=2, ... , params={'train_env_id': 'ALE/Breakout-v5', 'frame_stack': 4, 'n_envs': 8, 'frameskip_env': 4}...)

So my conclusion would still be: training on a stochastic environment is slower, but generalizes better to other environments, whereas training on NoFrameskip-v4 - a purely deterministic environment - is fast but transfers poorly to a stochastic environment.

There was still an error at the end, as I closed the callback instead of the environments in the code. It prohibited any further trials (e.g. with other frame_skip numbers), but shouldn't affect the first 3 trials.

I have amended the readme. Would you mind having a look?

hh0rva1h commented 2 years ago

Thanks for the comparison, looks good to me. I only skimmed over the code and did not have time yet to look into it in detail. It makes sense that the deterministically trained one does not generalize well, since it does not expect different behavior at all, while the others do. Interestingly, the frameskip-trained one generalized better to the sticky-action setting than the other way around. It would definitely be interesting to see this comparison with more algorithms and longer runtime (~50 million steps). I was still hoping that the deterministically trained one would do a little bit better in the other environments.

hh0rva1h commented 2 years ago

Also in the publication introducing the sticky actions protocol they mention:

We observed decreased performance for DQN and Sarsa(λ) + Blob-PROST only in three games: Breakout, Gopher, and Pong.

Judging from the scores in table 8, sticky actions seem to make Breakout considerably harder (their DQN score peaks at 75.1 after 50 million steps and regresses afterwards), while DQN should be around a stable 300 when trained with the non-stochastic protocol.

Now I'm asking myself whether memorization is really at fault here, or whether the stochasticity ruins even good policies that do not suffer from memorization. That could well be the case, given that Pong and Breakout are pretty similar in the sense that the agent needs to move a paddle to catch a ball. Stickiness could really ruin getting the paddle to the right spot: if at a certain point you need to go left but, due to the stickiness, the paddle moves right, you can't really correct that in the next steps. And 25% stickiness seems to me like it could be too much for this setting. I guess a way to find out would be to train longer, also with different algorithms, and see whether some of them manage to reach higher scores, or whether the stickiness introduces a ceiling that is hard to overcome.

mreitschuster commented 2 years ago

Your argument could account for the difference between Blue and Red, as well as Orange outperforming Red (a bit). But the comparison of Blue and Orange shouldn't suffer from this problem.

Based on some previous runs, which I am doing at the moment for the final video and where I wanted to look at "when does it converge", I trained for 1e8 steps and am still nowhere near convergence. I estimate the training phase would need at least 1e9 steps. And I am running into issues with the linear learning-rate decrease. I suspect that an exponential decrease might be more appropriate (and it has the advantage of being scale invariant, so extending a training run ex post is smoother).
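For illustration, an exponential schedule could be passed to stable_baselines3 as a callable of progress_remaining, which goes from 1 to 0 over training (the learning-rate values are just placeholders):

```python
def exponential_schedule(lr_start: float, lr_end: float):
    def schedule(progress_remaining: float) -> float:
        # progress_remaining = 1 at the start of training, 0 at the end
        return lr_start * (lr_end / lr_start) ** (1.0 - progress_remaining)
    return schedule

# e.g. PPO(..., learning_rate=exponential_schedule(2.5e-4, 2.5e-6))
```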

Of course, training long enough, even Blue might generalize better - if we follow Tishby's information bottleneck explanation, arguing that generalization in DL happens through stochastic gradient descent and takes much longer than the training needed to score high. That might be around 1e10 steps.

Already running 1e8 steps is a strain on the resources available to me. I think to investigate this more thoroughly we would need to move to a simpler environment. I was thinking of removing the whole "learning from pixels" aspect in a wrapper and manually transforming the picture into a more computer-friendly representation (coordinates + velocity of the ball, paddle position, a matrix for the removed blocks), hoping that would speed up the training by making the learning of the representation easier. The aimbot wrapper - which simplifies the representation - already helped training speed quite significantly. A rough sketch of the idea is below.
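Purely hypothetical sketch of that idea: an ObservationWrapper that replaces the frame with hand-extracted features. The row indices and colour threshold below are made up for illustration and would need to be checked against the actual Breakout frame layout; the block matrix is omitted for brevity.

```python
import gym
import numpy as np


class CompactBreakoutObs(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        # features: ball (x, y), ball velocity (dx, dy), paddle x
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(5,), dtype=np.float32)
        self._prev_ball = np.zeros(2, dtype=np.float32)

    def observation(self, frame):
        gray = frame.mean(axis=2)            # naive grayscale of the RGB frame
        play = gray[93:188]                  # assumed play-field rows below the blocks
        ys, xs = np.nonzero(play > 50)       # assumed "bright pixel" threshold
        ball = (np.array([xs.mean(), ys.mean()], dtype=np.float32)
                if len(xs) else self._prev_ball)
        vel = ball - self._prev_ball
        self._prev_ball = ball
        paddle_row = gray[190]               # assumed paddle row
        paddle_x = float(np.argmax(paddle_row > 50))
        return np.concatenate([ball, vel, [paddle_x]]).astype(np.float32)
```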

But on the other hand, this might just lead further down the rabbit hole, as it might turn out that the problem then becomes trivial.