SAC model need box action space

glmcdona / LuxPythonEnvGym

Matching python environment code for Lux AI 2021 Kaggle competition, and a gym interface for RL models.

MIT License


SAC model need box action space #84

Closed hokhay closed 2 years ago

hokhay commented 2 years ago

I am trying to use the SAC algorithm for training. When I set up the SAC model, I got an error and realized that it requires a "Box" action space instead of a discrete action space.

I saw a comment on Kaggle saying that this is supposed to run any of A2C, DDPG, DQN, HER, PPO, SAC, or TD3 right out of the box, so am I missing something important here?

Thanks, Jason

nosound2 commented 2 years ago

I tried to use SAC and hit the same problem. I am not an expert, so I don't know if there is a workaround, but on the surface it seems that SAC cannot be used here.

glmcdona commented 2 years ago

This is a useful table: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html

See also the SAC documentation: https://stable-baselines3.readthedocs.io/en/master/modules/sac.html

It seems the Stable-Baselines3 implementation of SAC only supports Box (continuous) action spaces. Unfortunately, I don't think converting the discrete action space to a continuous one would work very well, since a continuous action space would output weights for every action option. Hope this helps! My comment about supporting all of those models out of the box was incorrect, sorry!
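One way to read "weights for every action option" is a Box with one entry per discrete action, from which the environment picks the entry with the largest weight. The following is only a minimal sketch of that decoding step under this assumption; `pick_action` is a hypothetical helper, not code from this repository or from Stable-Baselines3.

```python
def pick_action(weights):
    """Return the index of the largest weight (the first one on ties).

    `weights` stands in for the vector a continuous policy would emit,
    with one entry per discrete action. This layout is an assumption
    made for illustration.
    """
    if len(weights) == 0:
        raise ValueError("weights must contain at least one entry")
    return max(range(len(weights)), key=lambda i: weights[i])


# Example: three candidate actions, the second has the largest weight.
chosen = pick_action([0.1, 0.7, 0.2])
```

The policy still has to learn one continuous output per action, which is the overhead the comment above is worried about.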

royerk commented 2 years ago

To build on the previous comments: in agent_policy.py there is self.action_space = spaces.Discrete(...), whereas SAC requires spaces.Box(...) for the action space.

A continuous action space is made for commands such as steering_angle or gas_pedal, where a value in [0, 1] can be mapped directly to the command to apply.

In this game the actions are discrete (up, down, left, right, build city/spawn worker, etc.). To use SAC one would have to 1) change the action space to a Box and 2) create a wrapper that transforms a value in [0, 1] into an action. This seems like a hassle but not impossible. For example, "do nothing" and the 4 directions could be inferred from a single value: a value in [0, 0.2) is 'go up', a value in [0.2, 0.4) is 'go right', etc.
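The binning idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ACTIONS` and `continuous_to_discrete` are hypothetical names, the order of the last three actions is made up, and the real action list lives in agent_policy.py.

```python
# Hypothetical action list; only the first two bins follow the example above,
# the rest of the ordering is an assumption.
ACTIONS = ["go up", "go right", "go down", "go left", "do nothing"]


def continuous_to_discrete(value, n_actions):
    """Map a value in [0, 1] to an action index in [0, n_actions - 1].

    Each action owns an equal-width bin, e.g. with 5 actions:
    [0, 0.2) -> 0, [0.2, 0.4) -> 1, and so on.
    """
    # Clamp first, since a policy may emit values slightly out of bounds.
    clamped = min(max(float(value), 0.0), 1.0)
    # A value of exactly 1.0 would index one past the end, so cap it.
    return min(int(clamped * n_actions), n_actions - 1)


# Example: 0.25 falls into the second bin.
action_name = ACTIONS[continuous_to_discrete(0.25, len(ACTIONS))]
```

A wrapper around the environment would call something like this on the value SAC produces before passing the resulting index to the existing discrete action handling. Note that neighbouring bins have no natural relationship here ('go up' is not "close to" 'go right'), which is one reason this encoding may train poorly.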