hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

check_env warning - clarification for custom environment #1111

Closed amjass12 closed 3 years ago

amjass12 commented 3 years ago

Hi,

I have created a custom environment in which an agent in a gridworld needs to find optimal places to place 0's or 1's, after which it receives a relevant reward. For learning purposes, I am using DQN and encoding the agent's position through a CNN (the agent's position is a one-hot NxN matrix of 0's with a 1 where the agent is). I am running check_env on my custom environment and am getting a warning, which has left me confused about the correct implementation of my observation space and state space:

the observation space is:

self.observation_space = spaces.Box(low=0,high=1, shape=(38,38,1), dtype='uint8')

The actual grid itself is a pandas dataframe of shape (38,38). Each step operates on the pandas dataframe, where ones and zeros are added in place, and the state is then converted to the (38,38,1) shape as follows:

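def grab_state(self):
    # one-hot matrix marking the agent's position on the grid
    matrix_pos = np.zeros((len(self.adjacency), len(self.adjacency)))
    matrix_pos[self.agent_pos[0], self.agent_pos[1]] = 1
    matrix_pos = matrix_pos.reshape(38, 38, 1)  # add the channel dimension
    return matrix_pos  # needed so step() can assign the result to self.state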

It is returned in the step function as follows (only a snippet of the bottom part):

self.state = self.grab_state()
return self.state, reward, done, {}

check_env does not throw any errors with the above implementation. However, my understanding is that the observation space should be: self.observation_space = spaces.Box(low=0,high=1, shape=(38,38))

if the dataframe itself is of shape (38,38). However, when I specify the observation space like this (and leave the state as (38,38) in grab_state()), check_env tells me it is unconventional, as it should be either (38,38,1) or a 1D array.

My initial thought was that it should be (38,38), with the state then reshaped to the correct (1,38,38,1) for the CNN in either the DQN script or the training loop.

Now I am unsure about the correct implementation of the observation space and the grid in the environment. Does the (38,38,1) shape throw the agent off when it is actually acting on a (38,38) dataframe? I have found that the DQN doesn't learn even in this basic small gridworld, and I suspect the problem may be coming from the environment itself.

Thank you

Miffyli commented 3 years ago

A 3D input (38, 38, 1) would be processed by a CNN, and a 1D input would be processed by an MLP. stable-baselines does not support other input shapes (they will be flattened into 1D). In your case you are doing it correctly by using the (38, 38, 1) shape for the CNN, where "1" stands for the number of channels (this is required because convolutional layers expect three-dimensional images).
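As a minimal sketch of the (38, 38, 1) layout described above, the observation can be built directly with a trailing channel axis. It uses only NumPy; the helper name `one_hot_observation` and its parameters are hypothetical and are not part of stable-baselines or of the original environment.

```python
import numpy as np

GRID_SIZE = 38  # grid side length taken from the issue


def one_hot_observation(agent_pos, size=GRID_SIZE):
    """Return a (size, size, 1) uint8 array with a 1 at the agent's cell.

    Hypothetical helper, for illustration only.
    """
    row, col = agent_pos
    # Axes are (height, width, channels); the single channel is the last axis.
    obs = np.zeros((size, size, 1), dtype=np.uint8)
    obs[row, col, 0] = 1
    return obs


obs = one_hot_observation((3, 5))
print(obs.shape)  # (38, 38, 1)
print(obs.dtype)  # uint8
```

Creating the array as uint8 keeps it consistent with the dtype declared in the Box space; note that np.zeros without a dtype argument returns float64.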

One spot for improvement is changing the ones to 255. By default stable-baselines normalizes images by dividing them by 255, so in your case the 1s become 1/255. This might not make a big difference, but it could also be a reason why your network has a hard time learning. You would need to set high=255 in the observation space as well.
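The effect of the division by 255 can be checked with plain NumPy. The snippet below only illustrates the arithmetic; it is not stable-baselines' actual preprocessing code, and `scale_image` is a hypothetical stand-in.

```python
import numpy as np


def scale_image(obs):
    """Hypothetical stand-in for image scaling: cast to float32, divide by 255."""
    return obs.astype(np.float32) / 255.0


grid = np.zeros((38, 38, 1), dtype=np.uint8)
grid[3, 5, 0] = 1  # agent marked with 1

scaled_ones = scale_image(grid)       # agent cell becomes 1/255
scaled_255 = scale_image(grid * 255)  # agent marked with 255 instead of 1

print(scaled_ones.max())
print(scaled_255.max())
```

With the agent marked as 1, the largest input the network sees is roughly 0.0039; with 255 it is 1.0, which uses the full normalized range.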

PS: I recommend moving to stable-baselines3, as its code is more refined and more actively developed (and it works on PyTorch).

amjass12 commented 3 years ago

Hi @Miffyli ,

Thank you for the quick reply and for clarifying. Just as a follow-up question, to be clear: it doesn't matter that the observation space is (38,38,1) but the dataframe in which the agent makes a move is (38,38)? With this mismatch, is the agent still seeing the whole gridworld?

I had high=1 because 1 is the highest value; it's literally a matrix of only 0's and 1's...

Thanks for the recommendation also! I am mainly using this for the check_env functionality, as a check for my environment setup...

Miffyli commented 3 years ago

> Thank you for the quick reply and for clarifying. Just as a follow-up question, to be clear: it doesn't matter that the observation space is (38,38,1) but the dataframe in which the agent makes a move is (38,38)? With this mismatch, is the agent still seeing the whole gridworld?

Yes. There is no mismatch. (38, 38, 1) and (38, 38) represent the same information, the former just has a dummy dimension.
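A quick NumPy check illustrates why the two shapes carry the same information. This is a minimal sketch with made-up variable names; it also shows the (1, 38, 38, 1) batch shape mentioned earlier in the thread.

```python
import numpy as np

grid_2d = np.zeros((38, 38), dtype=np.uint8)
grid_2d[3, 5] = 1  # agent position

# Add a trailing channel axis: (38, 38) -> (38, 38, 1)
grid_3d = grid_2d[..., np.newaxis]

# Add a leading batch axis: (38, 38, 1) -> (1, 38, 38, 1)
batch = grid_3d[np.newaxis, ...]

# Dropping the dummy axis recovers the original grid exactly.
restored = grid_3d.squeeze(axis=-1)
same = np.array_equal(grid_2d, restored)

print(grid_3d.shape, batch.shape, same)
```

Both extra axes have length 1, so no values are added, removed, or reordered.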

> I had high=1 because 1 is the highest value; it's literally a matrix of only 0's and 1's...

Yes, but internally stable-baselines (and stable-baselines3) normalize the values by dividing them by 255. You might want to consider changing 1s to 255s when using stable-baselines for this reason.

You may close this issue if you have no further questions.

amjass12 commented 3 years ago

Thank you very much @Miffyli for taking the time to explain. All is clear!