proroklab / popgym

Partially Observable Process Gym
https://popgym.readthedocs.io/en/latest/
MIT License

Suggestions for environments based on other POMDP types #4

Open smorad opened 2 years ago

smorad commented 2 years ago

Currently, all our environments could be classified as overcomplete POMDPs, where the number of unique latent states is greater than the number of unique observations. We are looking for environment suggestions based on other types of POMDPs, such as undercomplete POMDPs, weakly revealing POMDPs, latent MDPs, or $\gamma$-observable POMDPs.
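
(Roughly, writing $\mathcal{S}$ for the latent state space and $\mathcal{O}$ for the observation space: overcomplete means $|\mathcal{S}| > |\mathcal{O}|$, while undercomplete means $|\mathcal{O}| \geq |\mathcal{S}|$.)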

If you have any environment suggestions, please post them here!

ashok-arora commented 4 months ago

@smorad Would it make sense to add the benchmarks from the DTQN paper?

smorad commented 4 months ago

The gridverse environments already exist as an external library, and I would prefer not to import them since they bring in additional dependencies. We already have a form of memory cards in our concentration environment. I guess this would leave the carflag and heavenhell environments. Would you be willing to implement these? I can help guide you.

ashok-arora commented 4 months ago

Sure, I will try my best to implement them if you could give me some pointers on how to get started.

smorad commented 4 months ago

Sure! Basically, just implement each environment as a subclass of POPGymEnv. This is just a gymnasium environment with an additional get_state method, which should return the underlying Markov state (e.g., the position of the agent and the position of the goal). Like any gymnasium environment, you'd need to implement the reset and step methods, as well as define the observation_space and action_space.

Maybe we can start with carflag? Here is the description from the DTQN paper:

Car Flag tasks a car with driving across a 1D line to the correct flag. The car must first drive to the oracle flag and then to the correct endpoint. The agent observation is a vector of 3 floats, including its position on the line, its velocity at each timestep, and, when it is at the oracle flag, it is also informed of the goal flag’s location. The agent’s action alters its velocity; it may accelerate left, perform a no-op (i.e. maintain current velocity), or accelerate right. The agent receives a reward of 1 for reaching the goal flag, a reward of -1 for reaching the incorrect flag, and 0 otherwise.

So right off the bat, we know the observation and action space, as well as the reward function. You can just create the environment in popgym/popgym/envs/carflag.py. In the envs directory, there are a ton of other environments that you can look at for inspiration. For example, here is MineSweeper.

The first few lines might look like

from popgym.core.env import POPGymEnv
import gymnasium as gym

class CarFlag(POPGymEnv):
  def __init__(self):
    self.observation_space = gym.spaces.Box(shape=(3,), ...)
    self.state_space = gym.spaces.Box(shape=(3,), ...) # Underlying Markov state
    self.action_space = ...

  def step(...):
    self.car_position = self.car_position + self.velocity
    ...

  def reward(...): # Not necessary, but a useful helper function
    ...

  def reset(...):
    self.goal_position = ...
    self.car_position = ...
    self.oracle_position = ...
    self.velocity = ...
    ...

  def get_state(...):
    # Return the position of the car, oracle, and goal
    ...
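
And if it helps, here is a slightly fuller (untested) sketch along the same lines. The DTQN description pins down the observation, actions, and reward, but every numeric constant below (track bounds, acceleration, oracle location, episode length) is a placeholder I made up, so check it against the paper's reference implementation:

import gymnasium as gym
import numpy as np

from popgym.core.env import POPGymEnv


class CarFlag(POPGymEnv):
  """Rough sketch of the DTQN Car Flag task. Numeric constants are placeholders."""

  def __init__(self, max_steps: int = 200):
    self.bound = 1.0             # endpoint flags at -1 and +1 (assumed)
    self.accel = 0.05            # per-step acceleration (assumed)
    self.oracle_position = 0.5   # oracle flag location (assumed)
    self.max_steps = max_steps

    # Observation: (position, velocity, goal hint); the hint is 0 away from the oracle
    high = np.array([self.bound, np.inf, self.bound], dtype=np.float32)
    self.observation_space = gym.spaces.Box(-high, high, dtype=np.float32)
    # Underlying Markov state: (position, velocity, goal position)
    self.state_space = gym.spaces.Box(-high, high, dtype=np.float32)
    # 0 = accelerate left, 1 = no-op, 2 = accelerate right
    self.action_space = gym.spaces.Discrete(3)

  def _obs(self):
    # The goal flag's location is only revealed near the oracle flag
    near_oracle = abs(self.car_position - self.oracle_position) < 0.1
    hint = self.goal_position if near_oracle else 0.0
    return np.array([self.car_position, self.velocity, hint], dtype=np.float32)

  def get_state(self):
    # Full Markov state: car position, velocity, and the goal flag's position
    return np.array(
      [self.car_position, self.velocity, self.goal_position], dtype=np.float32
    )

  def reset(self, *, seed=None, options=None):
    super().reset(seed=seed)
    self.goal_position = float(self.np_random.choice([-self.bound, self.bound]))
    self.car_position = 0.0
    self.velocity = 0.0
    self.steps = 0
    return self._obs(), {}

  def step(self, action):
    self.velocity += (action - 1) * self.accel  # left / no-op / right
    self.car_position = float(
      np.clip(self.car_position + self.velocity, -self.bound, self.bound)
    )
    self.steps += 1

    terminated = abs(self.car_position) >= self.bound  # reached an endpoint flag
    reward = 0.0
    if terminated:
      correct = np.sign(self.car_position) == np.sign(self.goal_position)
      reward = 1.0 if correct else -1.0
    truncated = self.steps >= self.max_steps
    return self._obs(), reward, terminated, truncated, {}

A quick smoke test would be something like env = CarFlag(); obs, info = env.reset(seed=0), then stepping with random actions to make sure the spaces and return values line up.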
ashok-arora commented 4 months ago

Thank you so much for the detailed response. I'll fork the repo and send in a PR with the changes.

ashok-arora commented 4 months ago

I have added the code in #34. Could you please review it and give feedback for improvement? I have tried to keep the code style similar to the minesweeper.py file.