tomsilver / pddlgym

Convert a PDDL domain into an OpenAI Gym environment.
MIT License

Interoperability with Stable Baselines #58

Closed: meneguzzi closed this issue 3 years ago

meneguzzi commented 3 years ago

I've recently tried to run PDDLGym with some of the RL Baselines maintained in this repo: https://github.com/DLR-RM/stable-baselines3

However, on my first try at training PDDLEnvBlocks-v0, I get an error as the baselines try to wrap the environment somehow. I have to confess I'm very new to Gym and its internals, so it might just be my own lack of knowledge in configuring things. In any event, the code I tried to use was a trivial modification of the tutorial in their repository:

from stable_baselines3 import PPO
import gym
import pddlgym

env = pddlgym.make("PDDLEnvBlocks-v0")

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()

env.close()

This generates an error (eventually) in stable_baselines3/common/vec_env/dummy_vec_env.py, line 31:

    self.buf_obs = OrderedDict([(k, np.zeros((self.num_envs,) + tuple(shapes[k]), dtype=dtypes[k])) for k in self.keys])

TypeError: 'NoneType' object is not iterable

I will try to investigate this further, but I thought this issue report would be useful to others trying to use the same baselines with PDDLGym.

ronuchit commented 3 years ago

Hi, thanks for reporting this. Unfortunately, PDDLGym is not out-of-the-box compatible with reinforcement learning baselines such as stable_baselines (sorry, we can see why this would be confusing!). The issue is that PDDLGym operates on a logical representation of states and actions, as you can see by printing out some of the states in PDDLGym, whereas stable_baselines and its cousins assume vectorizable state/action spaces and use neural networks for all function approximation. If you'd like to use PDDLGym with stable_baselines, what you'll want to do is write a wrapper around a PDDLGym environment that converts from our literal-based state representation to a numeric vector representation. However, the details of how to do this are unclear in general and would probably need to be specific to your application. If you make any progress on this front, though, do keep us updated!
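As a rough illustration of the kind of conversion I mean (just a sketch, not anything PDDLGym provides; it assumes you have already enumerated the ground literals of one fixed problem):

import numpy as np

def build_literal_index(ground_literals):
    # Fix an ordering over the problem's ground literals once, up front.
    return {lit: i for i, lit in enumerate(sorted(ground_literals, key=str))}

def literals_to_vector(state_literals, literal_index):
    # Encode the set of currently-true ground literals as a flat binary vector.
    vec = np.zeros(len(literal_index), dtype=np.float32)
    for lit in state_literals:
        vec[literal_index[lit]] = 1.0
    return vec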

I'll close this particular issue, but feel free to open a new one or email us directly. Sorry about this!

meneguzzi commented 3 years ago

Thank you very much for responding so quickly!

I imagined this could be the problem. Since I have your attention: from reading your ICAPS paper, I'm assuming you ground everything, but I could not find whether you store all the ground literals you reach somewhere. My naive initial idea for a wrapper (one that works for a single problem) would be to get the reachable set of literals using a relaxed planning graph, build a vector of that size, and use some sort of unique hashing function to map literals in the state to indexes in a bit vector. If that makes sense to you, and if I may pick your brain for a moment, can you help me locate which classes I should take a closer look at to implement that? If you don't have time for that it's fine, and I can go about looking at it myself, but I imagine this could be a trivial question for you.

gehring commented 3 years ago

If PDDLGym doesn't have any code you can use directly, you can probably adapt my implementation from here without too much effort. You'd probably need to change details like which attribute/method is used to check arity, since the base classes differ from PDDLGym's, but the core logic you need to compute indices is there (if I understood what you are asking).

meneguzzi commented 3 years ago

@gehring thanks for pitching in! Just so I understand you better, are you suggesting I can adapt your code in array.py to build the wrapper, or replace PDDLGym altogether with your own environment class https://github.com/gehring/pddlenv/blob/8997cf68d70965e83dc9f88165d1ed605ae4cb78/pddlenv/env.py#L110 ?

gehring commented 3 years ago

I'm suggesting you adapt the code in array/indexing.py (and array/encoding.py, to see how it's used to make binary arrays). PDDLGym will have much better support and is under active development. I'm still maintaining PDDLEnv, but I don't have any plans to keep extending/improving it (also, it implements DeepMind's dm_env API instead of the gym API, if that matters to you). PDDLEnv also only supports simple STRIPS-like problems, while PDDLGym has more features in general.

Though you are welcome to use PDDLEnv (I use it still and it's reasonably well tested for my use cases). I just think building off of PDDLGym is a safer bet in the long term unless you're willing to extend PDDLEnv yourself if you run into any limitations.

meneguzzi commented 3 years ago

Thanks for the clarification!

ronuchit commented 3 years ago

Great, yes, this is what I had in mind with the vectorization I mentioned above. This sort of state representation should be compatible with Gym. In your wrapper class, you'll also need to define some of the fields that the Gym API requires, such as self.observation_space and self.action_space. You'll have to decide on the best representation of actions (https://gym.openai.com/docs/#spaces), whether it be a Box or a Discrete, or a combination.
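For instance, with the bit-vector encoding sketched above, the space declarations could look something like this (the counts are hypothetical and would come from the fixed problem you wrap):

from gym import spaces
import numpy as np

# Hypothetical sizes for one fixed problem.
num_ground_literals = 120
num_ground_actions = 30

# A flat bit-vector observation can be declared either way:
observation_space = spaces.MultiBinary(num_ground_literals)
# or: spaces.Box(low=0, high=1, shape=(num_ground_literals,), dtype=np.int8)

# One discrete choice per ground action literal:
action_space = spaces.Discrete(num_ground_actions)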

meneguzzi commented 3 years ago

So, just to give a report back, I've tried adapting @ronuchit 's code, and while I can see the rationale behind a lot of it, some of the details of your indexing scheme escape me.

The data structures for Literals, Predicates, arities, and whatnot were different enough that I'm not entirely sure I'm mapping things correctly (though it seems to work). Was the way you index things intended to allow generalization across different problems of the same domain?

The naive approach I used here was to wrap the PDDLEnv in another class and create subclasses for the action and state spaces as, respectively, Discrete and MultiBinary (it turns out the baselines have a bug with subclassing like I did, but I used a workaround). For the state space, I simply create an ordered list from PDDLProblem.all_grounded_literals and a dictionary of indexes, and encode the observations in step and reset as flat binary vectors. Similarly, I create an ordered list with all_ground_literals for the action space and use the indexes as a discrete action space. This seems to work, to the extent that the invocations of actions and states are consistent between resets of the same problem environment, but I have a nagging suspicion that this naive way might be too naive.
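In code, the wrapper is roughly like this (a simplified sketch rather than my exact code: the class name is made up, the constructor takes precomputed lists of ground state/action literals, and the subclassing workaround I mentioned is omitted):

import gym
import numpy as np
from gym import spaces

class NaivePDDLGymWrapper(gym.Env):
    """Sketch: flat bit-vector observations and discrete action indices for one fixed problem."""

    def __init__(self, env, ground_state_literals, ground_action_literals):
        self.env = env
        # Fix an ordering once so indices stay consistent across resets of the same problem.
        self._lit_index = {lit: i for i, lit in enumerate(sorted(ground_state_literals, key=str))}
        self._actions = sorted(ground_action_literals, key=str)
        self.observation_space = spaces.MultiBinary(len(self._lit_index))
        self.action_space = spaces.Discrete(len(self._actions))

    def _to_vec(self, obs):
        # PDDLGym states expose their true ground literals via obs.literals.
        vec = np.zeros(len(self._lit_index), dtype=np.int8)
        for lit in obs.literals:
            vec[self._lit_index[lit]] = 1
        return vec

    def reset(self):
        obs, _debug = self.env.reset()  # PDDLGym's reset also returns a debug-info dict
        return self._to_vec(obs)

    def step(self, action_index):
        # Map the Discrete index back to a ground action literal before stepping PDDLGym.
        obs, reward, done, info = self.env.step(self._actions[action_index])
        return self._to_vec(obs), reward, done, info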

gehring commented 3 years ago

@meneguzzi, did you mean to ping me instead of @ronuchit?

meneguzzi commented 3 years ago

Yep, damned autocomplete :-D.

gehring commented 3 years ago

The core idea of this indexing scheme is pretty straightforward. Indexing is first done by arity. For each predicate/action, a tuple of indices is generated: one index for each "variable", plus a batch dim index and an index for the predicate/action, so for arity n you get n + 2 indices. Conceptually, you are defining several binary arrays, one per arity, with n + 2 dimensions for arity n. The output of compute_indices is a tuple of the indices grouped by arity and the shapes of the corresponding binary arrays. If you want everything in a single array, ravel_literal_indices can take this as input to generate the indices for the flattened and concatenated arrays.

The code is written generally enough that it doesn't care whether it is dealing with predicates or actions. As long as it has an arity, it should work. Don't let the code intimidate you; most of it is just bookkeeping to group things by arity or to compute offsets. There really isn't anything fancy happening.
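To make the idea concrete, here is a hand-rolled toy version of the scheme (illustrative only, not my actual API; the objects, predicate, and dimension ordering are made up for the example):

import numpy as np

# One problem with three objects and a single binary predicate.
objects = sorted(["a", "b", "c"])
binary_preds = sorted(["on"])
obj_index = {o: i for i, o in enumerate(objects)}
pred_index = {p: i for i, p in enumerate(binary_preds)}

# For arity n, the binary array has n + 2 dims; one possible layout for arity 2:
# (batch, object, object, predicate).
batch_size = 1
shape = (batch_size, len(objects), len(objects), len(binary_preds))
state = np.zeros(shape, dtype=np.int8)

# Mark the ground literal on(a, b) as true in batch element 0.
idx = (0, obj_index["a"], obj_index["b"], pred_index["on"])
state[idx] = 1

# Raveling the per-arity index tuple gives a single flat index for that literal,
# which is roughly what flattening/concatenating the arrays buys you.
flat = np.ravel_multi_index(idx, shape)
assert state.reshape(-1)[flat] == 1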

Indexing is only guaranteed to be consistent across problems of the same domain if they use the same set of objects (or, more generally, objects that have the same lexicographical ordering, but I wouldn't really rely on that). This is because those initial tuples of indices are based on the sorted list of objects and the sorted list of predicates/actions.

I'm not sure I understand what you are describing in your last paragraph. Are you mixing code between pddlenv and pddlgym? In general, the easiest approach would be to use the ravel and unravel methods to convert between a single index and indices into the objects and predicates/actions. That also allows you to generate a PDDL state without ever needing to generate the full list of grounded predicates. If you want to index into a list of grounded predicates/actions, I believe you'll have to make sure it is sorted the same way as the indexing, that is, first by arity, then lexicographically within the same arity. I would make sure that generating a state from an indexed state gives back the same state.

btw, if you are looking for usage examples, you should check out the relevant test if you haven't already.

gehring commented 3 years ago

To understand why things are indexed this way, it might help to check the neural logic machines (NLM) paper. That architecture assumes inputs are grouped by arity, so the binary arrays my code generates are compatible with what NLMs expect.