hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] Pretraining with custom GoalEnv environment #1108

Closed OGordon100 closed 3 years ago

OGordon100 commented 3 years ago

I'm very interested in using expert examples with an agent such as DQN, and adding HER (my task cannot be simulated and is a partially observable Markov decision process).

I've had a dig through the source code - the ExpertDataset class only works with Discrete or continuous action spaces. GoalEnv (required for HER, see #198 and #750) wants an observation space made of a dict of three spaces, with the keys "observation", "achieved_goal" and "desired_goal". It therefore seems impossible to use .pretrain() with a GoalEnv class, even though it does seem possible to manually add transitions to the replay buffer (see #1090, but I don't know what I'm doing!)
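
For reference, this is roughly the observation space a GoalEnv has to expose (a toy sketch I put together, not code from the repo; the SketchGoalEnv name and the shapes are made up for illustration):

```python
import numpy as np
import gym
from gym import spaces


class SketchGoalEnv(gym.GoalEnv):
    """Hypothetical toy GoalEnv showing the required dict observation space."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(4)
        # GoalEnv requires a Dict space with exactly these three keys
        self.observation_space = spaces.Dict({
            "observation": spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32),
            "achieved_goal": spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32),
            "desired_goal": spaces.Box(-1.0, 1.0, shape=(3,), dtype=np.float32),
        })

    def reset(self):
        # Return a dict observation (random here, just to show the structure)
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        reward = self.compute_reward(obs["achieved_goal"], obs["desired_goal"], {})
        return obs, reward, False, {}

    def compute_reward(self, achieved_goal, desired_goal, info):
        # Sparse reward: 0 when close to the goal, -1 otherwise
        return -float(np.linalg.norm(achieved_goal - desired_goal) > 0.05)
```

ExpertDataset, on the other hand, assumes flat observations, which is where the mismatch comes from.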

I know imitation learning was moved out to the imitation repo with SB3, but I am forced to use TF 1.15.

Would anybody know the best way to proceed?

Miffyli commented 3 years ago

There is no simple way to do this. It would require extensive modifications, some of which you pointed out. I would highly recommend setting up a PyTorch/TF2 environment (e.g. with conda/pipenv environments) so you can use more actively maintained repositories.

As a side note: I am not sure HER can be used with imitation learning like this. I would avoid haphazardly trying algorithms in situations they were not originally designed for. If you want to solve the problem you have, I would look into existing solutions in offline RL and the like.

Edit: For help with practical matters, I recommend checking out the RL Discord.

OGordon100 commented 3 years ago

Thanks @Miffyli

I'm pretty sure it could be used - there's a wealth of papers on combining hindsight with imitation learning. And really, all we are doing is giving the parameters a head start, so I don't see why it would not work?

I've also done a little more digging - at present, algorithms like DQN become compatible with GoalEnv via a function called convert_dict_to_obs, which literally just stacks the dict entries into one big array; the replay buffer side then converts it back to a dict. I can see it being non-trivial.
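
My rough understanding of that flattening/un-flattening is something like the sketch below (based on my reading of the HER wrapper, not the actual source; the observation/goal dimensions are assumptions for illustration):

```python
import numpy as np

# Key order assumed fixed so the flat array can be split back deterministically
KEY_ORDER = ["observation", "achieved_goal", "desired_goal"]


def convert_dict_to_obs(obs_dict):
    # Stack the GoalEnv dict entries into one big flat array
    return np.concatenate([obs_dict[key] for key in KEY_ORDER])


def convert_obs_to_dict(obs, obs_dim=6, goal_dim=3):
    # Split the flat array back into the dict the GoalEnv API expects
    # (obs_dim/goal_dim are made-up example sizes)
    return {
        "observation": obs[:obs_dim],
        "achieved_goal": obs[obs_dim:obs_dim + goal_dim],
        "desired_goal": obs[obs_dim + goal_dim:],
    }
```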

Dang, back to the drawing board - thanks for the discord!