hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[Feature Proposal] Intrinsic Reward VecEnvWrapper #309

Open araffin opened 5 years ago

araffin commented 5 years ago

Recent approaches have proposed enhancing exploration using an intrinsic reward. Among the techniques are forward-model-based curiosity (e.g. ICM) and Random Network Distillation (RND).

The way I would do that (a rough sketch follows the list below):

  1. Using a VecEnvWrapper so it is compatible with all the algorithms without any modifications
  2. I would use a replay buffer inside the wrapper (this requires more memory but is quite general)
  3. the different parameters of the wrapper:
    • network: the network to use (for forward / RND / ... models), could be a CNN or an MLP
    • weight_intrinsic_reward: scale of the intrinsic reward compared to the extrinsic one
    • buffer_size: how many transitions to store
    • train_freq: train the network every n steps
    • gradient_steps: how many gradient steps
    • batch_size: minibatch size
    • learning_starts: start computing the intrinsic reward only after n steps
    • save/load: save and load weights of the network used for computing intrinsic reward
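
A rough sketch of what such a wrapper could look like is below. This is only an illustration: the class name, the default values, and the assumption that `network` exposes `intrinsic_reward(obs)` and `train(obs_batch)` methods are hypothetical, not existing stable-baselines API.

```python
import numpy as np

from stable_baselines.common.vec_env import VecEnvWrapper


class IntrinsicRewardVecEnvWrapper(VecEnvWrapper):
    """Illustrative sketch: augments the extrinsic reward with an intrinsic bonus."""

    def __init__(self, venv, network, weight_intrinsic_reward=0.01,
                 buffer_size=10000, train_freq=1000, gradient_steps=4,
                 batch_size=64, learning_starts=1000):
        super(IntrinsicRewardVecEnvWrapper, self).__init__(venv)
        self.network = network  # hypothetical model with intrinsic_reward() / train() methods
        self.weight_intrinsic_reward = weight_intrinsic_reward
        self.buffer_size = buffer_size
        self.train_freq = train_freq
        self.gradient_steps = gradient_steps
        self.batch_size = batch_size
        self.learning_starts = learning_starts
        self.buffer = []  # naive FIFO replay buffer of observations
        self.num_steps = 0

    def step_wait(self):
        obs, rewards, dones, infos = self.venv.step_wait()
        self.num_steps += self.num_envs
        # store the new observations for later training of the intrinsic-reward model
        self.buffer.extend(obs)
        self.buffer = self.buffer[-self.buffer_size:]
        # periodically train the model on minibatches sampled from the buffer
        if self.num_steps >= self.learning_starts and self.num_steps % self.train_freq < self.num_envs:
            for _ in range(self.gradient_steps):
                indices = np.random.randint(len(self.buffer), size=self.batch_size)
                self.network.train(np.array([self.buffer[i] for i in indices]))
        # add the scaled intrinsic bonus to the extrinsic reward
        if self.num_steps >= self.learning_starts:
            rewards = rewards + self.weight_intrinsic_reward * self.network.intrinsic_reward(obs)
        return obs, rewards, dones, infos

    def reset(self):
        return self.venv.reset()
```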

Drawbacks:

  • this would slow down the speed (because of extra learning involved)
  • uses more memory (because of the replay buffer and the network)

Related issue: #299

hill-a commented 5 years ago
  1. Using a VecEnvWrapper so it is compatible with all the algorithms without any modifications

Yes, good idea. Definitely agree there.

  2. I would use a replay buffer inside the wrapper (this requires more memory but is quite general)

I'm not sure about wrappers and replay buffers at the moment, due to the inherent issues of modifying the states over time (by VecNormalize, for example). A rework of this needs to be done in general, I think (like placing the replay buffer in a specific wrapper by default).

  3. the different parameters of the wrapper:

All agreed

Drawbacks:

  • this would slow down the speed (because of extra learning involved)
  • uses more memory (because of the replay buffer and the network)

It would slow things down and use more memory, but not by an order of magnitude; it is a small factor increase at worst. IMO this is not that bad of a problem.

huvar commented 5 years ago

As far as I understand from the formal definitions, the states of recurrent units such as LSTMs are part of the environment (a.k.a. universe) state. So, should they not be included in the curiosity calculations? This would be a drawback of calculating it in the Env, where they cannot be accounted for.

araffin commented 5 years ago

the states of recurrent units such as LSTMs are part of the environment (a.k.a. universe) state.

I would rather say that the state of the LSTM, which is in fact the memory cell and the hidden state, is part of the agent policy, not the environment.

So, should they not be included in the curiosity calculations?

I'm not aware of work that uses the LSTM state of the agent policy for creating an intrinsic reward... are you referring to a particular paper?

NeoExtended commented 4 years ago

Hey *, I am currently working on my thesis and am struggling a little bit with an environment which is hard to explore. Therefore I thought it would be great to try to implement curiosity and see how it works. I know the project is currently heading towards 3.0 with the switch to the PyTorch backend, but would you still be interested in a PR once I'm done?

Miffyli commented 4 years ago

@NeoExtended

Sadly, we will not be taking any new features/enhancements for v2 right now, as you mentioned. This could be added in later versions, after v3.

But, if you wish to try out exploration techniques in your environment, take a look at Unity's ML-agents and their PPO. They support exploration bonuses.

araffin commented 4 years ago

but would you still be interested in a PR once im done?

This feature should be a gym.Wrapper so independent of the backend.

This could be added in the later versions after v3.

I agree, or at least reference the implementation in the doc.

Miffyli commented 4 years ago

This feature should be a gym.Wrapper so independent of the backend.

Hmm, actually that is a good point. At least some of the curiosity methods (like RND, which predicts the output of a random network) could be done simply like this.
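
For illustration, a backend-independent gym.Wrapper skeleton could look roughly like this; the `curiosity_model` interface with `compute_bonus()`/`update()` methods is a hypothetical placeholder, not existing code:

```python
import gym


class CuriosityWrapper(gym.Wrapper):
    """Illustrative skeleton: adds an intrinsic bonus on top of the extrinsic reward.

    The curiosity model (e.g. an RND predictor) is hidden behind hypothetical
    compute_bonus()/update() hooks, so the wrapper stays independent of the
    RL algorithm and of the deep-learning backend used by the agent.
    """

    def __init__(self, env, curiosity_model, scale=0.01):
        super(CuriosityWrapper, self).__init__(env)
        self.curiosity_model = curiosity_model
        self.scale = scale

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        bonus = self.curiosity_model.compute_bonus(obs)  # e.g. RND prediction error
        self.curiosity_model.update(obs)                 # train the predictor on the new observation
        return obs, reward + self.scale * bonus, done, info
```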

NeoExtended commented 4 years ago

This feature should be a gym.Wrapper so independent of the backend.

Hmm, actually that is a good point. At least some of the curiosity methods (like RND, which predicts the output of a random network) could be done simply like this.

Exactly. I think we can abstract all the network code by reusing functions from the policies (for example the mlp_extractor method). Just the training process of the networks would be included in the wrapper and dependent on the backend, but that shouldn't be too hard to change afterwards.

NeoExtended commented 4 years ago

I finally experimented with the wrapper today and noticed that we somehow need to train the RND networks inside the wrapper. Currently I don't see a way of doing this independently of the backend, since I need a new TF session for this. Or is there a different option to train the networks? (I am currently creating the target and predictor networks via the nature_cnn/mlp_extractor methods.)

Miffyli commented 4 years ago

@NeoExtended

You should be able to create different sessions, although I am not familiar with this (Google might help you here). You could also try using PyTorch to implement the RND.
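
As a rough starting point, a minimal RND model in PyTorch could look like the sketch below; the architecture, feature size, and learning rate are illustrative assumptions, and observations are assumed to be flat float tensors:

```python
import torch
import torch.nn as nn


class RNDModel(nn.Module):
    """Minimal RND sketch: a fixed random target network and a trained predictor.

    The intrinsic reward is the predictor's error on the target's output, which
    tends to be high for rarely visited observations.
    """

    def __init__(self, obs_dim, feature_dim=64):
        super(RNDModel, self).__init__()

        def make_net():
            return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feature_dim))

        self.target = make_net()     # fixed, randomly initialized
        self.predictor = make_net()  # trained to imitate the target
        for param in self.target.parameters():
            param.requires_grad = False
        self.optimizer = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    def intrinsic_reward(self, obs):
        # per-observation prediction error, used as the exploration bonus
        with torch.no_grad():
            return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)

    def train_step(self, obs_batch):
        # one gradient step on the predictor (the target stays frozen)
        loss = (self.predictor(obs_batch) - self.target(obs_batch)).pow(2).mean()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```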

However, this is not a stable-baselines related issue per se, so you may close this issue if you have no further questions related to stable-baselines.

NeoExtended commented 4 years ago

It's been a bit longer than just a few days, but I finally implemented the RND curiosity wrapper. I know you will not include it until after v3, but just for those interested, I already uploaded the code here.

The class is derived from the new BaseTFWrapper class, which just uses code copied from BaseRLModel and ActorCriticRLModel to implement saving and loading. This is quite ugly and would require some refactoring, but since the code needs to be rewritten anyway for v3, I did not invest the time.

I did some testing to verify the implementation and was able to train a PPO agent on Pong using intrinsic reward only. As expected, the agent optimizes for episode length instead of trying to win (and maximizing extrinsic reward).

[Plots: rnd_pong_episode_length, rnd_pong_episode_reward]

If you are interested, I would update the wrapper for v3 as soon as it is released.

Miffyli commented 4 years ago

Looks very promising, and quite compact by reusing stable-baselines functions. This would be a good addition to v3, but it would be a nice tool outside stable-baselines, too, as it is independent of the RL algorithms :).

Things should be cleaner by v3, which will be PyTorch-based, so there is no need to delve into cleaning up BaseTFWrapper for now.

m-rph commented 4 years ago

I am not sure if this is the correct approach. In RND, the critic network uses two value heads to estimate the two reward streams, so implementing it as a wrapper will block this approach, unless the wrapper returns both rewards in the info dict, but that is kind of messy.

Miffyli commented 4 years ago

Most of the mentioned curiosity methods just add the intrinsic reward to the extrinsic reward of the environment, and still show improvement over previous results. I agree the dual architecture as presented in the RND paper could be better (as the results indicate, especially with RNN policies), but I do not think it is worth the hassle to implement before we implement these curiosity wrappers with a single reward stream.

m-rph commented 4 years ago

I agree with you; perhaps both streams could be made available in the info dict? This would be quite useful for evaluating performance and debugging the algorithm.
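
For example, extending the kind of wrapper sketched earlier in the thread (variable and key names here are only illustrative), the agent would still see a single combined reward while both streams are kept around for logging:

```python
# Inside a hypothetical intrinsic-reward VecEnvWrapper (see the sketch above),
# step_wait() could expose both reward streams via the info dicts:
def step_wait(self):
    obs, rewards, dones, infos = self.venv.step_wait()
    intrinsic = self.weight_intrinsic_reward * self.network.intrinsic_reward(obs)
    for env_idx, info in enumerate(infos):
        # keep both streams so they can be logged and evaluated separately
        info["extrinsic_reward"] = float(rewards[env_idx])
        info["intrinsic_reward"] = float(intrinsic[env_idx])
    # the agent still receives a single combined reward
    return obs, rewards + intrinsic, dones, infos
```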

FabioPINO commented 4 years ago

I am sorry for getting in the middle of your conversation. I am really interested in the application of intrinsic rewards in the learning pipeline. I noticed that @NeoExtended created an RND curiosity wrapper, but I am not very familiar with how to use it. Is it possible to have a representative example of how to use this tool?

NeoExtended commented 4 years ago

Hey @FabioPINO, sorry for the late answer, I am quite busy at the moment. I will try to get you a minimal example (which I used to generate the plots above) by the end of the week.
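
(For reference, a typical usage pattern for such a wrapper with stable-baselines v2 might look roughly like the sketch below. The curiosity wrapper import and its arguments are placeholders, not the actual class from the code linked above; everything else is standard stable-baselines API.)

```python
from stable_baselines import PPO2
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack

# Placeholder import: substitute the actual wrapper class from the linked code.
# from rnd_wrapper import RNDCuriosityWrapper

# standard Atari preprocessing + frame stacking
env = make_atari_env("PongNoFrameskip-v4", num_env=8, seed=0)
env = VecFrameStack(env, n_stack=4)
# env = RNDCuriosityWrapper(env, weight_intrinsic_reward=0.01)  # placeholder name and arguments

model = PPO2("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=int(1e6))
model.save("ppo2_pong_rnd")
```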