hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] Specify a prior over action distribution? #903

Open · juliuskittler opened this issue 4 years ago

juliuskittler commented 4 years ago

This is a question/feature request for policy gradient based methods (e.g. A2C). Is it possible to specify a prior for the policy before training?

For instance, if I have 3 possible discrete actions (gym.spaces.Discrete(3)), I'd like to explore the action space during training not with probabilities 1/3, 1/3, 1/3 but with, e.g., 3/6, 2/6, 1/6.

What is the best solution for this as of right now? Many thanks in advance.

Miffyli commented 4 years ago

Unfortunately this is not very easy. If you want to fiddle with / force the action probabilities to something specific, you need to modify the corresponding distribution (in your case, CategoricalProbabilityDistribution). If you are satisfied with setting actions manually (or doing the sampling yourself), you can overwrite the action prediction in the training loop here.

A sidenote: PG methods do not sample actions from a fixed (1/3, 1/3, 1/3) distribution; they explore by sampling from the current policy (whose probabilities depend on the state and the current network parameters). The situation has to be quite unusual for fixed, weighted sampling like yours to make sense.
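
For illustration, here is a minimal numpy sketch of the "modify the distribution" idea, i.e. folding a fixed prior into the logits before sampling. This is not library code; actually wiring it into CategoricalProbabilityDistribution would mean overriding how its logits are constructed.

```python
import numpy as np

# Sketch only: fold a fixed prior into the policy logits before sampling.
# Integrating this into CategoricalProbabilityDistribution would require
# overriding how its logits are built; this function just shows the math.
def biased_probs(logits, prior):
    """Return softmax(logits + log(prior)) as the biased action probabilities."""
    shifted = logits + np.log(prior)
    exp = np.exp(shifted - shifted.max())  # numerically stable softmax
    return exp / exp.sum()

# With a uniform policy (equal logits), the biased probabilities reduce
# to exactly the 3/6, 2/6, 1/6 prior from the question.
print(biased_probs(np.zeros(3), np.array([3 / 6, 2 / 6, 1 / 6])))
```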

juliuskittler commented 4 years ago

@Miffyli thanks a lot, that's very helpful. I get your point with the sidenote. Overwriting the actions before they get passed into the list (mb_actions.append(actions)) seems like a good idea to me :)

I could just overwrite the actions with my desired probability distribution for the first k steps. After the first k steps, I could stop overwriting them and use the actions suggested by the policy network instead.
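
Something like the following sketch, placed right before mb_actions.append(actions) in the runner (the names prior_probs and k_prior_steps are just placeholders, not part of stable-baselines):

```python
import numpy as np

# Placeholder settings: sample from the prior for the first k_prior_steps
# environment steps, then fall back to the policy's own actions.
prior_probs = np.array([3 / 6, 2 / 6, 1 / 6])
k_prior_steps = 10_000

def maybe_override_actions(actions, total_steps):
    """Return actions drawn from the prior while total_steps < k_prior_steps,
    otherwise return the actions sampled by the policy unchanged."""
    if total_steps < k_prior_steps:
        return np.random.choice(len(prior_probs), size=len(actions), p=prior_probs)
    return actions
```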

Miffyli commented 4 years ago

@juliuskittler

I accidentally missed an important detail (I blame an overcooked brain): if you sample your own actions (i.e. overwrite them), this breaks the underlying sampling/update equations, as these actions are not sampled according to the current policy (i.e. you are doing off-policy actions with an on-policy algorithm). You could try using PPO, which contains a limited "adjustment" for this, or apply the overwriting only in the very first steps like you suggested.

juliuskittler commented 4 years ago

@Miffyli I should have noticed that myself - you're right. Thanks again for the hint.

I might still give it a try for some initial steps after all. In my case, the custom environment is only fully explored if the agent focuses on taking the same action many times (and episodes end early otherwise). By overwriting the actions, I want to force it to explore the full environment.