hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] How to insert experiences into replay buffers? #1090

Closed Wesleyliao closed 3 years ago

Wesleyliao commented 3 years ago

I'm currently training an agent (discrete actions) to imitate a human expert player from experiences generated with generate_expert_traj. I found that the supervised learning approach using pretrain isn't very effective: the states an expert visits are so far from those a fresh agent sees that, by the time the new agent has refined its policy enough to reach them, it has already overwritten / forgotten the pretraining.

So I've been trying to inject experiences directly into replay buffers for off-policy learning. For DQN I think this is fairly straightforward, since there's a replay_buffer_add interface where I can add (obs_, action, reward_, new_obs_, done) tuples from the recorded .npz file. I think this effectively becomes DQfD (https://arxiv.org/pdf/1704.03732.pdf).
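For reference, a rough sketch of what prefilling could look like. The .npz keys match what generate_expert_traj saves, but the buffer handling is an assumption: DQN.learn() builds its own replay buffer internally, so these transitions would still need to be copied into model.replay_buffer once it exists (or the setup patched accordingly).

```python
import numpy as np
from stable_baselines.deepq.replay_buffer import ReplayBuffer

# Trajectories recorded with generate_expert_traj
data = np.load("expert_trajectories.npz")
obs = data["obs"]
actions = data["actions"]
rewards = data["rewards"]
episode_starts = data["episode_starts"]  # True where a new episode begins

# Hypothetical standalone buffer filled with the expert transitions
demo_buffer = ReplayBuffer(size=50_000)

for t in range(len(obs) - 1):
    # The transition is terminal if the *next* index starts a new episode
    done = bool(episode_starts[t + 1])
    demo_buffer.add(obs[t], actions[t], rewards[t], obs[t + 1], float(done))
```

When done is True, obs[t + 1] is actually the first observation of the next episode, but since terminal transitions are not bootstrapped this should not affect the Q-learning target.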

I also want to try this for ACER, but I'm not exactly sure how to retrieve the mus values. I understand from the paper that mu represents the probability distribution over actions (discrete case) given the state. Would that just be mus = ACER_model.proba_step(obs, states, dones), or is it ACER_model.action_probability(...)? Then is it just buffer.put(enc_obs, actions, rewards, mus, dones, masks), where masks is essentially the same as dones in the case of an MLP policy?

Let me know if that looks right / if I'm missing something.

Thank you!

Miffyli commented 3 years ago

Looking at ACER's code regarding mus, you seem to be on the right track (the proba_step function). You should still check whether it gets modified along the way before being pushed to the replay memory. And yes, masks is the same as dones for an MLP policy (actually not used, I think).
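For concreteness, a minimal sketch of that call, assuming obs / actions / rewards / dones are already shaped per-env, per-step the way ACER's Buffer expects; treat it as an outline rather than a drop-in snippet.

```python
# Behaviour-policy probabilities for the recorded observations, used by
# ACER's off-policy correction. state=None and mask=dones for an MLP policy.
mus = ACER_model.proba_step(obs, None, dones)

# For an MLP policy, masks just mirrors dones (and is effectively unused).
masks = dones

# enc_obs / actions / rewards / dones must already match the (n_envs, n_steps)
# layout that stable_baselines.acer.buffer.Buffer.put expects.
buffer.put(enc_obs, actions, rewards, mus, dones, masks)
```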

Note though that there is no guarantee this setup will work: while ACER is (basically) A2C with experience replay and off-policy adjustments, something might break in a setup where the samples come from a completely different policy.

"I think this effectively becomes DQfD"

A sidenote, but not quite: an important part of DQfD is the large-margin supervised loss, which is necessary to obtain any sensible Q-values (without it, actions with no samples will have undefined values).
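To illustrate that point, a rough numpy sketch of the per-state large-margin term from the DQfD paper; the function name and the default margin are just placeholders (the margin is a hyperparameter in the paper).

```python
import numpy as np

def large_margin_loss(q_values, expert_action, margin=0.8):
    # DQfD supervised term for one state:
    #   max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    # where l(a_E, a) = 0 if a == a_E and `margin` otherwise.
    # This pushes the Q-value of the demonstrated action above all others
    # by at least `margin`, so actions never seen in the demonstrations
    # cannot end up with arbitrarily high (undefined) values.
    penalties = np.full_like(q_values, margin, dtype=np.float64)
    penalties[expert_action] = 0.0
    return np.max(q_values + penalties) - q_values[expert_action]
```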