Looking at ACER's code regarding `mus`, you seem to be on the right track (the `proba_step` function). You should still check whether it gets modified along the way before being pushed to the replay memory. And yes, `masks` is the same as `dones` for the MLP policy (actually not used, I think).
Note though that there is no guarantee this setup will work: while ACER is (basically) A2C with experience replay and off-policy adjustments, something might break in this kind of setup, where the samples come from a completely different policy.
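For concreteness, a minimal sanity check before the `put` call could look like the sketch below (assumptions: the model exposes the policy's `proba_step` as discussed, and `"CartPole-v1"` plus the zero arrays are just placeholders):

```python
import numpy as np
from stable_baselines import ACER

model = ACER("MlpPolicy", "CartPole-v1")    # placeholder model/env
obs = np.zeros((4, 4), dtype=np.float32)    # placeholder observation batch
dones = np.zeros(4, dtype=bool)             # placeholder done flags

# mu(a|s): the full action distribution under the current policy.
# An MLP policy ignores the state argument, so None is fine here.
mus = model.proba_step(obs, None, dones)

# proba_step should return proper probability distributions, and nothing
# between here and Buffer.put should clip or renormalize them.
assert np.allclose(mus.sum(axis=-1), 1.0, atol=1e-5)
```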
> I think this effectively becomes DQfD
A sidenote, but not quite. An important part of DQfD is the large-margin training, which is necessary to obtain any sensible Q-values (without it, the actions with no samples will have undefined values).
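To make that term concrete, here is a rough NumPy version of DQfD's large-margin loss (a sketch only, not stable-baselines code; the 0.8 default is the margin used in the paper):

```python
import numpy as np

def large_margin_loss(q_values, expert_actions, margin=0.8):
    """DQfD supervised term: max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    where l(a_E, a) is 0 for the expert action and `margin` otherwise.
    q_values: (batch, n_actions) floats; expert_actions: (batch,) ints."""
    idx = np.arange(len(expert_actions))
    penalty = np.full_like(q_values, margin)
    penalty[idx, expert_actions] = 0.0
    # Pushes Q of the expert action above every other action by >= margin.
    return np.mean(np.max(q_values + penalty, axis=1)
                   - q_values[idx, expert_actions])
```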
I'm currently training an agent (discrete actions) to imitate a human expert player from experiences generated with `generate_expert_traj`. I found that the supervised learning method of using `pretrain` isn't very effective: the state space an expert experiences is so far from that of a new agent that, by the time the new agent refines its policy to get there, it has already overwritten / forgotten the pretraining.

So I've been trying to inject experiences directly into replay buffers for off-policy learning. For DQN I think this is fairly straightforward, as there's a `replay_buffer_add` interface where I can add `obs_, action, reward_, new_obs_, done` tuples from the recorded `.npz` file. I think this effectively becomes DQfD (https://arxiv.org/pdf/1704.03732.pdf).
I also want to try this for ACER, but I'm not exactly sure how to retrieve the `mus` values. I understand from the paper that mu represents the probability distribution over actions (discrete case) given the state. Would that just be `mus = ACER_model.proba_step(obs, states, dones)`, or is it `ACER_model.action_probability(...)`? Then is it just `buffer.put(enc_obs, actions, rewards, mus, dones, masks)`, where `masks` is essentially the same as `dones` in the case of an MLP policy (see the sketch below for what I'm picturing)? Let me know if that looks right / if I'm missing something.
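Concretely (untested, using the names from above; the `expert_*` arrays are placeholders and would need reshaping into the `(n_envs, n_steps)` segments that ACER's buffer stores):

```python
# Recover mus with the current policy; an MLP policy ignores the state arg.
mus = ACER_model.proba_step(expert_obs, None, expert_dones)
# masks is passed the same flags as dones for an MLP policy.
buffer.put(expert_enc_obs, expert_actions, expert_rewards,
           mus, expert_dones, expert_dones)
```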
Thank you!