Open ZixuanLiu4869 opened 2 years ago
Hi, what do you mean by "I use my customized reward to update the agent network"? The reward is calculated and generated by the environment, which you can (and usually should) customize to your problem, but the reward is NOT used directly to update the network.

In Acme, the update happens in run_experiment._LearningActor._maybe_train(), which delegates to a learner, e.g. the SGDLearner if you are using DQN. The learner calculates the TD error (by calling a Q-learning algorithm in rlax, e.g. double_q_learning()), and that TD error is then back-propagated to update the network. I'm not sure whether I answered your question, it's possible I got it wrong; anyway, hope this helps.
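Roughly speaking, the reward from a replayed transition only enters training through that TD error. A minimal sketch of the idea (not the actual SGDLearner code, which also deals with target network parameters and other details; shapes here are unbatched and vmapped over the batch):

```python
import jax
import jax.numpy as jnp
import rlax


def _td_error(q_tm1, a_tm1, r_t, discount_t, q_t_value, q_t_selector):
  # Double Q-learning TD error:
  #   r_t + discount_t * q_t_value[argmax(q_t_selector)] - q_tm1[a_tm1]
  return rlax.double_q_learning(q_tm1, a_tm1, r_t, discount_t,
                                q_t_value, q_t_selector)


def batch_loss(q_tm1, a_tm1, r_t, discount_t, q_t_value, q_t_selector):
  # vmap over the leading batch dimension, then average the squared TD errors;
  # this scalar loss is what gets back-propagated through the network.
  td_errors = jax.vmap(_td_error)(q_tm1, a_tm1, r_t, discount_t,
                                  q_t_value, q_t_selector)
  return jnp.mean(rlax.l2_loss(td_errors))
```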
I think, in this case, you can consider creating a special agent for your problem that only calls learner.step at the end of an episode. Alternatively, there is always the possibility of writing your own custom training loop where you can decide when to update the agent. The default implementation in Acme covers the most common use case; for settings where you deviate from the default behavior, writing a custom training loop or actor seems to be the only option right now.
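As a rough illustration of the custom-training-loop option, here is a minimal sketch (not an official Acme example) that only trains at the end of an episode; it assumes an environment, actor and learner built the usual way and following Acme's standard interfaces:

```python
# Sketch of a custom training loop that only trains at episode boundaries.
# `environment`, `actor` and `learner` follow Acme's standard interfaces
# (dm_env.Environment, acme.core.Actor, acme.core.Learner).
def run_episode_then_train(environment, actor, learner, num_learner_steps=1):
  timestep = environment.reset()
  actor.observe_first(timestep)
  while not timestep.last():
    action = actor.select_action(timestep.observation)
    timestep = environment.step(action)
    actor.observe(action, next_timestep=timestep)
  # Train only now, at the end of the episode.
  for _ in range(num_learner_steps):
    learner.step()
  actor.update()  # pull the latest learner parameters into the acting policy
```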
Hi, I know that the reward is generated by the environment. Let's say I want to record one episode and not update the network during that episode. The episode has rewards generated by the environment. Then I want to replace the rewards in the recorded episode with my own customized rewards and update the network. Is there a way I can get the rewards from the recorded episode and change them?
Do you have any examples of creating a special agent or writing a custom training loop?
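Not an official example, but one way the reward-relabelling idea above could look as a custom loop: buffer the episode yourself, overwrite the rewards, and only then hand the transitions to the actor (and hence its adder/replay). Here relabel_rewards is a hypothetical function you would supply; the rest uses Acme's standard actor/learner interfaces:

```python
# Sketch: record an episode without training, replace the environment rewards
# with custom ones, then write the relabelled transitions and train once.
# `relabel_rewards` is hypothetical: it maps the recorded episode to a list of
# new rewards, one per environment step.
def run_relabelled_episode(environment, actor, learner, relabel_rewards):
  timestep = environment.reset()
  actor.observe_first(timestep)
  recorded = []  # (action, next_timestep) pairs, held back from replay for now
  while not timestep.last():
    action = actor.select_action(timestep.observation)
    timestep = environment.step(action)
    recorded.append((action, timestep))

  new_rewards = relabel_rewards(recorded)
  for (action, ts), reward in zip(recorded, new_rewards):
    # dm_env.TimeStep is a namedtuple, so _replace swaps in the custom reward.
    actor.observe(action, next_timestep=ts._replace(reward=reward))

  learner.step()   # update the network using the relabelled data
  actor.update()
```

Note that for replay-based agents the learner can only sample once enough transitions have been written, so the learner.step() call may need to be guarded in practice.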
Hi, I have some silly questions about updating the agent. I know the general framework of training is the standard environment loop, where actor.update is used to update the agent after every step. But I want to run the whole episode, and then at the end of the episode use my customized reward to update the agent network. What should I do? I am new to this field and I would appreciate it if someone could help me!
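For reference, the default behaviour described above is roughly the following per-step loop (a simplified sketch of what acme.EnvironmentLoop does, not its exact code); the end-of-episode variant sketched earlier in the thread differs only in when learner.step / actor.update are called:

```python
# Simplified sketch of the default per-step training loop: actor.update() is
# called after every environment step, which is what triggers learning.
def run_default_episode(environment, actor):
  timestep = environment.reset()
  actor.observe_first(timestep)
  while not timestep.last():
    action = actor.select_action(timestep.observation)
    timestep = environment.step(action)
    actor.observe(action, next_timestep=timestep)
    actor.update()  # the per-step update the question refers to
```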