inoryy / tensorflow2-deep-reinforcement-learning

Code accompanying the blog post "Deep Reinforcement Learning with TensorFlow 2.1"
http://inoryy.com/post/tensorflow2-deep-reinforcement-learning/
MIT License

two questions about the training loop #9

Closed amjass12 closed 3 years ago

amjass12 commented 3 years ago

Hi! I have a question about the training loop for clarification:

In this step:

def train(self, env, batch_sz=64, updates=250):
    # Storage helpers for a single batch of data.
    actions = np.empty((batch_sz,), dtype=np.int32)
    rewards, dones, values = np.empty((3, batch_sz))
    observations = np.empty((batch_sz,) + env.observation_space.shape)

    # Training loop: collect samples, send to optimizer, repeat updates times.
    ep_rewards = [0.0]
    next_obs = env.reset()
    for update in range(updates):
      for step in range(batch_sz):
        observations[step] = next_obs.copy()
        actions[step], values[step] = self.model.action_value(next_obs[None, :])
        next_obs, rewards[step], dones[step], _ = env.step(actions[step])

        ep_rewards[-1] += rewards[step]
        if dones[step]:
          ep_rewards.append(0.0)
          next_obs = env.reset()
          logging.info("Episode: %03d, Reward: %03d" % (
            len(ep_rewards) - 1, ep_rewards[-2]))

      _, next_value = self.model.action_value(next_obs[None, :])

Whenever action_value is called, is it working on one observation, or on a batch of observations (batch_sz in length)? I'm slightly confused, since the action_value method uses the predict_on_batch function from TensorFlow:


def action_value(self, obs):
    # Executes `call()` under the hood.
    logits, value = self.predict_on_batch(obs)
    action = self.dist.predict_on_batch(logits)

When I print the shape of next_obs, it is simply (4,), from which I would conclude that this is just one state observation.

When I print observations, which is what train_on_batch is called with, it does have the batch size: (64, 4).

I just want to clarify that whenever model.action_value is called, it is predicting on one observation (4,), and that the only time the batch size is used is for training on a batch. If this is the case, why is model.predict not used instead of predict_on_batch in action_value?

Thanks for your time!

inoryy commented 3 years ago

Hello,

I just want to clarify that whenever model.action_value is called, it is predicting on one observation (4,) and that the only time the batch size is used is for training on batch.

Yes, that is correct.
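
One detail worth spelling out: the call passes next_obs[None, :], so the model still receives a batch, just a batch of size one. The sketch below uses NumPy only and reproduces the shapes reported above. The zero-filled arrays are placeholders, and obs_shape stands in for env.observation_space.shape, assumed here to be (4,) as in the question.

```python
import numpy as np

# Assumed values: a 4-feature observation (as printed in the question)
# and the default batch_sz=64 from train().
obs_shape = (4,)
batch_sz = 64

next_obs = np.zeros(obs_shape)                     # placeholder for one env observation
single_batch = next_obs[None, :]                   # what action_value is called with
observations = np.empty((batch_sz,) + obs_shape)   # storage later used for training

print(next_obs.shape)      # (4,)
print(single_batch.shape)  # (1, 4)
print(observations.shape)  # (64, 4)
```

So action_value sees one observation per call, wrapped in a leading batch axis of length 1, while the full (64, 4) array is only used for the training step.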

If this is the case, for action_value, why is model.predict not used instead of predict_on_batch?

At the time, predict incurred some additional performance overhead. It might have been fixed since, so it is worth testing. If you look at the sources, predict does quite a bit of additional work before eventually calling the same function as predict_on_batch.
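
One rough way to run that test is a small timing helper like the sketch below. It is only an illustration: mean_call_time and fake_predict are hypothetical names, and fake_predict is a NumPy stand-in so the snippet runs without TensorFlow.

```python
import time
import numpy as np

def mean_call_time(fn, arg, repeats=200):
    """Return the mean wall-clock seconds per call of fn(arg)."""
    fn(arg)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats

# Hypothetical stand-in for a model call on a batch of one observation.
weights = np.random.rand(4, 2)

def fake_predict(obs):
    return obs @ weights

obs = np.zeros((1, 4))
elapsed = mean_call_time(fake_predict, obs)
print("%.2e s per call" % elapsed)
```

With a real Keras model, the same helper could be called once with model.predict_on_batch and once with lambda x: model.predict(x, verbose=0), both on the same (1, 4) input, to compare the per-call cost on the installed TensorFlow version.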

amjass12 commented 3 years ago

Hi @inoryy,

Thank you for clarifying. This all makes sense!

I will test model.predict and see how it works. Thanks again!