Hi Phil, huge fan of your work. I have two questions regarding policy gradients in TensorFlow for SpaceInvaders:
1. In reinforce_cnn_tf.py, in the choose_action function, there is a line:
probabilities = self.sess.run(self.actions, feed_dict={self.input: observation})[0]
Here the 0 seems to specify that the action probability distribution is the first of the 4 probability distributions. If so, the actions are chosen based on the first frame, i.e. the 0th observation of stacked_frames. Is that right?
2. Assuming my first assumption is right, there is a line in the main_tf_reinforce_space_invaders.py file:
observation, reward, done, info = env.step(action)
observation = preprocess(observation)
stacked_frames = stack_frames(stacked_frames, observation, stack_size)
agent.store_transition(observation, action, reward) (this one)
Here the new observation is stored together with an action that was chosen based on the 0th observation in stacked_frames. If that is the case, why does this work when training the agent? Are the probability distributions produced when the observations are fed in different from the labels?
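To make the first question concrete, here is a small NumPy sketch of the two ways I can read that [0]. It does not use the repository code: the softmax helper, the random logits, and N_ACTIONS = 6 are all my own assumptions for illustration, and I am not claiming either reading is what the network actually does.

```python
import numpy as np

N_ACTIONS = 6  # assumed size of the action space (hypothetical value)
N_FRAMES = 4   # number of stacked frames, as in the question


def softmax(logits):
    """Row-wise softmax, so every row is a probability distribution."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)

# Reading A: the network sees a batch holding ONE stacked observation and
# returns one distribution per batch element, i.e. shape (1, N_ACTIONS).
# Then [0] only strips the batch dimension.
batch_output = softmax(rng.normal(size=(1, N_ACTIONS)))
probs_a = batch_output[0]

# Reading B (my assumption in question 1): the network returns one
# distribution per stacked frame, i.e. shape (N_FRAMES, N_ACTIONS).
# Then [0] picks the distribution belonging to the first frame.
per_frame_output = softmax(rng.normal(size=(N_FRAMES, N_ACTIONS)))
probs_b = per_frame_output[0]

print("reading A:", batch_output.shape, "->", probs_a.shape)
print("reading B:", per_frame_output.shape, "->", probs_b.shape)
```

In both readings the result of [0] is a single vector of length N_ACTIONS, so the indexing alone does not tell the two apart. Printing the shape of the array returned by self.sess.run(...) before the [0] is applied would show which reading matches the actual code.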