Closed muupan closed 4 years ago
Currently, in actor-learner training, `self.t` is equal to `self.optim_t * self.update_interval`. Thus, it simply counts the number of updates multiplied by `update_interval`. It is used to compare against `target_update_interval`. This makes more sense than using cumulative steps, since the target network should be updated only after the model has been updated enough times.
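To make the counting concrete, here is a minimal sketch (with hypothetical class and attribute names; only `optim_t`, `t`, `update_interval`, and `target_update_interval` come from the discussion above) of a target network synced against update counts rather than environment steps:

```python
# Sketch: the target network is synced based on how many times the model
# has actually been updated, via self.t = optim_t * update_interval.
class Learner:
    def __init__(self, update_interval=4, target_update_interval=100):
        self.update_interval = update_interval
        self.target_update_interval = target_update_interval
        self.optim_t = 0  # number of optimizer updates performed
        self.t = 0        # optim_t * update_interval, as described above
        self.target_syncs = 0  # how many times the target net was synced

    def update(self):
        self.optim_t += 1
        self.t = self.optim_t * self.update_interval
        # Compare self.t (update-derived) against target_update_interval,
        # so target syncs track model updates, not raw env steps.
        if self.t % self.target_update_interval == 0:
            self.target_syncs += 1


learner = Learner()
for _ in range(100):
    learner.update()
# After 100 updates: t = 100 * 4 = 400, and the target net was synced
# at t = 100, 200, 300, 400, i.e. 4 times.
```

If `self.t` were instead driven by cumulative environment steps, a slow learner could be forced to sync its target network before having done any updates at all, which is the mismatch the comment above is pointing at.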
These are indeed confusing. I think there is room for improvement, but I leave it for future work.
`self._cumulative_steps` was introduced for correctly counting steps in actor-learner training, but was invalid in non-actor-learner training. (See the internal repository for how it was introduced.) This PR makes it valid in both actor-learner and non-actor-learner training. In non-actor-learner training, it is now equivalent to `self.t`. However, `self.t` and `self._cumulative_steps` differ in actor-learner training, which is why they cannot simply be merged into one.
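A small sketch (hypothetical names throughout, except `self.t`, `self.optim_t`, `self._cumulative_steps`, and `update_interval` from the comments above) of why the two counters diverge in actor-learner training: actors keep stepping their environments regardless of how many updates the learner has managed to perform.

```python
# Sketch: in actor-learner training, _cumulative_steps counts every
# environment step taken by any actor, while t is derived from the
# learner's update count, so the two counters can drift apart.
class Agent:
    def __init__(self, update_interval=4):
        self.update_interval = update_interval
        self.optim_t = 0            # learner-side update counter
        self._cumulative_steps = 0  # env steps across all actors

    @property
    def t(self):
        # As discussed above: t = optim_t * update_interval.
        return self.optim_t * self.update_interval

    def actor_step(self):
        # Called once per environment step, by any actor process.
        self._cumulative_steps += 1

    def learner_update(self):
        self.optim_t += 1


agent = Agent()
# Actors take 20 environment steps in total...
for _ in range(20):
    agent.actor_step()
# ...while the (slower) learner completes only 3 updates.
for _ in range(3):
    agent.learner_update()
# _cumulative_steps == 20 but t == 3 * 4 == 12: the counters diverge,
# which is why they cannot be merged into one.
```

In non-actor-learner training the agent steps and updates in lockstep, so the two numbers coincide; only the decoupled actor-learner setting pulls them apart.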