Dear author,
I noticed that in your implementation of SAC, the target value is estimated using two Q-functions. Why is it done this way?
target_q_values = torch.min(
    self.target_qf1(next_obs, new_next_actions),
    self.target_qf2(next_obs, new_next_actions),
) - alpha * new_log_pi
Is this to avoid overestimation of the Q-function, as in double Q-learning?
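
For context, here is my understanding of how that min feeds into the full Bellman target. This is only a sketch under my own assumptions (the function name `sac_target` and the reward/done arguments are mine, not from your code), showing the clipped double-Q trick from TD3 that SAC also uses:

```python
import torch

def sac_target(target_qf1, target_qf2, next_obs, next_actions, log_pi,
               rewards, dones, alpha=0.2, gamma=0.99):
    # Clipped double-Q: element-wise minimum of the two target critics,
    # intended to reduce overestimation bias in the value target.
    min_q = torch.min(target_qf1(next_obs, next_actions),
                      target_qf2(next_obs, next_actions))
    # Soft value: subtract the entropy term alpha * log_pi (SAC).
    soft_v = min_q - alpha * log_pi
    # Standard Bellman backup; dones masks out bootstrapping at terminals.
    return rewards + gamma * (1.0 - dones) * soft_v
```

If my reading is right, each critic is then regressed toward this single shared target, rather than each toward its own target as in classic double Q-learning.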