Open HunderlineK opened 2 years ago
I'm not 100% sure if this change is correct, but in the code we have:
    def compute_loss(obs, act, weights):
        logp = get_policy(obs).log_prob(act)
        return -(logp * weights).mean()
Here the loss is averaged over the total number of observations in the batch.
I have also tried averaging over just D, and with that change the result doesn't improve properly.
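
To make the two normalizations concrete, here is a minimal sketch in plain Python. It assumes D means the set of trajectories in the batch, which the post does not define, and the helper names and batch values below are hypothetical, not taken from the original code. Both versions compute the same sum and differ only in the divisor, so the per-trajectory loss is the per-timestep loss multiplied by the average trajectory length.

```python
import math

def loss_mean_over_timesteps(logp, weights):
    """Mirrors -(logp * weights).mean(): divide the sum by the number of timesteps."""
    total = sum(l * w for l, w in zip(logp, weights))
    return -total / len(logp)

def loss_mean_over_trajectories(logp, weights, num_trajectories):
    """Same sum, but divided by the number of trajectories (assumed meaning of D)."""
    total = sum(l * w for l, w in zip(logp, weights))
    return -total / num_trajectories

# Hypothetical batch: 2 trajectories with 3 and 2 timesteps (5 timesteps total).
logp = [-0.5, -1.0, -0.2, -0.7, -0.3]
weights = [3.0, 3.0, 3.0, 2.0, 2.0]
num_trajectories = 2

per_step = loss_mean_over_timesteps(logp, weights)
per_traj = loss_mean_over_trajectories(logp, weights, num_trajectories)

# The two losses differ by a constant factor: timesteps per trajectory.
ratio = per_traj / per_step
```

Since the factor is constant within a batch, the gradient direction is the same and only its scale changes. With an unchanged learning rate, dividing by the number of trajectories therefore takes larger steps, which may be one reason the results look worse without retuning, though that is a guess rather than something confirmed here.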