In sample_trajectories, during the explore phase, model_fwrd_probs is mixed with a uniform distribution, and the resulting model_fwrd_probs sums to more than one.
Is this intended? If so, what is the logic behind it?
I see that mask_and_norm_forward_actions normalizes the probabilities back to one, but why are they un-normalized in the first place?
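
To make the question concrete, here is a minimal sketch of what I expected. All names and values below are hypothetical and are not taken from the repository's code. A convex mix of two normalized distributions should still sum to one, whereas adding the uniform term without rescaling the model probabilities would not. I don't know which of these, if either, matches what sample_trajectories actually does.

```python
def mix_with_uniform(probs, eps):
    """Convex mix: (1 - eps) * probs + eps * uniform over the same actions."""
    n = len(probs)
    return [(1.0 - eps) * p + eps / n for p in probs]


def add_uniform(probs, eps):
    """Non-convex variant: adds eps * uniform without rescaling probs."""
    n = len(probs)
    return [p + eps / n for p in probs]


# Hypothetical inputs: a normalized forward distribution and an exploration weight.
model_probs = [0.7, 0.2, 0.1]
eps = 0.1

# The convex mix keeps the total at 1 (up to floating-point rounding).
convex_sum = sum(mix_with_uniform(model_probs, eps))

# The additive variant yields a total of 1 + eps, so it needs renormalizing.
additive_sum = sum(add_uniform(model_probs, eps))
```

If the mixing step behaves like the second function, or if the uniform term covers a different set of actions than the model probabilities, that could explain the sum above one and why a later normalization is needed.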