There were two issues with the perplexity evaluator:
- It crashed during validation: the list of hypotheses consisted of batches of different lengths, so numpy could not convert it with `np.array`, and `np.mean` then crashed.
- It did not take masking into account, so padded positions, which have an xent of 0.0, were included in the mean.
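The helper below sketches how both issues can be avoided. It assumes that each batch yields an array of per-token cross-entropies (natural log) of shape `(batch, time)`, where `time` may differ between batches. It also assumes a matching 0/1 mask and that perplexity is `exp` of the mean token cross-entropy. `masked_perplexity` is a hypothetical name, not necessarily what the evaluator uses.

```python
import numpy as np


def masked_perplexity(xents, masks):
    """Hypothetical helper: perplexity over all unpadded tokens.

    xents: list of per-batch arrays of token cross-entropies (natural log),
           each shaped (batch, time); time may vary from batch to batch.
    masks: list of arrays with the same shapes, 1 for real tokens, 0 for padding.
    """
    # Flatten each batch before concatenating, so batches of different
    # lengths never have to be stacked into one rectangular np.array.
    flat_xent = np.concatenate([np.asarray(x, dtype=np.float64).ravel() for x in xents])
    flat_mask = np.concatenate([np.asarray(m, dtype=np.float64).ravel() for m in masks])
    # Average only over unpadded positions instead of np.mean over everything,
    # so padded positions with xent 0.0 do not pull the mean down.
    mean_xent = np.sum(flat_xent * flat_mask) / np.sum(flat_mask)
    return float(np.exp(mean_xent))


# Two batches of different lengths; the first has one padded position.
xents = [np.array([[1.0, 1.0, 0.0]]), np.array([[2.0, 2.0]])]
masks = [np.array([[1, 1, 0]]), np.array([[1, 1]])]
print(masked_perplexity(xents, masks))  # exp(1.5)
```

In this example the masked mean cross-entropy is 1.5. Averaging the flattened values including the padded 0.0 would give 1.2, which understates perplexity. The average is weighted per token rather than per batch, so batches with more real tokens count proportionally more.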