Closed by nowittynamesleft 2 years ago
There are other ways to promote longer outputs with different kinds of penalties, but dividing by the length seems to work fine. I do notice more repeats now though, so there is a tradeoff here.
Maybe there can be a fractional exponent hyperparameter on the length divisor that can be tuned to find the best performance on a validation set? Or maybe this problem will go away with a good enough architecture or way of training, rather than by changing the search strategy at inference.
Regardless, this has now been implemented.
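For reference, the fractional-exponent idea above could look roughly like the sketch below. This is only an illustration, not the code from this repo: `normalized_score`, `pick_best`, `alpha`, and the candidate tuples are hypothetical names, and scores are assumed to be summed log-probabilities (higher is better).

```python
from typing import List, Tuple


def normalized_score(log_prob_sum: float, length: int, alpha: float) -> float:
    """Divide a summed log-probability by length**alpha.

    alpha = 0 leaves the score unchanged, alpha = 1 is plain division
    by the length, and values in between are a softer penalty.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return log_prob_sum / (length ** alpha)


def pick_best(candidates: List[Tuple[str, float, int]], alpha: float) -> str:
    """Return the name of the candidate with the highest normalized score.

    Each candidate is (name, summed log-probability, length in tokens).
    """
    best = max(candidates, key=lambda c: normalized_score(c[1], c[2], alpha))
    return best[0]


# Hypothetical candidates: a short sequence and a longer one.
candidates = [("short", -2.0, 2), ("long", -3.0, 6)]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, pick_best(candidates, alpha))
```

With these made-up numbers, `alpha = 0.0` prefers the short candidate and `alpha = 1.0` prefers the long one. Tuning would mean sweeping `alpha` and keeping the value that scores best on a validation set.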
Give some kind of shortness penalty to the scores.
Given a sequence's negative log-likelihood, divide it by the sequence length. This makes longer sequences more likely to be selected, because the per-token value is smaller in magnitude than the summed one, which grows with every extra token.
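A small sketch of what the division does, with made-up per-token log-probabilities. The function names and the numbers are assumptions for illustration only. Scores here are log-likelihoods, so higher is better.

```python
from typing import List


def total_log_likelihood(token_log_probs: List[float]) -> float:
    """Summed log-likelihood: gets more negative with every extra token."""
    return sum(token_log_probs)


def per_token_log_likelihood(token_log_probs: List[float]) -> float:
    """Summed log-likelihood divided by the sequence length."""
    if not token_log_probs:
        raise ValueError("sequence must contain at least one token")
    return sum(token_log_probs) / len(token_log_probs)


# Hypothetical sequences: two mediocre tokens vs. six fairly confident ones.
short_seq = [-1.0, -1.0]
long_seq = [-0.5] * 6

print(total_log_likelihood(short_seq), total_log_likelihood(long_seq))
print(per_token_log_likelihood(short_seq), per_token_log_likelihood(long_seq))
```

Ranked by the raw sum, the short sequence wins only because it has fewer terms. Ranked by the per-token value, the longer sequence wins.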