harvardnlp / TextFlow


Treatment of <unk> token in PTB #2

Closed hyunjik11 closed 4 years ago

hyunjik11 commented 4 years ago

Looking at https://github.com/harvardnlp/TextFlow/blob/master/main.py#L124, it seems that for the PTB data the vocab_size is 51, and the vocabulary includes not only <unk> but also '<' and '>'. However, in the PTB dataset '<' and '>' only seem to occur inside <unk> and never on their own, so the correct vocab size should be 49. Looking at the integer representation of the characters, the integer corresponding to <unk> is never used; <unk> in the text is always represented as 5 integers, corresponding to '<', 'u', 'n', 'k', '>'. This may be a minor bug, but given the high frequency of <unk> in the PTB dataset, I suspect it might inflate the log likelihoods compared to the correct representation.
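For reference, here is a rough sketch of the kind of check I have in mind (it assumes a local copy of the raw PTB character-level training split, which I refer to as ptb.train.txt; this is not TextFlow code):

```python
# Rough sketch, not TextFlow code: check how '<' and '>' are used in the raw PTB text.
import re
from collections import Counter

with open("ptb.train.txt") as f:  # assumed filename for the PTB training split
    text = f.read()

# If the claim holds, every '<' and '>' belongs to an '<unk>' occurrence.
n_unk = len(re.findall(r"<unk>", text))
char_counts = Counter(text)
print("'<':", char_counts["<"], "'>':", char_counts[">"], "'<unk>':", n_unk)

# Distinct characters once '<unk>' is stripped out; adding a single atomic
# <unk> symbol on top of this set gives the vocabulary size proposed above.
print("distinct chars excluding '<unk>':", len(set(text.replace("<unk>", ""))))
```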

On the other hand, because the maximum sequence length is 288, the 5-character representation means you get slightly less data than with the fix. I think this effect would be negligible compared to the inflation above, but it would be worth verifying.

zackziegler95 commented 4 years ago

That is correct, and it is not a bug. The standard when evaluating on PTB is to represent <unk> with the five characters < u n k > (see e.g. https://arxiv.org/pdf/1803.08240.pdf), so our implementation is consistent with the literature. While this seems a bit silly, the reason it is done (at least originally) is that it allows a meaningful probabilistic comparison between character-level and word-level models, isolated from the issue of a closed vs. open vocabulary.
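To make the two conventions concrete, here is a small illustrative sketch (the example line and helper code are made up for illustration, not taken from TextFlow):

```python
import re

line = "the <unk> bought shares of <unk>"

# Convention used in the literature (and here): '<unk>' is spelled out,
# contributing the five characters '<', 'u', 'n', 'k', '>'.
spelled_out = list(line)

# The alternative raised in the issue: treat '<unk>' as one atomic symbol.
atomic = []
for piece in re.split(r"(<unk>)", line):
    if piece == "<unk>":
        atomic.append(piece)
    else:
        atomic.extend(piece)

# Each '<unk>' is 4 symbols shorter under the atomic treatment, which changes
# both the vocabulary and the per-character likelihoods being compared.
print(len(spelled_out), len(atomic))
```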

Here we use torchtext, which automatically adds a separate <unk> token, but it is never used and the LM immediately assigns it 0 probability, so it doesn't matter that the extra token technically exists in the vocabulary. If it were assigned probability > 0, that would only hurt PPL overall.
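As a toy illustration of why the unused entry is harmless (shapes and the unused index are assumed for the example; this is not TextFlow code), the extra vocabulary slot only affects the loss through whatever probability mass the model wastes on it:

```python
import torch
import torch.nn.functional as F

vocab_size = 51    # includes the extra torchtext token
unused_idx = 0     # hypothetical index of the token that never appears in the data

logits = torch.randn(8, vocab_size)           # fake per-character predictions
targets = torch.randint(1, vocab_size, (8,))  # targets never equal unused_idx

# Drive the unused token's logit toward -inf: its probability goes to ~0 and
# the cross-entropy on the real characters drops accordingly.
suppressed = logits.clone()
suppressed[:, unused_idx] = -1e9

print(F.cross_entropy(logits, targets).item())      # some mass wasted on unused_idx
print(F.cross_entropy(suppressed, targets).item())  # what a well-trained LM approaches
```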

hyunjik11 commented 4 years ago

Thanks for letting me know, and for the reference! Indeed I find it surprising that this is the convention adopted in the literature, especially given the high frequency of <unk> in PTB. I imagine it will inflate PPL quite a bit.