Closed · wwt17 closed this 1 year ago
Name | Link |
---|---|
Latest commit | 5a6c9162c5a254931186de54b1b4c6b00726a6c0 |
Latest deploy log | https://app.netlify.com/sites/pyminicons/deploys/64f799d53b41370008add4a5 |
makes sense -- yeah I should have taken care of the redundant code there lol -- thanks for the PR!
The bug occurs when the sequence does not end with the eos token: instead of dropping the logits at the last token position, the code removes the logit of the last token in the vocabulary.

The problem is that `logit.squeeze(0)` has only two dimensions, of size `(max_length, vocab_size)`. So `logit.squeeze(0)[:, :-1]` does not remove the logits at the last token position; it just shrinks the vocabulary by dropping its last entry. This is definitely unintended. Also, it doesn't matter whether `idx[-1] == self.tokenizer.eos_token_id` (and the last token position is dropped) here, because the subsequent `[torch.arange(offset, length),]` subscript has the same effect. So the solution is to simply keep only the first clause.
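For anyone skimming this later, here is a minimal PyTorch sketch of the shape issue (the sizes and the `offset`/`length` values are made up for illustration; this is not the library's actual code):

```python
import torch

# Toy shapes only -- names and sizes are illustrative, not minicons' actual variables.
max_length, vocab_size = 5, 10
logit = torch.randn(1, max_length, vocab_size)   # model output for a batch of one sequence
offset, length = 1, 4                            # positions to score: offset .. length-1

squeezed = logit.squeeze(0)                      # shape: (max_length, vocab_size)

# Buggy clause: slices the vocabulary axis, not the token axis.
print(squeezed[:, :-1].shape)                    # torch.Size([5, 9])  -> vocab shrunk, all positions kept

# What was intended: drop the last token position.
print(squeezed[:-1, :].shape)                    # torch.Size([4, 10])

# The later advanced indexing selects only rows offset..length-1,
# so the last position is excluded either way.
print(squeezed[torch.arange(offset, length), ].shape)  # torch.Size([3, 10])
```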