kanishkamisra / minicons

Utility for behavioral and representational analyses of Language Models
https://minicons.kanishka.website
MIT License
122 stars 29 forks source link

fix a bug in IncrementalLMScorer #34

Closed wwt17 closed 1 year ago

wwt17 commented 1 year ago

the bug occurs when the sequence does not end with eos token so that the logit in the last token in the vocab is removed.

The problem is, logit.squeeze(0) has only two dimensions and of size (max_length, vocab_size). So logit.squeeze(0)[:, :-1] do not remove the logits on the last token but just shrink the vocabulary by removing the last token in the vocabulary. This is definitely unintended. Also, it doesn't matter if idx[-1] == self.tokenizer.eos_token_id (and the last token is removed) here, because [torch.arange(offset, length),] will have the same effect of subscription.

So the solution is to simply keep only the first clause.

netlify[bot] commented 1 year ago

Deploy Preview for pyminicons processing.

Name Link
Latest commit 5a6c9162c5a254931186de54b1b4c6b00726a6c0
Latest deploy log https://app.netlify.com/sites/pyminicons/deploys/64f799d53b41370008add4a5
netlify[bot] commented 1 year ago

Deploy Preview for pyminicons canceled.

Name Link
Latest commit 5a6c9162c5a254931186de54b1b4c6b00726a6c0
Latest deploy log https://app.netlify.com/sites/pyminicons/deploys/64f799d53b41370008add4a5
kanishkamisra commented 1 year ago

makes sense -- yeah I should have taken care of the redundant code there lol -- thanks for the PR!