Add `within_word_l2r` pseudo-log-likelihood scoring method for masked language models

carina-kauf commented 1 year ago

This PR adds an improved pseudo-log-likelihood (PLL) scoring method for masked language models (Kauf & Ivanova, 2023) to the MaskedLMScorer class in the scorer module.

Key addition: `PLL_metric='within_word_l2r'` scoring option

The new optional string argument, PLL_metric, takes one of two values:

'original' (default) : this option implements the original pseudo-log-likelihood scoring for masked language models, following Salazar et al. (2020).
'within_word_l2r' : this option implements the improved pseudo-log-likelihood scoring for masked language models, following Kauf & Ivanova (2023); paper to appear in the proceedings of ACL 2023. The method uses a locally autoregressive scoring strategy to avoid overestimating the probabilities of tokens in multi-token words. Specifically, each token's probability is estimated from the bidirectional context, excluding any future tokens that belong to the same word as the current target token.
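To make the difference between the two options concrete, here is a minimal sketch of which positions would be hidden when scoring each token of "souvenir". It is not the code in this PR: `positions_to_mask` is a hypothetical helper, and the word ids are written by hand in the style of a Hugging Face fast tokenizer's `word_ids()` (with `None` for special tokens), which is an assumption about the input format.

```python
from typing import List, Optional


def positions_to_mask(word_ids: List[Optional[int]], target: int,
                      pll_metric: str = "original") -> List[int]:
    """Hypothetical helper: positions hidden when scoring the token at `target`."""
    if word_ids[target] is None:
        raise ValueError("special tokens are not scored")
    masked = [target]  # both metrics mask the target token itself
    if pll_metric == "within_word_l2r":
        # Additionally hide later tokens that belong to the same word.
        masked += [i for i in range(target + 1, len(word_ids))
                   if word_ids[i] == word_ids[target]]
    elif pll_metric != "original":
        raise ValueError(f"unknown PLL_metric: {pll_metric!r}")
    return masked


# Hand-written tokenization of 'The traveler lost the souvenir.' (assumed layout).
tokens = ["[CLS]", "the", "traveler", "lost", "the",
          "so", "##uven", "##ir", ".", "[SEP]"]
word_ids = [None, 0, 1, 2, 3, 4, 4, 4, 5, None]

for target in (5, 6, 7):  # the three pieces of 'souvenir'
    for metric in ("original", "within_word_l2r"):
        masked = positions_to_mask(word_ids, target, metric)
        view = ["[MASK]" if i in masked else tok
                for i, tok in enumerate(tokens)]
        print(f"{metric:>16} | target={tokens[target]:<7} | {' '.join(view)}")
```

Under 'original', the model sees "##uven" and "##ir" when predicting "so", which makes the word easy to guess; under 'within_word_l2r' those pieces are hidden, while earlier pieces of the same word stay visible.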

The optional PLL_metric string argument was added to the following functions within the MaskedLMScorer class:

prepare_text: key update to attention masks (mask out future subword tokens, leveraging word ids)
sequence_score: now calls the prepare_text function with the PLL_metric argument
token_score: now calls the prepare_text function with the PLL_metric argument

Usage

from scorer import MaskedLMScorer
mlm_model = MaskedLMScorer('bert-base-uncased', 'cpu')

stimuli = ['The traveler lost the souvenir.']

print(mlm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item(), PLL_metric='within_word_l2r'))
'''
[32.77983617782593]
'''

print(mlm_model.token_score(stimuli, PLL_metric='within_word_l2r'))
'''
[[('the', -0.07324600219726562), ('traveler', -9.668401718139648), ('lost', -6.955361366271973),
('the', -1.1923179626464844), ('so', -7.776356220245361), ('##uven', -6.989711761474609),
('##ir', -0.037807464599609375), ('.', -0.08663368225097656)]]
'''
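As a sanity check on the two outputs above, the sequence score should equal the negated sum of the per-token scores, since the `reduction` passed to `sequence_score` is `-x.sum(0)`. The sketch below only re-adds the numbers printed above; it does not load a model, and the variable names are illustrative.

```python
# Per-token scores copied from the token_score output above.
token_scores = [
    ("the", -0.07324600219726562),
    ("traveler", -9.668401718139648),
    ("lost", -6.955361366271973),
    ("the", -1.1923179626464844),
    ("so", -7.776356220245361),
    ("##uven", -6.989711761474609),
    ("##ir", -0.037807464599609375),
    (".", -0.08663368225097656),
]

# Same reduction as in the sequence_score call: negate the summed log-probabilities.
sequence_score = -sum(score for _, score in token_scores)
print(round(sequence_score, 4))  # about 32.78, in line with the sequence_score output
```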

netlify[bot] commented 1 year ago

Deploy Preview for pyminicons canceled.

Name	Link
Latest commit	5637b27bc178f4224cd8838530c9c47efe07aefe
Latest deploy log	https://app.netlify.com/sites/pyminicons/deploys/648869ec7c15cc000876fd54

kanishkamisra commented 1 year ago

this is brilliant! Thanks @carina-kauf!!

kanishkamisra / minicons