princeton-nlp / MABEL

EMNLP 2022: "MABEL: Attenuating Gender Bias using Textual Entailment Data" https://arxiv.org/abs/2210.14975
MIT License

StereoSet benchmark for GPT2 #8

Open Lj1ang opened 3 weeks ago

Lj1ang commented 3 weeks ago

I noticed that GPT2Tokenizer is used when evaluating GPT-2, and it doesn't have a mask_token. Will this affect the evaluation results? I think I need to add one manually, but I'm not sure which token to use.

jacqueline-he commented 3 weeks ago

Hi, thanks for your interest in our work! :)

That’s a great question! GPT-2 is trained auto-regressively and therefore cannot be evaluated in the same manner as a masked language model. Instead of treating it as a fill-in-the-blank problem, you should compute the probability of the sentence with the blank filled by the stereotypical term, then with the anti-stereotypical term, and score based on which completion the model finds more likely.

I would refer you to Section 6.2 of the original StereoSet paper for more details.
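For concreteness, here is a minimal sketch of the idea (my own illustration, not code from this repo): sum the token log-probabilities GPT-2 assigns to each filled-in sentence, then check which completion is more likely. The example sentences and function names are hypothetical.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def sentence_log_prob(model, tokenizer, text):
    """Total log-likelihood of `text` under an autoregressive LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean
        # cross-entropy over the (seq_len - 1) predicted tokens.
        loss = model(ids, labels=ids).loss
    # Undo the mean to recover the summed log-likelihood.
    return -loss.item() * (ids.size(1) - 1)

def prefers_stereotype(lp_stereo, lp_anti):
    """True if the model assigns higher likelihood to the stereotypical fill."""
    return lp_stereo > lp_anti

# Usage sketch (downloads the GPT-2 checkpoint):
#   tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
#   model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
#   lp_s = sentence_log_prob(model, tokenizer, "The nurse said she was tired.")
#   lp_a = sentence_log_prob(model, tokenizer, "The nurse said he was tired.")
#   prefers_stereotype(lp_s, lp_a)
```

Aggregating `prefers_stereotype` over all StereoSet pairs gives the stereotype score; no mask_token is needed at any point.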

Lj1ang commented 2 weeks ago

Thanks for your reply!