Open Lj1ang opened 3 weeks ago
Hi, thanks for your interest in our work! :)
That’s a great question! GPT-2 is trained auto-regressively and therefore cannot be evaluated in the same manner as a masked language model. Instead of evaluating as a fill-in-the-blank problem, it's recommended that you compute the probability of the sentence when the blank is filled with a stereotypical term, and then with an anti-stereotypical term, and score based on whichever is more likely.
For more details, I would refer you to Section 6.2 of the original StereoSet paper.
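To make the comparison concrete, here is a minimal sketch of the idea, not the repository's actual evaluation code. It scores a sentence with the chain rule (summing log P(token | previous tokens)) and checks which filled-in version is more likely. The names `sentence_logprob`, `prefers_stereotype`, `toy_logprob`, and the `BLANK` placeholder are hypothetical, and the tiny bigram table is a made-up stand-in for GPT-2 so the snippet runs without downloading a model. The real scoring may differ, for example in how it handles multi-token terms or length normalization.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: returns log P(token | prefix) under some causal LM.
LogProbFn = Callable[[Sequence[str], str], float]


def sentence_logprob(tokens: Sequence[str], logprob_fn: LogProbFn) -> float:
    """Chain rule: sum of log P(token_i | tokens before i)."""
    return sum(logprob_fn(tokens[:i], tok) for i, tok in enumerate(tokens))


def prefers_stereotype(
    template: Sequence[str], stereo: str, anti: str, logprob_fn: LogProbFn
) -> bool:
    """Fill BLANK with each term and compare whole-sentence log-probabilities."""

    def fill(term: str) -> List[str]:
        return [term if tok == "BLANK" else tok for tok in template]

    stereo_score = sentence_logprob(fill(stereo), logprob_fn)
    anti_score = sentence_logprob(fill(anti), logprob_fn)
    return stereo_score > anti_score


# Toy stand-in for GPT-2 (assumption): a made-up bigram table.
TOY_BIGRAMS: Dict[Tuple[str, str], float] = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("the", "dog"): 0.1,
    ("cat", "sleeps"): 0.3,
    ("dog", "sleeps"): 0.3,
}


def toy_logprob(prefix: Sequence[str], token: str) -> float:
    """Look up P(token | previous token); unseen pairs get a small floor."""
    prev = prefix[-1] if prefix else "<s>"
    return math.log(TOY_BIGRAMS.get((prev, token), 1e-6))


if __name__ == "__main__":
    template = ["the", "BLANK", "sleeps"]
    print(prefers_stereotype(template, "cat", "dog", toy_logprob))
```

With a real model, `toy_logprob` would be replaced by a function that reads the per-token log-probabilities from GPT-2's output. Note that nothing in this scoring scheme requires a mask token, since each candidate sentence is scored in full from left to right.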
Thanks for your reply!
I noticed that GPT2Tokenizer is used when evaluating GPT2, which doesn't have a mask_token. Will this impact the evaluation result? I think I should add a new one manually but I'm unsure which one I should add.