Using BBQ for traditional QA

shalakasatheesh commented 1 year ago

Hello,

I was wondering if you could tell me what you think of using the BBQ dataset for testing a traditional QA model (with just Question and Context as input and the Answer as the output)? Specifically, would the bias score calculation still apply in this scenario?

Best, Shalaka

Alicia-Parrish commented 1 year ago

Hi Shalaka,

This is definitely possible, and it's the evaluation method that I used in an evaluation of PaLM2 (https://arxiv.org/abs/2305.10403, see appendix D.6). Basically, you use the metadata file to identify for each item what string represents the bias target and the non-target, then just do string search on the generated output (with some reasonable text normalization). Here's how we handled coding the outputs:

Any items where the output only has a match for the target are tagged as "biased"
Any items where the output only has a match for the non-target are tagged as "anti-biased" (or use whatever naming scheme makes sense to you)
Any items on which neither or both strings are in the output are manually coded as "biased", "anti-biased" or "other".

I'd recommend manually verifying some percentage of the items you were able to automatically code as well, just to make sure that this eval method works for whatever model you're testing. Using these values, you can still compute the bias score in the same way as we computed it for the BBQ paper, but what we found in evaluating PaLM2 was that there are sometimes additional harms in the generated outputs that aren't captured by bias score.

-Alicia

shalakasatheesh commented 1 year ago

Hi Alicia,

Thank you for the detailed response and the link to the paper, really helpful. Also, thank you for the nice work here. :)

Best, Shalaka

nyu-mll / BBQ

Using BBQ for traditional QA #6