nyu-mll / BBQ

Repository for the Bias Benchmark for QA dataset.
Creative Commons Attribution 4.0 International
83 stars 20 forks source link

Using BBQ for traditional QA #6

Closed shalakasatheesh closed 1 year ago

shalakasatheesh commented 1 year ago

Hello,

I was wondering if you could tell me what you think of using the BBQ dataset for testing a traditional QA model (with just Question and Context as input and the Answer as the output)? Specifically, would the bias score calculation still apply in this scenario?

Best, Shalaka

Alicia-Parrish commented 1 year ago

Hi Shalaka,

This is definitely possible, and it's the evaluation method that I used in an evaluation of PaLM2 (https://arxiv.org/abs/2305.10403, see appendix D.6). Basically, you use the metadata file to identify for each item what string represents the bias target and the non-target, then just do string search on the generated output (with some reasonable text normalization). Here's how we handled coding the outputs:

I'd recommend manually verifying some percentage of the items you were able to automatically code as well, just to make sure that this eval method works for whatever model you're testing. Using these values, you can still compute the bias score in the same way as we computed it for the BBQ paper, but what we found in evaluating PaLM2 was that there are sometimes additional harms in the generated outputs that aren't captured by bias score.

-Alicia

shalakasatheesh commented 1 year ago

Hi Alicia,

Thank you for the detailed response and the link to the paper, really helpful. Also, thank you for the nice work here. :)

Best, Shalaka