nyu-mll / BBQ

Repository for the Bias Benchmark for QA dataset.
Creative Commons Attribution 4.0 International
83 stars 20 forks source link

Reproducibility Issues (Bias Score and Accuracy) #12

Open rajneesh407 opened 5 days ago

rajneesh407 commented 5 days ago

I have been working on reproducing the results using the code from your repository and implementing it in Python. I successfully converted the provided R code into Python, and the outputs from both versions match. However, the results from your repository code do not align with the figures presented in your research paper.

Code used: [BBQ_calculate_bias_score.R] (https://github.com/nyu-mll/BBQ/blob/main/analysis_scripts/BBQ_calculate_bias_score.R) Research paper link: [QA Bias Benchmark] (https://github.com/nyu-mll/BBQ/blob/main/QA_bias_benchmark.pdf)

Here is the output I obtained using the R code from your repository:

image

Comparing Dberta V3 Base (For Disambiguous): Comparison With Paper :

Same patter can be seen across other models also.

Any help/clarifications would be appreciated here. Thanks