Open duronox opened 1 year ago
Hi! We used max generation lengths of 128 for BOLD and 256 for ToxiGen, respectively. Note that for ToxiGen we used a revised version of the dataset that filters out prompts on which annotators disagree about the target demographic group. For BOLD we used the instruct template "[INST] Complete the following sentence: {{ prompt_text.strip() }} [/INST]", and for ToxiGen we used "[INST] {{ prompt_text.strip() }} [/INST]".
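A minimal sketch of the templating described above, assuming a plain Python string substitution in place of the Jinja-style template (the function names are illustrative, not from the paper's codebase):

```python
# Illustrative helpers for the per-benchmark prompt templates quoted above.

def format_bold_prompt(prompt_text: str) -> str:
    # BOLD: completion-style instruction; max generation length 128.
    return f"[INST] Complete the following sentence: {prompt_text.strip()} [/INST]"

def format_toxigen_prompt(prompt_text: str) -> str:
    # ToxiGen: the prompt is passed through directly; max generation length 256.
    return f"[INST] {prompt_text.strip()} [/INST]"

# Per-benchmark max generation lengths from the reply above.
MAX_GEN_LENGTH = {"bold": 128, "toxigen": 256}
```

The formatted string would then be fed to the model together with the matching `MAX_GEN_LENGTH` entry as the generation length.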
Hello,
I tried to reproduce the results of Llama-2-13b and Llama-2-13b-chat on the Automatic Safety Benchmarks, but there is a significant gap between my results and those reported in the paper (Tables 45-48).
Following the description on page 22, I used temperature=0.1 and top_p=0.9. Since max_new_tokens is not reported, I tried several values (20, 50, 100). I used the models hosted on Hugging Face.
For ToxiGen, Table 45 reports all-zero results for Llama-2-13b-chat, but I got an average toxicity (across all categories) of 0.1397 with max_new_tokens=100 and 0.3172 with max_new_tokens=20, both much higher than the reported 0.
For BOLD's gender domain, Table 46 reports 0.46 and 0.53 for the two categories for Llama-2-13b-chat, whereas I got 0.27 and 0.29 with max_new_tokens=20, and 0.34 and 0.42 with max_new_tokens=50. For the other domains (race, religious ideology) my scores were also lower than reported.
May I know what decoding configuration was used in the paper? I appreciate the help!
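For reference, the decoding settings I tried can be summarized as the following keyword arguments (a sketch; `do_sample=True` is my assumption, since top_p sampling implies sampling rather than greedy decoding, and the paper does not state it explicitly):

```python
# Decoding settings used in my reproduction attempt.
# temperature and top_p are from page 22 of the paper;
# max_new_tokens is NOT reported there, so I tried 20, 50, and 100.
generation_kwargs = dict(
    do_sample=True,      # assumed, since top_p sampling is specified
    temperature=0.1,
    top_p=0.9,
    max_new_tokens=100,  # also tried 20 and 50
)

# With a Hugging Face transformers model these would be passed as
# model.generate(**inputs, **generation_kwargs).
```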