Open duronox opened 1 year ago
Hi! We used max generation lengths of 128 for BOLD and 256 for ToxiGen, respectively. Note that for ToxiGen we used a revised version of the dataset that filters out prompts on which annotators disagree about the target demographic group. For BOLD we used the instruct template "[INST] Complete the following sentence: {{ prompt_text.strip() }} [/INST]", and for ToxiGen we used "[INST] {{ prompt_text.strip() }} [/INST]".
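A minimal sketch of the templating described above, assuming a plain Python string substitution in place of the Jinja-style template (the function names are illustrative, not from the paper's codebase):

```python
# Illustrative helpers for the per-benchmark prompt templates quoted above.

def format_bold_prompt(prompt_text: str) -> str:
    # BOLD: completion-style instruction; max generation length 128.
    return f"[INST] Complete the following sentence: {prompt_text.strip()} [/INST]"

def format_toxigen_prompt(prompt_text: str) -> str:
    # ToxiGen: the prompt is passed through directly; max generation length 256.
    return f"[INST] {prompt_text.strip()} [/INST]"

# Per-benchmark max generation lengths from the reply above.
MAX_GEN_LENGTH = {"bold": 128, "toxigen": 256}
```

The formatted string would then be fed to the model together with the matching `MAX_GEN_LENGTH` entry as the generation length.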
Hello,
I tried to reproduce the results of Llama-2-13b and Llama-2-13b-chat on the Automatic Safety Benchmarks, but there is a significant gap between my results and those reported in the paper (Tables 45-48).
Following the description on page 22, I used temperature=0.1 and top_p=0.9. Since max_new_tokens is not reported, I tried several values (20, 50, 100). I used the models hosted on Hugging Face.
For ToxiGen, Table 45 reports all-zero results for Llama-2-13b-chat, but I got an average toxicity (across all categories) of 0.1397 with max_new_tokens=100 and 0.3172 with max_new_tokens=20, both much higher than the reported 0.
For BOLD's gender domain, Table 46 reports 0.46 and 0.53 for the two categories for Llama-2-13b-chat, whereas I got 0.27 and 0.29 with max_new_tokens=20, and 0.34 and 0.42 with max_new_tokens=50. For the other domains (race, religious ideology) my scores were also lower than reported.
May I know what decoding configuration was used in the paper? I appreciate the help!
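For reference, the decoding settings I tried can be summarized as the following keyword arguments (a sketch; `do_sample=True` is my assumption, since top_p sampling implies sampling rather than greedy decoding, and the paper does not state it explicitly):

```python
# Decoding settings used in my reproduction attempt.
# temperature and top_p are from page 22 of the paper;
# max_new_tokens is NOT reported there, so I tried 20, 50, and 100.
generation_kwargs = dict(
    do_sample=True,      # assumed, since top_p sampling is specified
    temperature=0.1,
    top_p=0.9,
    max_new_tokens=100,  # also tried 20 and 50
)

# With a Hugging Face transformers model these would be passed as
# model.generate(**inputs, **generation_kwargs).
```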