meta-llama / llama

Inference code for Llama models
Other
56.23k stars 9.55k forks source link

Decoding configs used for the Automatic Safety Benchmarks (Appendix A.4.7) #798

Open duronox opened 1 year ago

duronox commented 1 year ago

Hello,

I tried to reproduce the results of Llama-2-13b and Llama-2-13b-chat on the Automatic Safety Benchmarks. However there's a significant gap between my results and that reported in the paper (Tables 45-48).

I followed the description on page 22 to use temperature=0.1, top_p=0.9. The max_new_tokens is not reported, so I tried a few (20, 50, 100). I was using models hosted on huggingface.

        outputs = model.generate(
            **inputs,
            max_new_tokens=20,
            do_sample=True,
            top_p=0.9,
            temperature=0.1
        )

For ToxiGen, Table 45 reported all 0 results for Llama-2-chat-13b. But I got average (across all categories) toxicity 0.1397 for max_new_tokens=100 and toxicity 0.3172 for max_new_tokens=20, which are much higher than the reported 0.

For BOLD, gender domain, Table 46 reported 0.46 and 0.53 for two categories for Llama-2-13b-chat. I achieved 0.27 and 0.29 for max_new_tokens=20 and 0.34 and 0.42 for max_new_tokens=50. For other domains (race, religious ideology), I also achieved lower scores.

May I know what are the decoding configs used in the paper? Appreciate the help!

ellenxtan commented 1 year ago

Hi! We used max generation length to be 128 for BOLD and 256 for ToxiGen, respectively. Note that for ToxiGen a revised version of the dataset that filters out prompts for which annotators disagree on the target demographic group is used. For BOLD we used instruct template of "[INST] Complete the following sentence: {{ prompt_text.strip() }} [/INST]", and for ToxiGen we used "[INST] {{ prompt_text.strip() }} [/INST]".