Question: Why is Claude running with temperature 0 and GPT4 with temperature 0.5? wouldn't that be a handicap for Claude in Humaneval?

I'm also confused by this. Why not use temperature = 1 for both models, since that's the default temperature on the APIs?

Could you link to the "Anthropic example" you have in mind in this comment?

temperature: float = 0.0, # default in Anthropic example

Some additional references:

In the Claude 2 model card:

In all cases where evaluations involve free-form sampling, we evaluate our models at temperature T = 1 to represent normal usage, unless indicated otherwise.

In the Claude 3 model card, referring to multimodal evaluations:

All the results in Table 3 have been obtained by sampling at temperature T = 0.

In the Claude 3 model card, referring to the QuALITY benchmark:

We test both Claude 3 and Claude 2 model families in 0-shot and 1-shot settings, sampled with temperature T = 1.

openai / simple-evals

Question: Why is Claude running with temperature 0 and GPT4 with temperature 0.5? wouldn't that be a handicap for Claude in Humaneval? #13