Closed chi2liu closed 1 year ago
Sorry, that was a typo; the task is boolq. We did our evaluation in JAX, so there could be slight differences due to numerical precision. Also, please note that to correctly evaluate our model in lm-eval-harness, you need to change the lm-eval-harness code to avoid using the Hugging Face auto-converted fast tokenizer, as that tokenizer sometimes produces incorrect tokens. See this issue for more details.
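A minimal sketch of the workaround, assuming lm-eval-harness's Hugging Face model wrapper forwards tokenizer keyword arguments to `transformers.AutoTokenizer.from_pretrained` (the model id and helper name below are illustrative):

```python
# Hedged sketch: passing use_fast=False to AutoTokenizer.from_pretrained
# forces the original (slow) SentencePiece tokenizer, avoiding the
# auto-converted fast tokenizer that can produce incorrect tokens:
#
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained(
#       "openlm-research/open_llama_3b", use_fast=False
#   )

def slow_tokenizer_kwargs():
    """Keyword arguments to force the slow tokenizer (hypothetical helper;
    assumes the harness forwards extra kwargs to from_pretrained)."""
    return {"use_fast": False}
```

The same effect can usually be achieved by editing the single call site in the harness where the tokenizer is constructed, rather than patching every task.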
We cannot find "ddboolq" in lm-evaluation-harness.
We can only find the boolq task in the task list. We ran boolq for open-llama-3b, and the result is different.
So we want to know: what is ddboolq in the evaluation?