Closed by ggbetz 1 month ago
We've now had a closer look. It appears that mainly base models, which have not been instruction-tuned, fail to follow the instruction to reason step by step, i.e. they fail to generate reasoning traces at all – which makes sense.
In the Dataset Viewer of cot-leaderboard/cot-eval-traces-2.0, it appears that many reasoning traces are empty. Is this a bug?