etr2460 closed this 6 months ago
This is the behavior MMMU uses for evaluation, so we should match it here.
As an example, this increased the mmmu-music benchmark score from 0.3666 to 0.4, because multiple questions in that benchmark were left unanswered by the model.
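
For illustration only, here is a minimal sketch of what matching that behavior could look like, assuming the behavior in question is falling back to a random option when no answer can be parsed from the model's response. The function name `pick_choice`, its parameters, and the parsing patterns are hypothetical and are not taken from this PR or from the MMMU code.

```python
import random
import re


def pick_choice(response, choices, rng=random):
    """Return the option letter found in `response`, or a random one if none is found.

    Hypothetical helper: the parsing rules here are simplified assumptions.
    """
    # Prefer an explicit "(B)" style answer, then a standalone capital letter.
    for pattern in (r"\(([A-Z])\)", r"\b([A-Z])\b"):
        for letter in re.findall(pattern, response):
            if letter in choices:
                return letter
    # Nothing parseable: pick a random option instead of scoring a guaranteed miss.
    return rng.choice(choices)


if __name__ == "__main__":
    options = ["A", "B", "C", "D"]
    print(pick_choice("The answer is (B).", options))   # parsed from the response
    print(pick_choice("", options, random.Random(0)))   # unanswered, random fallback
```

With a fallback like this, an unanswered four-option question is scored correct about a quarter of the time instead of never, which is consistent with a small score increase on a benchmark with several unanswered questions.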