Open lhoestq opened 2 years ago
Why not add max_examples as part of the config name?
Yup it can also work, and maybe it's simpler this way. Opening a PR to fix bigbench instead of https://github.com/huggingface/datasets/pull/4463
Hi @lhoestq,
Thank you for taking a look at this issue, and proposing a solution. Unfortunately, after trying the fix in #4465 I still see the same issue.
I think there is some subtlety where the config name gets overwritten somewhere when BUILDER_CONFIGS (link) is defined.
If I print out self.config.name in the current version (with the fix in #4465), I see just the task name, but if I comment out BUILDER_CONFIGS, num_shots and max_examples get appended, as #4465 intended.
I haven't managed to track down where this happens, but I thought you might know?
(Another comment on your fix: the name variable is used to fetch the task from the bigbench API, so modifying it causes an error if it's actually called. This can easily be fixed by having a config_name variable in addition to the task_name.)
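To illustrate that last point, here is a minimal sketch of keeping the two names separate. Everything in it is hypothetical: TASKS stands in for the task lookup (it is not the bigbench API), and build_config_name with its suffix format is an assumption, not what #4465 actually does.

```python
from typing import Optional

# Hypothetical registry standing in for the task lookup; not the bigbench API.
TASKS = {"some_task": ["example"] * 5}


def build_config_name(task_name: str, num_shots: int = 0,
                      max_examples: Optional[int] = None) -> str:
    """Derive a distinct config name; the suffix format is an assumption."""
    parts = [task_name, f"num_shots={num_shots}"]
    if max_examples is not None:
        parts.append(f"max_examples={max_examples}")
    return "-".join(parts)


def load_task(task_name: str, max_examples: Optional[int] = None) -> list:
    # Look up with the unmodified task name, never with the config name.
    examples = TASKS[task_name]
    return examples if max_examples is None else examples[:max_examples]


task_name = "some_task"
config_name = build_config_name(task_name, num_shots=1, max_examples=2)
examples = load_task(task_name, max_examples=2)
```

The task lookup only ever sees task_name, so appending parameters to config_name cannot break it.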
As noticed in https://github.com/huggingface/datasets/pull/4125, when a dataset config class has a parameter that reduces the number of examples (e.g. named max_examples), loading the dataset and passing max_examples raises NonMatchingSplitsSizesError.

This is because it checks against the expected number of examples of the config with the same name, without taking the max_examples parameter into account. This can be fixed by checking the expected number of examples using the config id instead of the name. Indeed, the config id corresponds to the config name + an optional suffix that depends on the config parameters.
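
A toy sketch of the idea, not the actual library code: ToyConfig, SizeMismatch, verify and the suffix format are all made up for illustration. It only shows why looking up the recorded sizes by name fails once max_examples is passed, while looking them up by id does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for a dataset builder config."""
    name: str
    max_examples: Optional[int] = None

    @property
    def config_id(self) -> str:
        # Name plus an optional suffix derived from the parameters
        # (the exact suffix format here is an assumption).
        if self.max_examples is None:
            return self.name
        return f"{self.name}-max_examples={self.max_examples}"


class SizeMismatch(Exception):
    """Toy counterpart of a split-size verification error."""


def verify(expected_sizes: dict, key: str, actual: int) -> None:
    # Only compare when a size was recorded under this exact key.
    expected = expected_sizes.get(key)
    if expected is not None and expected != actual:
        raise SizeMismatch(f"{key}: expected {expected}, got {actual}")


# Sizes recorded for the full config, without max_examples.
recorded = {"some_task": 100}
config = ToyConfig("some_task", max_examples=10)
actual = min(100, config.max_examples)  # the parameter truncates the split
```

Calling verify(recorded, config.name, actual) raises SizeMismatch because the name matches the full-size record, whereas verify(recorded, config.config_id, actual) finds no record under the suffixed id and passes.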