huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

BigBench: NonMatchingSplitsSizesError when passing a dataset configuration parameter #4462

Open lhoestq opened 2 years ago

lhoestq commented 2 years ago

As noticed in https://github.com/huggingface/datasets/pull/4125, when a dataset config class has a parameter that reduces the number of examples (e.g. one named max_examples), loading the dataset and passing max_examples raises NonMatchingSplitsSizesError.

This is because the check looks up the expected number of examples for the config with the same name, without taking the max_examples parameter into account. This can be fixed by looking up the expected number of examples using the config id instead of the config name. Indeed, the config id corresponds to the config name plus an optional suffix that depends on the config parameters.
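To make the name-versus-id distinction concrete, here is a minimal, self-contained sketch. It does not use the `datasets` library: `EXPECTED_BY_NAME`, `config_id`, and `verify` are hypothetical stand-ins, and the hash-based suffix is only an assumption about what "a suffix that depends on the config parameters" could look like.

```python
import hashlib
import json

# Hypothetical stand-in for recorded split sizes, keyed by config name.
EXPECTED_BY_NAME = {"some_task": {"train": 1000}}


def config_id(name, **params):
    """Toy config id: the name plus a suffix derived from the set parameters."""
    params = {k: v for k, v in params.items() if v is not None}
    if not params:
        return name
    digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode())
    return f"{name}-{digest.hexdigest()[:8]}"


def verify(expected, key, actual):
    """True when the recorded sizes match, or when nothing is recorded for key."""
    recorded = expected.get(key)
    return recorded is None or recorded == actual


# Sizes produced by a (hypothetical) builder called with max_examples=10.
actual = {"train": 10}

# Lookup by name compares against the full-size record and fails.
by_name_ok = verify(EXPECTED_BY_NAME, "some_task", actual)

# Lookup by id finds no record for the parameterized config, so nothing mismatches.
by_id_ok = verify(EXPECTED_BY_NAME, config_id("some_task", max_examples=10), actual)
```

In this toy model the lookup by name is the one that would correspond to NonMatchingSplitsSizesError, while the lookup by id skips the comparison because the parameterized config has no recorded sizes.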

albertvillanova commented 2 years ago

Why not add max_examples to the config name?

lhoestq commented 2 years ago

Yup, that can also work, and maybe it's simpler this way. Opening a PR to fix bigbench instead of https://github.com/huggingface/datasets/pull/4463

andersjohanandreassen commented 2 years ago

Hi @lhoestq,

Thank you for taking a look at this issue, and proposing a solution. Unfortunately, after trying the fix in #4465 I still see the same issue.

I think there is a subtlety here: the config name seems to get overwritten somewhere when BUILDER_CONFIGS is defined.

If I print self.config.name in the current version (with the fix in #4465), I see just the task name, but if I comment out BUILDER_CONFIGS, num_shots and max_examples get appended, as #4465 intended.

I haven't managed to track down where this happens, but I thought you might know?

(Another comment on your fix: the name variable is used to fetch the task from the bigbench API, so modifying it causes an error when the task is actually fetched. This can easily be fixed by having a config_name variable in addition to task_name.)
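A rough sketch of that separation, assuming a plain dataclass rather than the real builder config class: `ToyBigBenchConfig` and the suffix format are hypothetical and are not taken from #4465.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyBigBenchConfig:
    """Hypothetical config that keeps the task name and the config name apart."""

    task_name: str  # passed unchanged to the task-loading API
    num_shots: int = 0
    max_examples: Optional[int] = None

    @property
    def config_name(self) -> str:
        # Append only the parameters that differ from their defaults, so the
        # default config keeps the bare task name.
        parts = [self.task_name]
        if self.num_shots:
            parts.append(f"num_shots={self.num_shots}")
        if self.max_examples is not None:
            parts.append(f"max_examples={self.max_examples}")
        return "_".join(parts)


default_cfg = ToyBigBenchConfig("some_task")
small_cfg = ToyBigBenchConfig("some_task", num_shots=3, max_examples=10)
```

With this layout the code that fetches the task would read `task_name`, while anything that needs a name unique per parameter combination would read `config_name`.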