huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

`from_generator` does not allow specifying the split name #7033

Closed pminervini closed 3 days ago

pminervini commented 2 weeks ago

Describe the bug

I'm building train, dev, and test splits using `from_generator`; however, in all three cases the logger prints `Generating train split:`. It's not possible to change the split name, since it appears to be hardcoded: https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/generator/generator.py

Steps to reproduce the bug

In [1]: from datasets import Dataset

In [2]: def gen():
   ...:     yield {"pokemon": "bulbasaur", "type": "grass"}
   ...: 

In [3]: ds = Dataset.from_generator(gen)
Generating train split: 1 examples [00:00, 133.89 examples/s]

Expected behavior

It should be possible to specify any split name when calling `from_generator`.

Environment info

albertvillanova commented 2 weeks ago

Thanks for reporting, @pminervini.

I agree we should give the option to define the split name.

Indeed, there is a PR that addresses precisely this issue:

I am reviewing it.

pminervini commented 3 days ago

Booom! thank you guys :)