huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Different objects are returned from calls that should be returning the same kind of object. #6350

Open phalexo opened 1 year ago

phalexo commented 1 year ago

Describe the bug

    1.  dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", cache_dir=training_args.cache_dir, split='train[:1%]')
    2.  dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", cache_dir=training_args.cache_dir)

The only difference I would expect these calls to have is the size of the dataset.

But while 2. returns a dictionary with a "train" key in it, 1. returns a dataset WITHOUT any top-level "train" key.

Both calls are to be used within exactly the same context. They should return identically structured datasets of different size.

Steps to reproduce the bug

See above.

Expected behavior

I expect both calls to return the same Dataset structure but with a different number of elements, i.e. call 1. should have 1% of the data of call 2.

Environment info

Ubuntu 20.04 gcc 9.x.x.

It is really irrelevant.

mariosasko commented 1 year ago

load_dataset returns a DatasetDict object unless split is defined, in which case it returns a Dataset (or a list of datasets if split is a list). We've discussed dropping DatasetDict from the API in https://github.com/huggingface/datasets/issues/5189 to always return the same type in load_dataset and support datasets without (explicit) splits. IIRC the main discussion point is deciding what to return when loading a dataset with multiple splits, but split is not specified. What would you expect as a return value in that scenario?
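Until the return type is unified, calling code can tolerate both shapes. The following is a minimal sketch, assuming only that the result returned without `split` behaves like a dict keyed by split name (as `DatasetDict` does) and that the result returned with a single `split` string does not. `pick_split` is a hypothetical helper, not part of the `datasets` API, and plain Python containers stand in for the real objects so the snippet runs without downloading anything.

```python
from collections.abc import Mapping


def pick_split(result, name="train"):
    """Return one split whether `result` is dict-like or already a single dataset.

    Hypothetical helper for illustration; not part of the `datasets` library.
    """
    if isinstance(result, Mapping):
        # No `split` was passed: the result is keyed by split name.
        return result[name]
    # A single `split` string was passed: the result is already one dataset.
    return result


# Stand-ins (assumptions) for the two shapes described in the issue.
full = {"train": ["row0", "row1", "row2", "row3"]}  # shape of call 2.
sample = ["row0"]                                    # shape of call 1.

print(pick_split(full))
print(pick_split(sample))
```

With the real library, the same helper would be applied to whatever `load_dataset(...)` returned, with or without `split`.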

phalexo commented 1 year ago

> load_dataset returns a DatasetDict object unless split is defined, in which case it returns a Dataset (or a list of datasets if split is a list). We've discussed dropping DatasetDict from the API in #5189 to always return the same type in load_dataset and support datasets without (explicit) splits. IIRC the main discussion point is deciding what to return when loading a dataset with multiple splits, but split is not specified. What would you expect as a return value in that scenario?

Wouldn't a dataset with multiple splits already have keys and their related data arrays?

Let's say the dataset has "train": trainset, "valid": validset and "test": testset.

So a dictionary can be returned, i.e.

{ "train": trainset, "valid": validset, "test": testset }

If a split is provided, e.g. split=['train[:80%]', 'valid[80%:90%]', 'test[90%:100%]'], the call would also return a dictionary with the same keys as above.

split='train[:10%]' should return the same value as split=['train[:10%]']

{ "train": trainset }
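The proposal above can be sketched as a thin wrapper that always produces a dict keyed by split name. This is only an illustration of the suggested behavior, not how `load_dataset` works today. `as_split_dict` and `base_name` are hypothetical names, and the sketch assumes the key is the text before the first `[` in each split spec. It also has to decide what to do when two slices share a base name (for example `['train[:50%]', 'train[50%:]']`), which the proposal does not cover, so it raises an error.

```python
from collections.abc import Mapping


def base_name(spec):
    """'train[:10%]' -> 'train' (assumed keying rule, not a library function)."""
    return spec.split("[", 1)[0]


def as_split_dict(result, split=None):
    """Normalize a load result into a plain dict keyed by split name.

    Hypothetical helper: `result` is whatever the loader returned and
    `split` is the same `split` argument that was passed to it.
    """
    if isinstance(result, Mapping):
        # Already keyed by split name (the no-`split` case).
        return dict(result)
    if split is None:
        raise ValueError("split is required when result is not dict-like")
    if isinstance(split, str):
        # A single spec yields a single dataset; wrap both in lists.
        specs, parts = [split], [result]
    else:
        # A list of specs yields a list of datasets in the same order.
        specs, parts = list(split), list(result)
    out = {}
    for spec, part in zip(specs, parts):
        key = base_name(spec)
        if key in out:
            # Two slices of the same split would collide under this rule.
            raise ValueError(f"duplicate split name: {key!r}")
        out[key] = part
    return out


# Stand-in lists (assumptions) play the role of datasets.
print(as_split_dict(["row0"], "train[:10%]"))
print(as_split_dict([["a"], ["b"]], ["train[:80%]", "test[90%:100%]"]))
```

Under this rule `split='train[:10%]'` and `split=['train[:10%]']` produce the same `{"train": ...}` result, which is the equivalence requested above.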