Open phalexo opened 1 year ago
load_dataset returns a DatasetDict object unless split is defined, in which case it returns a Dataset (or a list of datasets if split is a list). We've discussed dropping DatasetDict from the API in https://github.com/huggingface/datasets/issues/5189 to always return the same type in load_dataset and support datasets without (explicit) splits. IIRC the main discussion point is deciding what to return when loading a dataset with multiple splits, but split is not specified. What would you expect as a return value in that scenario?
Wouldn't a dataset with multiple splits already have keys and their related data arrays?
Let's say the dataset has "train": trainset, "valid": validset and "test": testset.
So a dictionary can be returned, i.e.
{ "train": trainset, "valid": validset, "test": testset }
If a split is provided, split=['train[:80%]', 'valid[80%:90%]', 'test[90%:100%]'] would also return a dictionary with the same keys as above.
split='train[:10%]' should return the same value as split=['train[:10%]'], i.e.
{ "train": trainset }
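
The behaviour proposed above can be sketched as a small wrapper on the caller's side. This is only an illustration of the idea, not part of the datasets API: the helper names base_split_name and normalize_splits are hypothetical, and the sketch assumes that the no-split result is a dict-like mapping (DatasetDict is a dict subclass), that a string split yields a single dataset, and that a list of splits yields a list in the same order. Combined specs such as 'train+test' are not handled.

```python
import re
from typing import Any, Dict, Optional, Sequence, Union


def base_split_name(spec: str) -> str:
    """Strip a trailing slice from a split spec: 'train[:10%]' -> 'train'."""
    return re.sub(r"\[.*\]$", "", spec.strip())


def normalize_splits(
    result: Any,
    split: Optional[Union[str, Sequence[str]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: always return {split_name: dataset}.

    `result` is whatever the loader returned for the given `split` argument.
    """
    if split is None:
        # No split requested: assume a mapping of split name -> dataset.
        return dict(result)
    if isinstance(split, str):
        # A single spec yields a single dataset; wrap it under its base name.
        return {base_split_name(split): result}
    names = [base_split_name(s) for s in split]
    if len(names) != len(result):
        raise ValueError("number of split specs and datasets differ")
    if len(set(names)) != len(names):
        # e.g. ['train[:80%]', 'train[80%:]'] would collide on 'train'.
        raise ValueError("several specs share the same base split name")
    return dict(zip(names, result))
```

With the real library, the intended use would be something like normalize_splits(load_dataset(name, split=s), s), so that downstream code can index by "train" regardless of how split was passed. Raising on colliding base names is one possible design choice; the thread does not settle what should happen in that case.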
Describe the bug
The only difference I would expect these calls to have is the size of the dataset.
But while 2. returns a dictionary with a "train" key in it, 1. returns a dataset without any top-level "train" key.
Both calls are to be used within exactly the same context. They should return identically structured datasets of different size.
Steps to reproduce the bug
See above.
Expected behavior
Expect both calls to return identically structured datasets with different numbers of elements, i.e. call 1. should have 1% of the data of call 2.
Environment info
Ubuntu 20.04 gcc 9.x.x.
It is really irrelevant.