huggingface / datasets


For data-only datasets, streaming and non-streaming don't behave the same #3738

Open · severo opened 2 years ago

severo commented 2 years ago

See https://huggingface.co/datasets/huggingface/transformers-metadata: it only contains two JSON files.

In streaming mode, the files are concatenated, and thus the rows might be dictionaries with different keys:

import datasets as ds
iterable_dataset = ds.load_dataset("huggingface/transformers-metadata", split="train", streaming=True)
rows = list(iterable_dataset.take(100))
rows[0]
# {'model_type': 'albert', 'pytorch': True, 'tensorflow': True, 'flax': True, 'processor': 'AutoTokenizer'}
rows[99]
# {'model_class': 'BartModel', 'pipeline_tag': 'feature-extraction', 'auto_class': 'AutoModel'}
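
As a rough illustration of why the mismatch goes unnoticed in streaming mode: for a script-less JSON dataset like this one, the streamed dataset typically has no resolved features to cast against, so each example is yielded as-is (a hedged observation, not part of the original report).

# Continuing the snippet above: with no dataset script or metadata,
# the streamed dataset's features are usually not resolved up front.
print(iterable_dataset.features)
# Typically None here; if the schema were known it would be a datasets.Features object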

In non-streaming mode, an exception is raised:

import datasets as ds
dataset = ds.load_dataset("huggingface/transformers-metadata", split="train")
ValueError: Couldn't cast
model_class: string
pipeline_tag: string
auto_class: string
to
{'model_type': Value(dtype='string', id=None), 'pytorch': Value(dtype='bool', id=None), 'tensorflow': Value(dtype='bool', id=None), 'flax': Value(dtype='bool', id=None), 'processor': Value(dtype='string', id=None)}
because column names don't match
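
A possible user-side workaround, sketched under the assumption that the two files are named frameworks.json and pipeline_tags.json: load each file on its own with the generic "json" builder, so that every resulting split keeps a single, consistent schema and no cross-file cast is attempted.

import datasets as ds

# The file names below are assumptions about the repository contents.
base = "https://huggingface.co/datasets/huggingface/transformers-metadata/resolve/main"
dataset = ds.load_dataset(
    "json",
    data_files={
        "frameworks": f"{base}/frameworks.json",
        "pipeline_tags": f"{base}/pipeline_tags.json",
    },
)
print(dataset["frameworks"][0])
# e.g. {'model_type': 'albert', 'pytorch': True, 'tensorflow': True, ...}
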
severo commented 2 years ago

Note that we might change the heuristic and create a different config per file, at least in that case.

albertvillanova commented 2 years ago

Hi @severo, thanks for reporting.

Yes, this happens because in non-streaming mode all the data is cast to a common schema so that it can be "concatenated" into a single dataset (hence the error), whereas in streaming mode no such cast is performed while examples are yielded one by one.

Maybe in streaming mode we should keep the schema (inferred from the first item) and throw an exception if a subsequent item does not conform to the inferred schema?
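
A minimal sketch of what such a check could look like, written as illustrative user-side code rather than the library's actual implementation: infer the column set from the first streamed example and raise as soon as a later example diverges.

def check_streamed_schema(examples):
    """Yield examples, raising if any example's columns differ from the first example's."""
    iterator = iter(examples)
    first = next(iterator)
    expected = set(first.keys())
    yield first
    for idx, example in enumerate(iterator, start=1):
        if set(example.keys()) != expected:
            raise ValueError(
                f"Example {idx} has columns {sorted(example.keys())}, "
                f"expected columns {sorted(expected)}"
            )
        yield example

# With the snippet from the first comment, list(check_streamed_schema(iterable_dataset.take(100)))
# would raise as soon as a row with a different set of keys appears.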

severo commented 2 years ago

Why do we want to concatenate the files? Is it the expected behavior for most datasets that lack a script and dataset info?

lhoestq commented 2 years ago

These files are two different dataset configurations since they don't share the same schema.

IMO the streaming mode should fail in this case, as @albertvillanova said.

There is one challenge though: inferring the schema from the first example is not robust enough in the general case, especially if some fields are nullable. I guess we can at least make sure that no new columns are added.

severo commented 2 years ago

OK. So, if we make streaming also fail, the dataset https://huggingface.co/datasets/huggingface/transformers-metadata will never be viewable (whether via streaming or by falling back to downloading the files), right?

lhoestq commented 2 years ago

Yes, until we have a way for the user to specify explicitly that those two files are different configurations.

We can maybe have some rule to detect this automatically, maybe by checking the first line of each file? That would mean that for datasets of 10,000+ files we would have to verify every single one of them just to know whether there is one or more configurations, so I'm not sure this is a good idea.
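
For illustration, a rough sketch of that heuristic, assuming JSON Lines files for simplicity; the main point is that it has to open every single file, which is exactly the scalability concern raised above.

import json
from collections import defaultdict

def group_files_by_schema(file_paths):
    """Group data files by the column names of their first record (JSON Lines assumed)."""
    groups = defaultdict(list)
    for path in file_paths:
        with open(path, encoding="utf-8") as f:
            first_record = json.loads(f.readline())
        groups[tuple(sorted(first_record))].append(path)
    # one candidate configuration per distinct set of column names
    return groups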

julien-c commented 2 years ago

I think requiring the user to specify that those two files are different configurations is, in that case, perfectly reasonable.

(Maybe at some point we could, however, detect this type of case and prompt the user to define a config mapping, etc.)

severo commented 2 years ago

OK, so, before closing the issue, what do you think should be done? Go with the earlier suggestion:

"Maybe in streaming mode we should keep the schema (inferred from the first item) and throw an exception if a subsequent item does not conform to the inferred schema?"

or do nothing?

lhoestq commented 2 years ago

We should at least raise an error if a new sample is missing some of the column names, or if it has extra columns. No need to check the types for now.

I'm in favor of raising an error, especially because we want to avoid silent issues as much as possible, i.e. cases where something goes wrong (schemas don't match, or some data is missing) and no errors or warnings are raised.

Consistency between streaming and non-streaming is also important.
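
A compact sketch of that rule, again illustrative rather than the library's actual code: compare each example's columns against the inferred schema and report missing and extra columns separately, without checking value types.

def validate_columns(example, expected_columns, index):
    """Raise if the example is missing expected columns or carries extra ones; value types are not checked."""
    columns = set(example.keys())
    missing = expected_columns - columns
    extra = columns - expected_columns
    if missing or extra:
        raise ValueError(
            f"Example {index} does not match the inferred schema: "
            f"missing columns {sorted(missing)}, extra columns {sorted(extra)}"
        )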