huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Fix data file module inference #7132

Open HennerM opened 3 months ago

HennerM commented 3 months ago

I saved a dataset with two splits to disk with DatasetDict.save_to_disk. The train split is bigger and ended up in 10 shards, whereas the test split resulted in only 1 shard.

Now, when trying to load the dataset with load_dataset, an error is raised saying that not all splits have the same data file format:

ValueError: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}

This is not expected because both splits are saved as arrow files.
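A minimal repro sketch (the path and shard counts are illustrative; num_shards is forced so the splits match the layout described above):

```python
from datasets import Dataset, DatasetDict, load_dataset

# Build a toy DatasetDict and force 10 shards for train and 1 for test.
ds = DatasetDict({
    "train": Dataset.from_dict({"x": list(range(1000))}),
    "test": Dataset.from_dict({"x": list(range(10))}),
})
ds.save_to_disk("my_dataset", num_shards={"train": 10, "test": 1})

# Each split directory now holds .arrow shards plus a state.json file;
# loading the directory with load_dataset raises the ValueError above.
reloaded = load_dataset("my_dataset")
```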

I did some debugging and found that this is the case because the list of data_files includes a state.json file.

Now this means that for the train split I get 10 ".arrow" files and 1 ".json" file. Since datasets picks the module based on the most common extension, this is correctly inferred as "arrow". In the test split, however, there is 1 ".arrow" and 1 ".json" file. Given the function description:

It picks the module based on the most common file extension. In case of a draw ".parquet" is the favorite, and then alphabetical order.

This is not quite accurate though, because in a tie the extensions are actually sorted in reverse-alphabetical order:

for (ext, _), _ in sorted(extensions_counter.items(), key=sort_key, reverse=True):

This leads to the module being wrongly inferred as "json", whereas it should be "arrow", matching the train split.
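Here is a small self-contained sketch of that tie-break (simplified: the real sort_key in datasets also favors ".parquet" and special-cases metadata files):

```python
from collections import Counter

# 1-vs-1 tie between the two extensions, as in the test split above.
extensions_counter = Counter({".arrow": 1, ".json": 1})

def sort_key(ext_count):
    ext, count = ext_count
    # Sort by count, then favor ".parquet", then by the extension string;
    # with reverse=True the string comparison is reverse-alphabetical.
    return (count, ext == ".parquet", ext)

winner = sorted(extensions_counter.items(), key=sort_key, reverse=True)[0][0]
print(winner)  # ".json" — it beats ".arrow" in reverse-alphabetical order
```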

I first thought about adding "state.json" to the list of files excluded from the inference: https://github.com/huggingface/datasets/blob/main/src/datasets/load.py#L513. However, from digging into the code, it looks like the right thing to do is to exclude it from the list of data_files to start with, because it is a metadata file rather than a data file.
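Something along these lines (a hypothetical sketch; the filename set and helper name are illustrative, not the actual library code):

```python
import os

# Hypothetical fix: drop known metadata files from data_files before any
# module inference runs, so they never reach the extension counter.
METADATA_FILENAMES = {"state.json", "dataset_info.json"}

def exclude_metadata_files(data_files):
    return [f for f in data_files if os.path.basename(f) not in METADATA_FILENAMES]
```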

lhoestq commented 2 months ago

Hi ! datasets saved using save_to_disk should be loaded with load_from_disk ;)
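For example, for the layout above:

```python
from datasets import load_from_disk

# Restores every split written by DatasetDict.save_to_disk, state.json included.
ds = load_from_disk("my_dataset")
```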

HennerM commented 2 months ago

It is convenient to just pass in a path to a local dataset, or one from the Hub, and use the same function to load it. Would it be possible to get this fix merged to allow this?

lhoestq commented 2 months ago

We can modify save_to_disk to write the dataset in a structure supported by the Hub in this case; it's kind of a legacy function anyway