Open HennerM opened 3 months ago
Hi! Datasets saved using `save_to_disk` should be loaded with `load_from_disk` ;)
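For reference, a minimal round trip looks something like this (a sketch; the path and data are just illustrations):

```python
from datasets import Dataset, load_from_disk

ds = Dataset.from_dict({"x": [1, 2, 3]})
ds.save_to_disk("path/to/saved_dataset")            # writes Arrow shards plus state.json
reloaded = load_from_disk("path/to/saved_dataset")  # the matching loader for this layout
```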
It is convenient to just pass in a path, whether to a local dataset or to one from the Hub, and use the same function to load it. Is it not possible to get this fix merged to allow this?
We can modify `save_to_disk` to write the dataset in a structure supported by the Hub in this case; it's kind of a legacy function anyway.
I saved a dataset with two splits to disk with `DatasetDict.save_to_disk`. The train split is bigger and ended up in 10 shards, whereas the test split resulted in only 1 shard. Now when trying to load the dataset with `load_dataset`, an error is raised saying that not all splits have the same data format.
This is unexpected, because both splits are saved as Arrow files.
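A minimal reproduction sketch (the path and the `num_shards` values are illustrative, chosen to force the 10/1 shard layout described above):

```python
from datasets import Dataset, DatasetDict, load_dataset

ds = DatasetDict({
    "train": Dataset.from_dict({"x": list(range(1000))}),
    "test": Dataset.from_dict({"x": list(range(100))}),
})
# Force the shard layout from the report: 10 train shards, 1 test shard.
ds.save_to_disk("saved_ds", num_shards={"train": 10, "test": 1})

# Pointing load_dataset at the same directory trips the format inference.
reloaded = load_dataset("saved_ds")
```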
I did some debugging and found that this happens because the list of `data_files` includes a `state.json` file. This means that for the train split I get 10 `.arrow` files and 1 `.json` file. Since `datasets` picks the module based on the most common extension, the train split is correctly inferred as "arrow". In the test split, however, there is 1 `.arrow` file and 1 `.json` file. The function description is not quite accurate for this case: in a tie, the extensions are actually compared in reverse-alphabetical order (and `"json" > "arrow"`), which leads to the module being wrongly inferred as "json", whereas it should be "arrow", matching the train split.
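A simplified model of that inference illustrates the tie-break (this is a sketch of the behavior, not the library's actual code):

```python
from collections import Counter

def infer_format(filenames):
    # Count extensions and pick the most common; max() breaks count ties
    # by comparing the extension strings, so "json" beats "arrow".
    counts = Counter(name.rsplit(".", 1)[-1] for name in filenames)
    return max(counts.items(), key=lambda item: (item[1], item[0]))[0]

train_files = [f"data-{i:05d}-of-00010.arrow" for i in range(10)] + ["state.json"]
test_files = ["data-00000-of-00001.arrow", "state.json"]

print(infer_format(train_files))  # "arrow" - 10 vs. 1
print(infer_format(test_files))   # "json"  - a 1-1 tie, resolved reverse-alphabetically
```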
I first thought about adding "state.json" to the list of excluded files for the inference: https://github.com/huggingface/datasets/blob/main/src/datasets/load.py#L513. However, from digging into the code, it looks like the right thing to do is to exclude it from the list of `data_files` to start with, because it is metadata rather than a data file.
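A sketch of that direction, filtering out `save_to_disk` metadata when resolving data files (the helper name and the exact set of filenames are assumptions on my part, not the actual patch):

```python
# Hypothetical filter: drop save_to_disk metadata before format inference,
# so state.json never competes with the actual .arrow data files.
SAVE_TO_DISK_METADATA = {"state.json", "dataset_info.json", "dataset_dict.json"}

def exclude_metadata_files(data_files):
    return [f for f in data_files if f.rsplit("/", 1)[-1] not in SAVE_TO_DISK_METADATA]
```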