timothyjlaurent opened this issue 3 years ago
Thanks for reporting! We're looking into it.
I'm not able to reproduce this. Could you provide code that creates a DatasetDict that has this issue when saving and reloading?
Hi, I just ran into a similar error. Here is the minimal code to reproduce:
from datasets import load_dataset, DatasetDict
ds = load_dataset('super_glue', 'multirc')
ds.save_to_disk('tempds')
ds = DatasetDict.load_from_disk('tempds')
Reusing dataset super_glue (/home/idahl/.cache/huggingface/datasets/super_glue/multirc/1.0.2/2fb163bca9085c1deb906aff20f00c242227ff704a4e8c9cfdfe820be3abfc83)
Traceback (most recent call last):
File "/home/idahl/eval-util-expl/multirc/tmp.py", line 7, in <module>
ds = DatasetDict.load_from_disk('tempds')
File "/home/idahl/miniconda3/envs/eval-util-expl/lib/python3.9/site-packages/datasets/dataset_dict.py", line 710, in load_from_disk
dataset_dict[k] = Dataset.load_from_disk(dataset_dict_split_path, fs, keep_in_memory=keep_in_memory)
File "/home/idahl/miniconda3/envs/eval-util-expl/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 687, in load_from_disk
return Dataset(
File "/home/idahl/miniconda3/envs/eval-util-expl/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 274, in __init__
raise ValueError(
ValueError: External features info don't match the dataset:
Got
{'answer': Value(dtype='string', id=None), 'idx': {'answer': Value(dtype='int32', id=None), 'paragraph': Value(dtype='int32', id=None), 'question': Value(dtype='int32', id=None)}, 'label': ClassLabel(num_classes=2, names=['False', 'True'], names_file=None, id=None), 'paragraph': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None)}
with type
struct<answer: string, idx: struct<answer: int32, paragraph: int32, question: int32>, label: int64, paragraph: string, question: string>
but expected something like
{'answer': Value(dtype='string', id=None), 'idx': {'paragraph': Value(dtype='int32', id=None), 'question': Value(dtype='int32', id=None), 'answer': Value(dtype='int32', id=None)}, 'label': Value(dtype='int64', id=None), 'paragraph': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None)}
with type
struct<answer: string, idx: struct<paragraph: int32, question: int32, answer: int32>, label: int64, paragraph: string, question: string>
The non-matching part seems to be
'label': ClassLabel(num_classes=2, names=['False', 'True'], names_file=None, id=None),
vs
'label': Value(dtype='int64', id=None),
as well as the order of the fields inside the struct<...> type being different, which might make the features.type != inferred_features.type condition evaluate to true and raise this ValueError.
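For what it's worth, the ClassLabel side of the mismatch is easy to see from the in-memory features; here is a small diagnostic sketch (the reload itself raises, so this only inspects the original dataset):
from datasets import load_dataset
ds = load_dataset('super_glue', 'multirc')
label = ds['train'].features['label']
print(label)        # ClassLabel(num_classes=2, names=['False', 'True'], ...)
print(label.dtype)  # 'int64' -- the underlying Arrow storage type, which is what
                    # gets inferred on reload, hence Value(dtype='int64') instead of ClassLabel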
I am using datasets version 1.6.2.
Edit: can confirm, this works without error in version 1.5.0
My current workaround is to remove the idx feature:
from datasets import load_dataset, DatasetDict
ds = load_dataset('super_glue', 'multirc')
ds = ds.remove_columns('idx')  # drop the nested 'idx' struct before saving
ds.save_to_disk('tempds')
ds = DatasetDict.load_from_disk('tempds')
This works without error.
It looks like this issue comes from the fields of the 'idx' struct being stored in a different order for some reason.
I'm looking into it. Note that as a workaround you can also flatten the nested features with ds = ds.flatten().
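Spelled out against the reproduction above, that flatten workaround would look roughly like this (a sketch; flatten promotes the fields of the nested 'idx' struct to top-level columns):
from datasets import load_dataset, DatasetDict
ds = load_dataset('super_glue', 'multirc')
ds = ds.flatten()  # 'idx' becomes 'idx.paragraph', 'idx.question', 'idx.answer'
ds.save_to_disk('tempds')
ds = DatasetDict.load_from_disk('tempds')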
I just pushed a fix on master. We'll do a new release soon!
Thanks for reporting!
Describe the bug
We have a test that saves a DatasetDict to disk and then loads it back from disk. With datasets 1.6 the schema no longer matches on reload. Downgrading to a version below 1.6 fixes the problem.
Steps to reproduce the bug
Expected results
Upgrading to 1.6 shouldn't break that test. We should be able to serialize to and from disk.
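Concretely, the kind of round trip this test expects to keep working looks like the sketch below (dataset and paths are illustrative, borrowed from the reproduction earlier in the thread):
from datasets import load_dataset, DatasetDict
ds = load_dataset('super_glue', 'multirc')
ds.save_to_disk('tempds')
reloaded = DatasetDict.load_from_disk('tempds')
assert reloaded['train'].features == ds['train'].features  # the schema should survive the round trip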
Actual results
Versions