huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.25k stars 2.69k forks

DatasetDict save load Failing test in 1.6 not in 1.5 #2267

Open timothyjlaurent opened 3 years ago

timothyjlaurent commented 3 years ago

Describe the bug

We have a test that saves a DatasetDict to disk and then loads it from disk. In 1.6 there is an incompatibility in the schema.

Downgrading to a version below 1.6 (for example 1.5) fixes the problem.

Steps to reproduce the bug


from datasets import DatasetDict

# ds_dict is a DatasetDict loaded from jsonl

path = '/test/foo'

ds_dict.save_to_disk(path)

ds_from_disk = DatasetDict.load_from_disk(path)  # <-- this is where I see the error on 1.6

Expected results

Upgrading to 1.6 shouldn't break that test. We should be able to serialize to and from disk.

Actual results

        # Infer features if None
        inferred_features = Features.from_arrow_schema(arrow_table.schema)
        if self.info.features is None:
            self.info.features = inferred_features

        # Infer fingerprint if None

        if self._fingerprint is None:
            self._fingerprint = generate_fingerprint(self)

        # Sanity checks

        assert self.features is not None, "Features can't be None in a Dataset object"
        assert self._fingerprint is not None, "Fingerprint can't be None in a Dataset object"
        if self.info.features.type != inferred_features.type:
>           raise ValueError(
                "External features info don't match the dataset:\nGot\n{}\nwith type\n{}\n\nbut expected something like\n{}\nwith type\n{}".format(
                    self.info.features, self.info.features.type, inferred_features, inferred_features.type
                )
            )
E           ValueError: External features info don't match the dataset:
E           Got
E           {'_input_hash': Value(dtype='int64', id=None), '_task_hash': Value(dtype='int64', id=None), '_view_id': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None), 'encoding__ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'encoding__offsets': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), 'encoding__overflowing': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None), 'encoding__tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'encoding__words': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'relations': [{'child': Value(dtype='int64', id=None), 'child_span': {'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None)}, 'color': Value(dtype='string', id=None), 'head': Value(dtype='int64', id=None), 'head_span': {'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None)}, 'label': Value(dtype='string', id=None)}], 'spans': [{'end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'token_end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'type': Value(dtype='string', id=None)}], 'text': Value(dtype='string', id=None), 'tokens': [{'disabled': Value(dtype='bool', id=None), 'end': Value(dtype='int64', id=None), 'id': Value(dtype='int64', id=None), 'start': Value(dtype='int64', id=None), 'text': 
Value(dtype='string', id=None), 'ws': Value(dtype='bool', id=None)}]}
E           with type
E           struct<_input_hash: int64, _task_hash: int64, _view_id: string, answer: string, encoding__ids: list<item: int64>, encoding__offsets: list<item: list<item: int64>>, encoding__overflowing: list<item: null>, encoding__tokens: list<item: string>, encoding__words: list<item: int64>, ner_ids: list<item: int64>, ner_labels: list<item: string>, relations: list<item: struct<child: int64, child_span: struct<end: int64, label: string, start: int64, token_end: int64, token_start: int64>, color: string, head: int64, head_span: struct<end: int64, label: string, start: int64, token_end: int64, token_start: int64>, label: string>>, spans: list<item: struct<end: int64, label: string, start: int64, text: string, token_end: int64, token_start: int64, type: string>>, text: string, tokens: list<item: struct<disabled: bool, end: int64, id: int64, start: int64, text: string, ws: bool>>>
E           
E           but expected something like
E           {'_input_hash': Value(dtype='int64', id=None), '_task_hash': Value(dtype='int64', id=None), '_view_id': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None), 'encoding__ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'encoding__offsets': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), 'encoding__overflowing': Sequence(feature=Value(dtype='null', id=None), length=-1, id=None), 'encoding__tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'encoding__words': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'ner_labels': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'relations': [{'head': Value(dtype='int64', id=None), 'child': Value(dtype='int64', id=None), 'head_span': {'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None)}, 'child_span': {'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'label': Value(dtype='string', id=None)}, 'color': Value(dtype='string', id=None), 'label': Value(dtype='string', id=None)}], 'spans': [{'text': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'token_start': Value(dtype='int64', id=None), 'token_end': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'type': Value(dtype='string', id=None), 'label': Value(dtype='string', id=None)}], 'text': Value(dtype='string', id=None), 'tokens': [{'text': Value(dtype='string', id=None), 'start': Value(dtype='int64', id=None), 'end': Value(dtype='int64', id=None), 'id': Value(dtype='int64', id=None), 'ws': 
Value(dtype='bool', id=None), 'disabled': Value(dtype='bool', id=None)}]}
E           with type
E           struct<_input_hash: int64, _task_hash: int64, _view_id: string, answer: string, encoding__ids: list<item: int64>, encoding__offsets: list<item: list<item: int64>>, encoding__overflowing: list<item: null>, encoding__tokens: list<item: string>, encoding__words: list<item: int64>, ner_ids: list<item: int64>, ner_labels: list<item: string>, relations: list<item: struct<head: int64, child: int64, head_span: struct<start: int64, end: int64, token_start: int64, token_end: int64, label: string>, child_span: struct<start: int64, end: int64, token_start: int64, token_end: int64, label: string>, color: string, label: string>>, spans: list<item: struct<text: string, start: int64, token_start: int64, token_end: int64, end: int64, type: string, label: string>>, text: string, tokens: list<item: struct<text: string, start: int64, end: int64, id: int64, ws: bool, disabled: bool>>>

../../../../../.virtualenvs/tf_ner_rel_lib/lib/python3.8/site-packages/datasets/arrow_dataset.py:274: ValueError

Versions

lhoestq commented 3 years ago

Thanks for reporting! We're looking into it.

lhoestq commented 3 years ago

I'm not able to reproduce this. Could you provide code that creates a DatasetDict that has this issue when saved and reloaded?

maxidl commented 3 years ago

Hi, I just ran into a similar error. Here is the minimal code to reproduce:

from datasets import load_dataset, DatasetDict
ds = load_dataset('super_glue', 'multirc')

ds.save_to_disk('tempds')

ds = DatasetDict.load_from_disk('tempds')
Reusing dataset super_glue (/home/idahl/.cache/huggingface/datasets/super_glue/multirc/1.0.2/2fb163bca9085c1deb906aff20f00c242227ff704a4e8c9cfdfe820be3abfc83)
Traceback (most recent call last):
  File "/home/idahl/eval-util-expl/multirc/tmp.py", line 7, in <module>
    ds = DatasetDict.load_from_disk('tempds')
  File "/home/idahl/miniconda3/envs/eval-util-expl/lib/python3.9/site-packages/datasets/dataset_dict.py", line 710, in load_from_disk
    dataset_dict[k] = Dataset.load_from_disk(dataset_dict_split_path, fs, keep_in_memory=keep_in_memory)
  File "/home/idahl/miniconda3/envs/eval-util-expl/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 687, in load_from_disk
    return Dataset(
  File "/home/idahl/miniconda3/envs/eval-util-expl/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 274, in __init__
    raise ValueError(
ValueError: External features info don't match the dataset:
Got
{'answer': Value(dtype='string', id=None), 'idx': {'answer': Value(dtype='int32', id=None), 'paragraph': Value(dtype='int32', id=None), 'question': Value(dtype='int32', id=None)}, 'label': ClassLabel(num_classes=2, names=['False', 'True'], names_file=None, id=None), 'paragraph': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None)}
with type
struct<answer: string, idx: struct<answer: int32, paragraph: int32, question: int32>, label: int64, paragraph: string, question: string>

but expected something like
{'answer': Value(dtype='string', id=None), 'idx': {'paragraph': Value(dtype='int32', id=None), 'question': Value(dtype='int32', id=None), 'answer': Value(dtype='int32', id=None)}, 'label': Value(dtype='int64', id=None), 'paragraph': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None)}
with type
struct<answer: string, idx: struct<paragraph: int32, question: int32, answer: int32>, label: int64, paragraph: string, question: string>

The non-matching part seems to be 'label': ClassLabel(num_classes=2, names=['False', 'True'], names_file=None, id=None) vs. 'label': Value(dtype='int64', id=None).

The field order inside the 'idx' struct also differs between the two types, which might be what makes the features.type != inferred_features.type condition true and raises this ValueError.
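To illustrate why field order alone could trip that check, here is a minimal pure-Python sketch. It is only a model of an order-sensitive type comparison, not the pyarrow or datasets API: the tuple representation and both helper functions are hypothetical. The field names and dtypes are copied from the 'idx' struct in the traceback above.

```python
# Pure-Python model of an order-sensitive struct type comparison.
# NOT the pyarrow/datasets API: the representation and helpers are hypothetical.

# Field order of 'idx' on the "Got" side of the error message.
stored_idx = (("answer", "int32"), ("paragraph", "int32"), ("question", "int32"))

# Field order of 'idx' on the "expected something like" side.
inferred_idx = (("paragraph", "int32"), ("question", "int32"), ("answer", "int32"))


def same_type_strict(a, b):
    """Order-sensitive comparison: fields must match position by position."""
    return a == b


def same_type_ignoring_order(a, b):
    """Order-insensitive comparison: same (name, dtype) pairs in any order."""
    return sorted(a) == sorted(b)


strict_match = same_type_strict(stored_idx, inferred_idx)
loose_match = same_type_ignoring_order(stored_idx, inferred_idx)
print(strict_match, loose_match)
```

Under this model the two structs hold the same fields with the same dtypes, yet the strict comparison fails while the order-insensitive one succeeds, which would be consistent with the error showing up only for datasets with nested struct columns.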

I am using datasets version 1.6.2.

Edit: can confirm, this works without error in version 1.5.0

maxidl commented 3 years ago

My current workaround is to remove the idx feature:


from datasets import load_dataset, DatasetDict
ds = load_dataset('super_glue', 'multirc')
ds = ds.remove_columns('idx')

ds.save_to_disk('tempds')

ds = DatasetDict.load_from_disk('tempds')

works.

lhoestq commented 3 years ago

It looks like this issue comes from the order of the fields in the 'idx' struct, which differs for some reason. I'm looking into it. Note that as a workaround you can also flatten the nested features with ds = ds.flatten().
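For readers unsure what flattening does to a nested column, here is a rough pure-Python sketch on a single row. It is not the library's implementation: flatten_record is a hypothetical helper, the "." separator in the resulting column names is an assumption, and the row values are made up for illustration.

```python
# Pure-Python sketch of flattening one nested row.
# NOT the datasets implementation: flatten_record is a hypothetical helper
# and the "." separator for the new column names is an assumption.

def flatten_record(record, parent_key="", sep="."):
    """Turn nested dicts into top-level keys such as 'idx.answer'."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the struct and lift its fields to the top level.
            flat.update(flatten_record(value, name, sep))
        else:
            flat[name] = value
    return flat


# Made-up row shaped like the multirc example (values are illustrative only).
row = {"answer": "yes", "idx": {"paragraph": 0, "question": 1, "answer": 2}, "label": 1}
flat_row = flatten_record(row)
print(flat_row)
```

After flattening, no struct column remains, so there is no nested field order left to compare. That is presumably why the workaround avoids the error.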

lhoestq commented 3 years ago

I just pushed a fix on master. We'll do a new release soon!

Thanks for reporting.