huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

datasets cannot handle nested json if features is given. #7116

Closed ljw20180420 closed 2 months ago

ljw20180420 commented 3 months ago

Describe the bug

I have a JSON file named temp.json.

{"ref1": "ABC", "ref2": "DEF", "cuts":[{"cut1": 3, "cut2": 5}]}

I want to load it.

import datasets

ds = datasets.load_dataset('json', data_files="./temp.json", features=datasets.Features({
    'ref1': datasets.Value('string'),
    'ref2': datasets.Value('string'),
    'cuts': datasets.Sequence({
        "cut1": datasets.Value("uint16"),
        "cut2": datasets.Value("uint16")
    })
}))

The above code does not work. However, I can load it without giving features.

ds = datasets.load_dataset('json', data_files="./temp.json")

Is it possible to load integers as uint16 to save some memory?
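
For context on the potential saving, here is a rough sketch. It assumes the integers would otherwise be stored as 64-bit values, which is an assumption about the inferred type and is not stated in this thread. It uses only the standard library to compare the fixed widths of the two integer types and to show the largest value that fits in a uint16.

```python
import struct

# Standard (platform-independent) sizes, selected by the "<" prefix.
UINT16_BYTES = struct.calcsize("<H")  # unsigned 16-bit integer
INT64_BYTES = struct.calcsize("<q")   # signed 64-bit integer (assumed default)

# Largest value representable as uint16; cut1/cut2 must not exceed it.
UINT16_MAX = 2**16 - 1

print(UINT16_BYTES, INT64_BYTES, UINT16_MAX)  # 2 8 65535
```

Under that assumption each stored integer would shrink from 8 bytes to 2, provided every value stays within 0 to 65535.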

Steps to reproduce the bug

As in the bug description.

Expected behavior

The data is loaded and the integers are stored as uint16.

Environment info


lhoestq commented 3 months ago

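Hi! Sequence has a surprising behavior for dictionaries (inherited from tensorflow-datasets): a Sequence wrapping a dictionary is converted into a dictionary of lists, which does not match a JSON list of objects. Use a regular list instead: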

ds = datasets.load_dataset('json', data_files="./temp.json", features=datasets.Features({
    'ref1': datasets.Value('string'),
    'ref2': datasets.Value('string'),
    'cuts': [{
        "cut1": datasets.Value("uint16"),
        "cut2": datasets.Value("uint16")
    }]
}))
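
To illustrate why the two feature declarations differ: the datasets documentation describes a Sequence wrapping a dictionary as being treated as a dictionary of lists, while temp.json stores a list of dictionaries. The plain-Python sketch below does not call the datasets library at all, and the helper name to_dict_of_lists is hypothetical. It only shows the two layouts for the row from the issue.

```python
import json

# The row from temp.json, as given in the issue.
row = json.loads('{"ref1": "ABC", "ref2": "DEF", "cuts":[{"cut1": 3, "cut2": 5}]}')


def to_dict_of_lists(records):
    """Hypothetical helper: turn a list of dicts into a dict of lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


# Layout found in the file; the feature [{"cut1": ..., "cut2": ...}] describes it.
list_of_dicts = row["cuts"]

# Layout that Sequence({"cut1": ..., "cut2": ...}) describes instead.
dict_of_lists = to_dict_of_lists(list_of_dicts)

print(list_of_dicts)  # [{'cut1': 3, 'cut2': 5}]
print(dict_of_lists)  # {'cut1': [3], 'cut2': [5]}
```

The file contains the first layout, so a feature written as a regular list of dictionaries matches it directly.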

ljw20180420 commented 2 months ago

Thank you!

ljw20180420 commented 2 months ago

It works.