load_dataset with multiple jsonlines files interprets datastructure too early

Vipitis commented 3 months ago

Describe the bug

likely related to #6460

Using datasets.load_dataset("json", data_dir= ... ) with multiple .jsonl files raises an error if one of the files (maybe the first file?) contains a column that is entirely empty.

Steps to reproduce the bug

Real-world example: data is available in this PR-branch. Because my files are chunked by month, some months happen to contain only empty values ([]) for some columns. Otherwise the structure is the same in every file.

from datasets import load_dataset
ds = load_dataset("json", data_dir="./data/annotated/api")

you get a long error trace, where in the middle it says something like

TypeError: Couldn't cast array of type struct<id: int64, src: string, ctype: string, channel: int64, sampler: struct<filter: string, wrap: string, vflip: string, srgb: string, internal: string>, published: int64> to null

toy example: (on request)
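A possible toy setup, offered only as a sketch and not as the reporter's actual data: two JSON Lines files where the first (alphabetically) has an empty list in every row of one column and the second has a list of structs. The file names, column names, and the helpers `write_toy_files` and `try_load` are made up for illustration. Whether `try_load` fails in the same way may depend on the installed datasets and PyArrow versions.

```python
import json
from pathlib import Path


def write_toy_files(data_dir):
    """Write two JSON Lines chunks; only the second has non-empty `inputs`."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical monthly chunks: the first month has only [] in `inputs`.
    chunks = {
        "2024_01.jsonl": [
            {"id": "a", "inputs": []},
            {"id": "b", "inputs": []},
        ],
        "2024_02.jsonl": [
            {"id": "c", "inputs": [{"src": "tex.png", "channel": 0}]},
        ],
    }
    for name, rows in chunks.items():
        with open(data_dir / name, "w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row) + "\n")
    return sorted(data_dir.glob("*.jsonl"))


def try_load(data_dir):
    """Attempt the same call as in the report (requires `datasets`)."""
    from datasets import load_dataset

    return load_dataset("json", data_dir=str(data_dir))
```

Calling `write_toy_files("./toy")` followed by `try_load("./toy")` should exercise the same code path as the real data.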

Expected behavior

Some suggestions

1. Give a better error message to the user.
2. Consider all files before deciding on a data structure for a given column.
3. If you encounter a new structure and can't cast it to null, replace the null-hypothesis (maybe something for pyarrow).

as a workaround I have lazily implemented the following (essentially step 2)

import os
import jsonlines
import datasets

data_dir = "./data/annotated/api"

# Collect every .jsonl file in the directory, in a stable order.
api_files = sorted(
    os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith(".jsonl")
)

# Load every record into memory before building the dataset.
api_file_contents = []
for path in api_files:
    with jsonlines.open(path) as reader:
        api_file_contents.extend(reader)

ds = datasets.Dataset.from_list(api_file_contents)

This works fine for my use case, but is potentially slower and less memory-efficient for really large datasets (where this is unlikely to happen in the first place).

Environment info

datasets version: 2.20.0
Platform: Windows-10-10.0.19041-SP0
Python version: 3.9.4
huggingface_hub version: 0.23.4
PyArrow version: 16.1.0
Pandas version: 2.2.2
fsspec version: 2023.10.0

hvaara commented 3 months ago

I’ll take a look

hvaara commented 3 months ago

Possible definitions of done for this issue:

1. A fix so you can load your dataset specifically
2. A general fix for datasets similar to this in the datasets library

Option 1 is trivial. I think option 2 requires significant changes to the library.

Since you outlined something akin to option 2 in Expected behavior I'm assuming that's what you'd like to see done. Is that right?

In the meantime, here's a solution for option 1:

import datasets

data_dir = './data/annotated/api'

features = datasets.Features({'id': datasets.Value(dtype='string'),
 'name': datasets.Value(dtype='string'),
 'author': datasets.Value(dtype='string'),
 'description': datasets.Value(dtype='string'),
 'tags': datasets.Sequence(feature=datasets.Value(dtype='string'), length=-1),
 'likes': datasets.Value(dtype='int64'),
 'viewed': datasets.Value(dtype='int64'),
 'published': datasets.Value(dtype='int64'),
 'date': datasets.Value(dtype='string'),
 'time_retrieved': datasets.Value(dtype='string'),
 'image_code': datasets.Value(dtype='string'),
 'image_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'common_code': datasets.Value(dtype='string'),
 'sound_code': datasets.Value(dtype='string'),
 'sound_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'buffer_a_code': datasets.Value(dtype='string'),
 'buffer_a_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'buffer_b_code': datasets.Value(dtype='string'),
 'buffer_b_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'buffer_c_code': datasets.Value(dtype='string'),
 'buffer_c_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'buffer_d_code': datasets.Value(dtype='string'),
 'buffer_d_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'cube_a_code': datasets.Value(dtype='string'),
 'cube_a_inputs': [{'channel': datasets.Value(dtype='int64'),
   'ctype': datasets.Value(dtype='string'),
   'id': datasets.Value(dtype='int64'),
   'published': datasets.Value(dtype='int64'),
   'sampler': {'filter': datasets.Value(dtype='string'),
    'internal': datasets.Value(dtype='string'),
    'srgb': datasets.Value(dtype='string'),
    'vflip': datasets.Value(dtype='string'),
    'wrap': datasets.Value(dtype='string')},
   'src': datasets.Value(dtype='string')}],
 'thumbnail': datasets.Value(dtype='string'),
 'access': datasets.Value(dtype='string'),
 'license': datasets.Value(dtype='string'),
 'functions': datasets.Sequence(feature=datasets.Sequence(feature=datasets.Value(dtype='int64'), length=-1), length=-1),
 'test': datasets.Value(dtype='string')})

datasets.load_dataset('json', data_dir=data_dir, features=features)

albertvillanova commented 3 months ago

As pointed out by @hvaara, you can define explicit features so that you avoid the datasets library having to infer them (from the first few samples).

Note that the feature inference is done from the first few samples of JSON-Lines on purpose, so that the entire data does not need to be parsed twice (it would be inefficient for very large datasets).

Vipitis commented 3 months ago

I understand this. But can there be a solution that doesn't require the end user to write this schema by hand (in my case there are some fields that contain a nested structure)?

Maybe offer an option to infer the schema automatically before loading the dataset. Or perhaps trigger such a method when this error arises?

Is this "first few files" heuristic accessible via kwargs, perhaps? Maybe an error that says "Could not cast some structure into feature schema, consider increasing schema_files to a larger number or all".

There might be efficient implementations to solve this problem for larger datasets.
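One way such a pre-pass could look, sketched with the standard library only: stream every file once, describe each record as a nested type, and let a "null" placeholder (from None or an empty list) be replaced as soon as a more specific type shows up. This is not how the datasets library infers features; the names `infer`, `merge`, and `infer_schema` and the type labels are assumptions made for illustration.

```python
import json

NULL = "null"  # placeholder meaning "no type information seen yet"


def infer(value):
    """Describe a single JSON value as a nested type sketch."""
    if value is None:
        return NULL
    if isinstance(value, bool):  # check bool before int (bool is an int subclass)
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        item = NULL  # an empty list stays [NULL] until another row fills it in
        for element in value:
            item = merge(item, infer(element))
        return [item]
    if isinstance(value, dict):
        return {key: infer(val) for key, val in value.items()}
    raise TypeError(f"unsupported JSON value: {value!r}")


def merge(a, b):
    """Combine two type sketches; the null placeholder yields to anything."""
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, list) and isinstance(b, list):
        return [merge(a[0], b[0])]
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, val in b.items():
            merged[key] = merge(merged.get(key, NULL), val)
        return merged
    if a == b:
        return a
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"int64", "float64"}:
        return "float64"  # widen mixed numeric columns
    raise TypeError(f"conflicting types: {a!r} vs {b!r}")


def infer_schema(paths):
    """Stream all JSON Lines files once and merge the per-record sketches."""
    schema = {}
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    schema = merge(schema, infer(json.loads(line)))
    return schema
```

Only one record is held in memory at a time, so the cost is a second pass over the files rather than extra memory. Turning the resulting sketch into a `datasets.Features` object is left out here.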

hvaara commented 3 months ago

@Vipitis raised a good point on the HF Discord regarding the use of a dataset script to provide the schema during initialization. Using this approach requires setting trust_remote_code=True, which is not allowed in certain evaluation frameworks.

For cases where using a dataset script is acceptable, would it be helpful to add functionality to the library (not necessarily in load_dataset) that can automatically discover the feature definitions and output them, so you don't have to manually define them?

Alternatively, for situations where features need to be known at load-time without using a dataset script, another option could be loading the dataset schema from a file format that doesn't require trust_remote_code=True.
