Open Vipitis opened 3 months ago
I’ll take a look
Possible definitions of done for this issue:
datasets library
Option 1 is trivial. I think option 2 requires significant changes to the library.
Since you outlined something akin to option 2 in Expected behavior, I'm assuming that's what you'd like to see done. Is that right?
In the meantime, here's a solution for option 1:
import datasets
data_dir = './data/annotated/api'
features = datasets.Features({'id': datasets.Value(dtype='string'),
'name': datasets.Value(dtype='string'),
'author': datasets.Value(dtype='string'),
'description': datasets.Value(dtype='string'),
'tags': datasets.Sequence(feature=datasets.Value(dtype='string'), length=-1),
'likes': datasets.Value(dtype='int64'),
'viewed': datasets.Value(dtype='int64'),
'published': datasets.Value(dtype='int64'),
'date': datasets.Value(dtype='string'),
'time_retrieved': datasets.Value(dtype='string'),
'image_code': datasets.Value(dtype='string'),
'image_inputs': [{'channel': datasets.Value(dtype='int64'),
'ctype': datasets.Value(dtype='string'),
'id': datasets.Value(dtype='int64'),
'published': datasets.Value(dtype='int64'),
'sampler': {'filter': datasets.Value(dtype='string'),
'internal': datasets.Value(dtype='string'),
'srgb': datasets.Value(dtype='string'),
'vflip': datasets.Value(dtype='string'),
'wrap': datasets.Value(dtype='string')},
'src': datasets.Value(dtype='string')}],
'common_code': datasets.Value(dtype='string'),
'sound_code': datasets.Value(dtype='string'),
'sound_inputs': [{'channel': datasets.Value(dtype='int64'),
'ctype': datasets.Value(dtype='string'),
'id': datasets.Value(dtype='int64'),
'published': datasets.Value(dtype='int64'),
'sampler': {'filter': datasets.Value(dtype='string'),
'internal': datasets.Value(dtype='string'),
'srgb': datasets.Value(dtype='string'),
'vflip': datasets.Value(dtype='string'),
'wrap': datasets.Value(dtype='string')},
'src': datasets.Value(dtype='string')}],
'buffer_a_code': datasets.Value(dtype='string'),
'buffer_a_inputs': [{'channel': datasets.Value(dtype='int64'),
'ctype': datasets.Value(dtype='string'),
'id': datasets.Value(dtype='int64'),
'published': datasets.Value(dtype='int64'),
'sampler': {'filter': datasets.Value(dtype='string'),
'internal': datasets.Value(dtype='string'),
'srgb': datasets.Value(dtype='string'),
'vflip': datasets.Value(dtype='string'),
'wrap': datasets.Value(dtype='string')},
'src': datasets.Value(dtype='string')}],
'buffer_b_code': datasets.Value(dtype='string'),
'buffer_b_inputs': [{'channel': datasets.Value(dtype='int64'),
'ctype': datasets.Value(dtype='string'),
'id': datasets.Value(dtype='int64'),
'published': datasets.Value(dtype='int64'),
'sampler': {'filter': datasets.Value(dtype='string'),
'internal': datasets.Value(dtype='string'),
'srgb': datasets.Value(dtype='string'),
'vflip': datasets.Value(dtype='string'),
'wrap': datasets.Value(dtype='string')},
'src': datasets.Value(dtype='string')}],
'buffer_c_code': datasets.Value(dtype='string'),
'buffer_c_inputs': [{'channel': datasets.Value(dtype='int64'),
'ctype': datasets.Value(dtype='string'),
'id': datasets.Value(dtype='int64'),
'published': datasets.Value(dtype='int64'),
'sampler': {'filter': datasets.Value(dtype='string'),
'internal': datasets.Value(dtype='string'),
'srgb': datasets.Value(dtype='string'),
'vflip': datasets.Value(dtype='string'),
'wrap': datasets.Value(dtype='string')},
'src': datasets.Value(dtype='string')}],
'buffer_d_code': datasets.Value(dtype='string'),
'buffer_d_inputs': [{'channel': datasets.Value(dtype='int64'),
'ctype': datasets.Value(dtype='string'),
'id': datasets.Value(dtype='int64'),
'published': datasets.Value(dtype='int64'),
'sampler': {'filter': datasets.Value(dtype='string'),
'internal': datasets.Value(dtype='string'),
'srgb': datasets.Value(dtype='string'),
'vflip': datasets.Value(dtype='string'),
'wrap': datasets.Value(dtype='string')},
'src': datasets.Value(dtype='string')}],
'cube_a_code': datasets.Value(dtype='string'),
'cube_a_inputs': [{'channel': datasets.Value(dtype='int64'),
'ctype': datasets.Value(dtype='string'),
'id': datasets.Value(dtype='int64'),
'published': datasets.Value(dtype='int64'),
'sampler': {'filter': datasets.Value(dtype='string'),
'internal': datasets.Value(dtype='string'),
'srgb': datasets.Value(dtype='string'),
'vflip': datasets.Value(dtype='string'),
'wrap': datasets.Value(dtype='string')},
'src': datasets.Value(dtype='string')}],
'thumbnail': datasets.Value(dtype='string'),
'access': datasets.Value(dtype='string'),
'license': datasets.Value(dtype='string'),
'functions': datasets.Sequence(feature=datasets.Sequence(feature=datasets.Value(dtype='int64'), length=-1), length=-1),
'test': datasets.Value(dtype='string')})
datasets.load_dataset('json', data_dir=data_dir, features=features)
As pointed out by @hvaara, you can define explicit features so that you avoid the datasets library having to infer them (from the first few samples).
Note that the feature inference is done from the first few samples of JSON-Lines on purpose, so that the entire data does not need to be parsed twice (it would be inefficient for very large datasets).
I understand this. But can there be a solution that doesn't require the end user to write this schema by hand (in my case there are some fields that contain a nested structure)?
Maybe offer an option to infer the schema automatically before loading the dataset. Or perhaps trigger such a method when this error arises?
Is this "first few files" heuristic accessible via kwargs, perhaps? Maybe an error that says "Could not cast some structure into feature schema, consider increasing schema_files to a large number or all".
There might be efficient implementations to solve this problem for larger datasets.
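To make the "infer the schema before loading" idea concrete, here is a minimal sketch that scans every JSON Lines file once and merges the observed types, so that an empty list in one file is refined by a populated list in another. It uses only the standard library; the function names (`infer_type`, `merge_types`, `infer_schema`) and the small type notation are hypothetical and are not part of the datasets library. Translating the result into datasets.Features would be a separate step.

```python
import json
from pathlib import Path


def merge_types(a, b):
    """Merge two inferred types; "null" (no evidence yet) yields to anything."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"int64", "float64"}:
        return "float64"  # widen mixed ints and floats
    if isinstance(a, dict) and isinstance(b, dict):
        if "list" in a and "list" in b:
            return {"list": merge_types(a["list"], b["list"])}
        if "struct" in a and "struct" in b:
            keys = list(a["struct"]) + [k for k in b["struct"] if k not in a["struct"]]
            return {
                "struct": {
                    k: merge_types(a["struct"].get(k, "null"), b["struct"].get(k, "null"))
                    for k in keys
                }
            }
    raise ValueError(f"conflicting types: {a!r} vs {b!r}")


def infer_type(value):
    """Describe the type of one parsed JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        item = "null"  # an empty list carries no information about its items
        for element in value:
            item = merge_types(item, infer_type(element))
        return {"list": item}
    if isinstance(value, dict):
        return {"struct": {k: infer_type(v) for k, v in value.items()}}
    raise TypeError(f"unsupported JSON value: {value!r}")


def infer_schema(data_dir, pattern="*.jsonl"):
    """Scan all matching files and return one merged column -> type mapping."""
    schema = {"struct": {}}
    for path in sorted(Path(data_dir).glob(pattern)):
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    schema = merge_types(schema, infer_type(json.loads(line)))
    return schema["struct"]
```

This reads the data one extra time, which is exactly the cost the first-few-samples heuristic avoids, so it only makes sense as an opt-in step or as a fallback after a cast error.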
@Vipitis raised a good point on the HF Discord regarding the use of a dataset script to provide the schema during initialization. Using this approach requires setting trust_remote_code=True, which is not allowed in certain evaluation frameworks.
For cases where using a dataset script is acceptable, would it be helpful to add functionality to the library (not necessarily in load_dataset) that can automatically discover the feature definitions and output them, so you don't have to manually define them?
Alternatively, for situations where features need to be known at load-time without using a dataset script, another option could be loading the dataset schema from a file format that doesn't require trust_remote_code=True.
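One way this could look, as a rough sketch rather than an existing feature: keep the schema in a plain JSON file next to the data and rebuild the features from it at load time. The helper names (`save_schema`, `load_schema`, `load_with_schema`) and the file name are hypothetical. The dictionary layout is assumed to match what `Features.to_dict()` emits, which is what `datasets.Features.from_dict` expects; it is worth checking against the installed version.

```python
import json


def save_schema(schema: dict, path: str) -> None:
    """Write a features dictionary (e.g. from Features.to_dict()) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(schema, f, indent=2)


def load_schema(path: str) -> dict:
    """Read the features dictionary back; plain data, no code execution involved."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_with_schema(data_dir: str, schema_path: str):
    """Load JSON Lines data with explicit features taken from a schema file."""
    import datasets  # imported lazily so the helpers above work without it

    features = datasets.Features.from_dict(load_schema(schema_path))
    return datasets.load_dataset("json", data_dir=data_dir, features=features)
```

A schema file for two of the columns above might contain `{"id": {"dtype": "string", "_type": "Value"}, "tags": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"}}`. Since it is only data, loading it does not involve trust_remote_code.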
Describe the bug
Likely related to #6460.
Using datasets.load_dataset("json", data_dir= ... ) with multiple .jsonl files will error if one of the files (maybe the first file?) contains a full column of empty data.
Steps to reproduce the bug
Real-world example: data is available in this PR-branch. Because my files are chunked by month, some months contain only empty data for some columns, just by chance; these are []. Otherwise the structure is the same throughout.
You get a long error trace, where in the middle it says something like
toy example: (on request)
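A toy setup that mirrors the situation described might look like the sketch below. It only writes the files; the file names, column names, and the helper `write_toy_chunks` are made up for illustration, and whether the resulting directory triggers the same error depends on the datasets version.

```python
import json
from pathlib import Path


def write_toy_chunks(data_dir):
    """Write two JSON Lines chunks; the first has only empty lists in `tags`."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    chunks = {
        # first chunk: the `tags` column is empty in every row
        "2024-01.jsonl": [{"id": "a", "tags": []}, {"id": "b", "tags": []}],
        # second chunk: same structure, but `tags` is populated
        "2024-02.jsonl": [{"id": "c", "tags": ["x", "y"]}],
    }
    for name, rows in chunks.items():
        with (data_dir / name).open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    return sorted(p.name for p in data_dir.glob("*.jsonl"))
```

After calling `write_toy_chunks("./toy_data")`, the step to check would be `datasets.load_dataset("json", data_dir="./toy_data")`.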
Expected behavior
Some suggestions
as a workaround I have lazily implemented the following (essentially step 2)
This works fine for my use case, but it is potentially slower and less memory-efficient for really large datasets (where this is unlikely to happen in the first place).
Environment info
datasets version: 2.20.0
huggingface_hub version: 0.23.4
fsspec version: 2023.10.0