huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

IterableDataset: cannot resolve features from list of numpy arrays #7100

Open VeryLazyBoy opened 2 months ago

VeryLazyBoy commented 2 months ago

Describe the bug

Resolving the features of an IterableDataset whose examples contain a list of multi-dimensional NumPy arrays fails with pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values.

Traceback (most recent call last):
  File "test.py", line 6
    iter_ds = iter_ds._resolve_features()
  File "lib/python3.10/site-packages/datasets/iterable_dataset.py", line 2876, in _resolve_features
    features = _infer_features_from_batch(self.with_format(None)._head())
  File "lib/python3.10/site-packages/datasets/iterable_dataset.py", line 63, in _infer_features_from_batch
    pa_table = pa.Table.from_pydict(batch)
  File "pyarrow/table.pxi", line 1813, in pyarrow.lib._Tabular.from_pydict
  File "pyarrow/table.pxi", line 5339, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 374, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values

Steps to reproduce the bug

from datasets import Dataset
import numpy as np

# Map each example so that column 'a' holds a list containing one 2-D NumPy array
iter_ds = Dataset.from_dict({'a': [[[1, 2, 3], [1, 2, 3]]]}).to_iterable_dataset().map(lambda x: {'a': [np.array(x['a'])]})
iter_ds = iter_ds._resolve_features()  # raises pyarrow.lib.ArrowInvalid here
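
Until this is fixed, one possible workaround is to return plain nested lists instead of NumPy arrays from the map function. The traceback shows the batch going straight to pa.Table.from_pydict, which suggests that pyarrow refuses the 2-D array nested in a Python list. The sketch below rests on that assumption and has not been checked against the library. to_nested_lists is a hypothetical helper, not part of datasets; it converts anything exposing a tolist() method (as NumPy arrays do) and walks lists, tuples and dicts. The demo uses the standard library's array type only so that it runs without extra dependencies.

```python
from array import array


def to_nested_lists(value):
    """Recursively replace array-like objects with plain Python lists.

    Hypothetical helper, not part of `datasets`. Objects exposing a
    `tolist()` method (for example NumPy arrays) are converted; dicts,
    lists and tuples are walked recursively; other values pass through.
    """
    if isinstance(value, dict):
        return {key: to_nested_lists(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_nested_lists(item) for item in value]
    if hasattr(value, "tolist"):
        return value.tolist()
    return value


# Stand-in demo: stdlib arrays play the role of NumPy arrays here.
example = {"a": [array("i", [1, 2, 3]), array("i", [1, 2, 3])]}
print(to_nested_lists(example))
```

In the reproduction above, this would mean returning {'a': to_nested_lists([np.array(x['a'])])} from the map function, so that the batch handed to pyarrow contains only nested Python lists.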

Expected behavior

The features should be resolved successfully, without raising an error.

Environment info

vishalmaurya850 commented 2 weeks ago

Please assign this issue to me for Hacktoberfest and add the hacktoberfest label to it.