huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array #6399

Open y-hwang opened 12 months ago

y-hwang commented 12 months ago

Describe the bug

Hi, I am preprocessing a large custom dataset that contains numpy arrays. I hit this TypeError while results are being written inside a dataset.map() call. I've tried decreasing the writer batch size, but the error persists. It does not occur for smaller datasets.

Thank you!

Steps to reproduce the bug

Traceback (most recent call last):
  File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3493, in _map_single
    writer.write_batch(batch)
  File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/arrow_writer.py", line 555, in write_batch
    arrays.append(pa.array(typed_sequence))
  File "pyarrow/array.pxi", line 243, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/arrow_writer.py", line 184, in __arrow_array__
    out = numpy_to_pyarrow_listarray(data)
  File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/features/features.py", line 1394, in numpy_to_pyarrow_listarray
    values = pa.ListArray.from_arrays(offsets, values)
  File "pyarrow/array.pxi", line 2004, in pyarrow.lib.ListArray.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array

Expected behavior

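The values passed to pa.ListArray.from_arrays should be a pyarrow.lib.Array, not a ChunkedArray, so that the batch is written without a TypeError.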

Environment info

datasets v2.14.5
arrow v1.2.3
pyarrow v12.0.1
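
For anyone adding their own environment details to this thread, the following is one possible way to collect the same version information from Python. The helper name package_versions is hypothetical, and the package list simply mirrors the three packages named above.

```python
from importlib import metadata


def package_versions(names):
    """Return a dict mapping each package name to its installed version."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of raising.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in package_versions(["datasets", "arrow", "pyarrow"]).items():
        print(f"{name}: {version}")
```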

diakt commented 4 months ago

Seconding this; I am encountering the same issue.