Hi, I am preprocessing a large custom dataset with numpy arrays. I am running into this TypeError during writing in a dataset.map() function. I've tried decreasing writer batch size, but this error persists. This error does not occur for smaller datasets.
Thank you!
Steps to reproduce the bug
Traceback (most recent call last):
File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, kwds))
File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in _write_generator_to_queue
for i, result in enumerate(func(kwargs)):
File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3493, in _map_single
writer.write_batch(batch)
File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/arrow_writer.py", line 555, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 243, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/arrow_writer.py", line 184, in __arrow_array__
out = numpy_to_pyarrow_listarray(data)
File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/features/features.py", line 1394, in numpy_to_pyarrow_listarray
values = pa.ListArray.from_arrays(offsets, values)
File "pyarrow/array.pxi", line 2004, in pyarrow.lib.ListArray.from_arrays
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
Describe the bug
Hi, I am preprocessing a large custom dataset with numpy arrays. I am running into this TypeError during writing in a dataset.map() function. I've tried decreasing writer batch size, but this error persists. This error does not occur for smaller datasets.
Thank you!
Steps to reproduce the bug
Traceback (most recent call last): File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker result = (True, func(*args, kwds)) File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in _write_generator_to_queue for i, result in enumerate(func(kwargs)): File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3493, in _map_single writer.write_batch(batch) File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/arrow_writer.py", line 555, in write_batch arrays.append(pa.array(typed_sequence)) File "pyarrow/array.pxi", line 243, in pyarrow.lib.array File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/arrow_writer.py", line 184, in __arrow_array__ out = numpy_to_pyarrow_listarray(data) File "/n/home12/yhwang/.conda/envs/lib/python3.10/site-packages/datasets/features/features.py", line 1394, in numpy_to_pyarrow_listarray values = pa.ListArray.from_arrays(offsets, values) File "pyarrow/array.pxi", line 2004, in pyarrow.lib.ListArray.from_arrays TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array
Expected behavior
Type should not be a ChunkedArray
Environment info
datasets v2.14.5 arrow v1.2.3 pyarrow v12.0.1