>>> ds = ray.data.read_text('xxxxxx.json')
>>> ds.schema()
2024-06-27 16:59:34,573 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-06-27_16-52-09_868831_7603/logs/ray-data
2024-06-27 16:59:34,574 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadText]
Running 0: 0%| | 0/1 [00:00<?, ?it/s]
2024-06-27 17:00:37,810 ERROR streaming_executor_state.py:455 -- An exception was raised from a task of operator "ReadText->SplitBlocks(67)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
2024-06-27 17:00:37,821 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
ray.data.exceptions.SystemException
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.10/site-packages/ray/data/dataset.py", line 2528, in schema
base_schema = self._plan.schema(fetch_if_missing=False)
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/plan.py", line 353, in schema
blocks_with_metadata, _, _ = self.execute_to_iterator()
File "/usr/local/lib/python3.10/site-packages/ray/data/exceptions.py", line 86, in handle_trace
raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(ValueError): ray::ReadText->SplitBlocks(67)() (pid=8006, ip=172.17.0.3)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 414. GiB for an array with shape (4900,) and data type <U22697406
During handling of the above exception, another exception occurred:
ray::ReadText->SplitBlocks(67)() (pid=8006, ip=172.17.0.3)
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 438, in _map_task
for b_out in map_transformer.apply_transform(iter(blocks), ctx):
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 451, in __call__
for block in blocks:
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
for data in iter:
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
yield from self._block_fn(input, ctx)
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 92, in do_read
yield from call_with_retry(
File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 197, in __call__
yield from result
File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 256, in read_task_fn
yield from read_files(read_paths)
File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 222, in read_files
for block in read_stream(f, read_path):
File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/text_datasource.py", line 41, in _read_stream
builder.add(item)
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/delegating_block_builder.py", line 38, in add
self._builder.add(item)
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/table_block.py", line 86, in add
self._compact_if_needed()
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/table_block.py", line 152, in _compact_if_needed
columns = {
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/table_block.py", line 153, in <dictcomp>
key: convert_udf_returns_to_numpy(col) for key, col in self._columns.items()
File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/numpy_support.py", line 102, in convert_udf_returns_to_numpy
raise ValueError(
ValueError: Failed to convert column values to numpy array: (['{"txt": "\\n\\"\\"\\"\\nA suite of tools for dealing with notebooks...\\n\\"\\"\\"\\n\\nimport gtk\\n\\ndef prepNotebook(notebook=None, group=1):\\n \\"\\"\\"\\n Setup a notebook for use in vw...): Unable to allocate 414. GiB for an array with shape (4900,) and data type <U22697406.
>>>
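The 414 GiB figure in the error above can be checked by hand. The following is a minimal sketch of that arithmetic, assuming (as NumPy does for fixed-width `<U` string dtypes) that each character takes 4 bytes and that every element is padded to the length of the longest string. The variable names are illustrative only; the numbers are copied from the error message.

```python
# Figures copied from the error message:
#   "array with shape (4900,) and data type <U22697406"
num_items = 4900               # number of strings in the column
chars_per_item = 22_697_406    # width of the fixed-size string dtype (longest item)
bytes_per_char = 4             # NumPy stores <U strings as 4-byte code points

# Every element is padded to the width of the longest one,
# so the total size is a simple product.
total_bytes = num_items * chars_per_item * bytes_per_char
total_gib = total_bytes / 2**30

print(f"{total_bytes:,} bytes is about {total_gib:.1f} GiB")
```

So a single line of roughly 22.7 million characters among the 4900 buffered lines appears to be enough to trigger the allocation, even if the other lines are short.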
# Type and length of the column, and of its first item:
type(udf_return_col)=<class 'list'> len(udf_return_col)=4900
type(udf_return_col[0])=<class 'str'> len(udf_return_col[0])=2576
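The first item is only 2576 characters long, while the dtype in the error implies that at least one item in the batch is about 22.7 million characters. A quick way to confirm that the input file contains such a line is to scan it with plain Python. This is a rough diagnostic sketch, not part of Ray; the helper name `longest_line` is hypothetical and the file is assumed to be UTF-8.

```python
def longest_line(path, encoding="utf-8"):
    """Return (line_number, length_in_chars) of the longest line in a text file.

    Reads the file line by line, so only one line is held in memory at a time.
    Line numbers are 1-based; an empty file yields (0, 0).
    """
    best_no, best_len = 0, 0
    with open(path, "r", encoding=encoding) as f:
        for line_no, line in enumerate(f, start=1):
            n = len(line.rstrip("\n"))  # ignore the trailing newline
            if n > best_len:
                best_no, best_len = line_no, n
    return best_no, best_len
```

Calling `longest_line("xxxxxx.json")` on the file from the session above would show which line is the outlier and how long it is.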
What happened + What you expected to happen
Loading a large text file with ray.data.read_text crashes. I only want to repartition the data without parsing each line as JSON, in order to save CPU and memory. I know that I could use read_json instead.
Versions / Dependencies
Reproduction script
Issue Severity
High: It blocks me from completing my task.