ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Ray component: data] `ray.data.read_text` raises `numpy.core._exceptions._ArrayMemoryError: Unable to allocate` #46293

Open Ox0400 opened 3 months ago

Ox0400 commented 3 months ago

What happened + What you expected to happen

`ray.data.read_text` crashes when loading a large text file.

I only want to repartition the file without parsing each line as JSON, to save CPU and memory. I know I could use `read_json` instead.

Versions / Dependencies

ray==2.31.0

Reproduction script

root@a135306f9a92:/var/work# du -sh xxxx.json
447M    xxxx.json
root@a135306f9a92:/var/work# wc -l xxxx.json
51776 xxxx.json
root@a135306f9a92:/var/work# 
>>> import ray
>>> ds = ray.data.read_text('xxxxxx.json')
>>> ds.schema()
2024-06-27 16:59:34,573 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-06-27_16-52-09_868831_7603/logs/ray-data
2024-06-27 16:59:34,574 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadText]
Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
2024-06-27 17:00:37,810 ERROR streaming_executor_state.py:455 -- An exception was raised from a task of operator "ReadText->SplitBlocks(67)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
2024-06-27 17:00:37,821 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
ray.data.exceptions.SystemException

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/ray/data/dataset.py", line 2528, in schema
    base_schema = self._plan.schema(fetch_if_missing=False)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/plan.py", line 353, in schema
    blocks_with_metadata, _, _ = self.execute_to_iterator()
  File "/usr/local/lib/python3.10/site-packages/ray/data/exceptions.py", line 86, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(ValueError): ray::ReadText->SplitBlocks(67)() (pid=8006, ip=172.17.0.3)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 414. GiB for an array with shape (4900,) and data type <U22697406

During handling of the above exception, another exception occurred:

ray::ReadText->SplitBlocks(67)() (pid=8006, ip=172.17.0.3)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 438, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 451, in __call__
    for block in blocks:
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
    yield from self._block_fn(input, ctx)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 92, in do_read
    yield from call_with_retry(
  File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 197, in __call__
    yield from result
  File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 256, in read_task_fn
    yield from read_files(read_paths)
  File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 222, in read_files
    for block in read_stream(f, read_path):
  File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/text_datasource.py", line 41, in _read_stream
    builder.add(item)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/delegating_block_builder.py", line 38, in add
    self._builder.add(item)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/table_block.py", line 86, in add
    self._compact_if_needed()
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/table_block.py", line 152, in _compact_if_needed
    columns = {
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/table_block.py", line 153, in <dictcomp>
    key: convert_udf_returns_to_numpy(col) for key, col in self._columns.items()
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/numpy_support.py", line 102, in convert_udf_returns_to_numpy
    raise ValueError(
ValueError: Failed to convert column values to numpy array: (['{"txt": "\\n\\"\\"\\"\\nA suite of tools for dealing with notebooks...\\n\\"\\"\\"\\n\\nimport gtk\\n\\ndef prepNotebook(notebook=None, group=1):\\n    \\"\\"\\"\\n    Setup a notebook for use in vw...): Unable to allocate 414. GiB for an array with shape (4900,) and data type <U22697406.
>>> 
# Types and lengths of the column and of its first item:
type(udf_return_col)=<class 'list'>  len(udf_return_col)=4900 
type(udf_return_col[0])=<class 'str'> len(udf_return_col[0])=2576
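
The 414 GiB figure in the error can be reproduced from the shape and dtype it reports. The sketch below is only an illustration of how NumPy sizes a fixed-width unicode array, not a description of Ray's internals: the row count and string width come from the traceback, while `lines`, `fixed`, and `as_objects` are hypothetical names and data introduced for the demonstration.

```python
import numpy as np

# Values taken from the error message above.
num_rows = 4900
max_chars = 22_697_406  # the width in "<U22697406"

# NumPy's fixed-width unicode dtype uses 4 bytes per character, and every
# element is padded to the width of the longest string in the array.
itemsize = np.dtype(f"<U{max_chars}").itemsize  # bytes per element
total_bytes = num_rows * itemsize
total_gib = total_bytes / 2**30
print(f"{total_gib:.1f} GiB")

# Small-scale demonstration with hypothetical data: one long line inflates
# the storage of every short line in the same array.
lines = ["short"] * 9 + ["x" * 1000]
fixed = np.array(lines)                     # fixed-width unicode array
as_objects = np.array(lines, dtype=object)  # array of references to str objects
print(fixed.dtype, fixed.nbytes, as_objects.nbytes)
```

So a single very long line among the 4900 buffered rows would be enough to explain the allocation size, even though the first item is only 2576 characters long. An object-dtype array avoids the padding, but whether that is the appropriate fix inside Ray Data is not something this report establishes.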

Issue Severity

High: It blocks me from completing my task.

Ox0400 commented 3 months ago

https://github.com/ray-project/ray/pull/46298#issuecomment-2194580368

Ox0400 commented 3 months ago

😊