ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Data] Python JSON fallback not supporting JSONL #48235

Open pcmoritz opened 1 month ago

pcmoritz commented 1 month ago

What happened + What you expected to happen

See the repro below. I would have expected the Python fallback to parse the file as JSONL (and then fail because the rows don't have the expected format).

Versions / Dependencies

Ray 2.38.0

Reproduction script

First, create a file with the following contents:

["hello", "world"]
["ray", "rocks"]

Save it as data.jsonl and then run:

import ray.data
ds = ray.data.read_json("data.jsonl")
ds.count()

This fails with an error like:

RayTaskError(JSONDecodeError): ray::ExpandPaths() (pid=28783, ip=10.0.25.228)
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/anyscale/data/_internal/readers/json_reader.py", line 113, in _read_with_pyarrow_read_json
    raise e
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/anyscale/data/_internal/readers/json_reader.py", line 82, in _read_with_pyarrow_read_json
    table = pa.json.read_json(
            ^^^^^^^^^^^^^^^^^^
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Column() changed from object to array in row 0

During handling of the above exception, another exception occurred:

ray::ExpandPaths() (pid=28783, ip=10.0.25.228)
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 464, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
    yield from self._block_fn(input, ctx)
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/anyscale/data/_internal/planner/plan_expand_paths_op.py", line 100, in expand_paths
    encoding_ratio = _estimate_encoding_ratio(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/anyscale/data/_internal/planner/plan_expand_paths_op.py", line 193, in _estimate_encoding_ratio
    in_memory_size = logical_op.reader.estimate_in_memory_size(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/anyscale/data/_internal/readers/native_file_reader.py", line 166, in estimate_in_memory_size
    first_batch = next(batches)
                  ^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/anyscale/data/_internal/readers/native_file_reader.py", line 90, in read_paths
    yield from _read_paths(paths)
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/anyscale/data/_internal/readers/native_file_reader.py", line 76, in _read_paths
    for batch in self.read_stream(file, path):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/anyscale/data/_internal/readers/json_reader.py", line 52, in read_stream
    yield from self._read_with_python_json(buffer)
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/anyscale/data/_internal/readers/json_reader.py", line 120, in _read_with_python_json
    parsed_json = json.load(BytesIO(buffer))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/json/__init__.py", line 293, in load
    return loads(fp.read(),
           ^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 19)

This is because the Python fallback calls json.load on the whole buffer, which accepts only a single JSON document, so it doesn't support JSONL: https://github.com/ray-project/ray/blob/002908ff57e3d64c5fa580d264f7389f26167340/python/ray/data/_internal/datasource/json_datasource.py#L108
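
To illustrate what a JSONL-aware fallback could look like, here is a minimal sketch using only the standard library. It is not Ray's implementation: the function name `parse_json_or_jsonl` is hypothetical, and treating the "Extra data" error as the signal for JSONL is an assumption made for this example.

```python
import json
from typing import Any, List


def parse_json_or_jsonl(buffer: bytes) -> List[Any]:
    """Parse a buffer as one JSON document, falling back to JSONL.

    Hypothetical helper for illustration; not part of Ray.
    """
    text = buffer.decode("utf-8")
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as err:
        # "Extra data" means a valid document was followed by more content,
        # which is how a multi-line JSONL file looks to json.loads.
        if err.msg != "Extra data":
            raise
        # Parse each non-empty line as its own JSON document.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A single document: normalize to a list of rows.
    return parsed if isinstance(parsed, list) else [parsed]


# The file contents from the repro above.
data = b'["hello", "world"]\n["ray", "rocks"]\n'
rows = parse_json_or_jsonl(data)
print(rows)
```

With this approach the repro file parses into two rows, each a list, so any complaint about the rows not being objects would have to come from a later step. One limitation of the sketch: a JSONL file with a single line that holds an array is indistinguishable from a regular JSON array, so a real fallback would probably need to look at the file extension or take an explicit option.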

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Superskyyy commented 3 weeks ago

Makes sense. But in which practical cases would PyArrow's read_json fail here while a native-Python JSONL fallback would work? @pcmoritz

Superskyyy commented 3 weeks ago

I also wonder whether we could use pandas read_json and convert the result to PyArrow as the fallback, instead of implementing a generic fallback in native Python.

Superskyyy commented 2 weeks ago

@pcmoritz Friendly ping for input.