rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] `cudf.read_json` does not raise an exception with invalid data when `lines=True` and `engine='cudf'` #15820

Open dagardner-nv opened 6 months ago

dagardner-nv commented 6 months ago

Describe the bug

cudf.read_json doesn't raise an exception when parsing invalid JSON with lines=True and engine='cudf'. Instead, it returns a single-row DataFrame with an empty string value.

Setting lines=False raises a RuntimeError (it should be a ValueError). Alternatively, setting engine='pandas' raises a ValueError.

Steps/Code to reproduce bug

from io import StringIO

import cudf

print(cudf.__version__)
invalid_payload = '{"not_valid":"json'

# Produces a single row DF
print("Testing lines=True, engine=cudf")
print(cudf.read_json(StringIO(invalid_payload), lines=True, engine='cudf'))

# Raises a RuntimeError (ideally this would be a ValueError)
print("Testing lines=False, engine=cudf")
try:
    cudf.read_json(StringIO(invalid_payload), lines=False, engine='cudf')
except Exception as e:
    print(e)

# Raises a ValueError, as expected
print("Testing lines=True, engine=pandas")
try:
    cudf.read_json(StringIO(invalid_payload), lines=True, engine='pandas')
except Exception as e:
    print(e)

Expected behavior

A ValueError should be raised, although any exception is better than silently returning a DataFrame.

Environment overview

Observed in versions 24.04.01 and 24.02.02
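
Until the reader rejects such input itself, one possible stopgap is to validate each line on the host before handing the payload to cudf.read_json. The sketch below uses only the standard-library json module; the helper name check_json_lines is hypothetical and not part of cudf. It parses every line on the CPU, so it is only reasonable for small payloads.

```python
import json


def check_json_lines(text):
    """Raise ValueError if any non-blank line of `text` is not valid JSON.

    Hypothetical helper (not part of cudf): a host-side pre-check intended
    to run before cudf.read_json(..., lines=True, engine='cudf').
    """
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {lineno}: {exc.msg}") from exc
    return text


invalid_payload = '{"not_valid":"json'
try:
    check_json_lines(invalid_payload)
except ValueError as e:
    print(e)
```

With the payload from the repro above, this raises a ValueError instead of letting the truncated record through.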

shrshi commented 1 week ago

Debugging update: On further investigation using a C++ repro with the same input as the Python repro above, both the lines=True and lines=False cases produce the tokens StructBegin StructMemberBegin FieldBegin FieldEnd StringBegin. When lines=False, the exception is thrown after the device_json_column is constructed. I think we need to add an additional condition for invalid lines in the JSONL case.

shrshi commented 4 days ago

Apart from the Error Token test that we have in the node tree algorithms, we also need to verify that node levels are complete, i.e., ensure that every Begin token has a matching End token. Such a test can be implemented with device-side prefix sum operations on the token list, and should resolve this bug.
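
To illustrate the idea, here is a minimal host-side sketch in plain Python, not the libcudf implementation. It assumes each token is represented by its name as a string, and that the closing tokens are named StringEnd, StructMemberEnd and StructEnd by analogy with the Begin tokens listed above. itertools.accumulate stands in for the device-side prefix sum, and the function names are hypothetical.

```python
from itertools import accumulate


def nesting_deltas(tokens):
    """Map each token to +1 (opens a level), -1 (closes one) or 0."""
    deltas = []
    for tok in tokens:
        if tok.endswith("Begin"):
            deltas.append(1)
        elif tok.endswith("End"):
            deltas.append(-1)
        else:
            deltas.append(0)
    return deltas


def is_balanced(tokens):
    """Check nesting depth with an inclusive prefix sum over the deltas.

    The stream is balanced if the depth never drops below zero and
    returns to zero after the last token.
    """
    depths = list(accumulate(nesting_deltas(tokens)))
    if not depths:
        return True
    return min(depths) >= 0 and depths[-1] == 0


# Token stream reported for the truncated input '{"not_valid":"json'
truncated = ["StructBegin", "StructMemberBegin", "FieldBegin", "FieldEnd", "StringBegin"]
# The same stream with the assumed closing tokens appended
complete = truncated + ["StringEnd", "StructMemberEnd", "StructEnd"]

print(is_balanced(truncated))  # False: depth ends at 3
print(is_balanced(complete))   # True: depth returns to 0
```

This only checks nesting depth, not that each End token has the same type as the Begin token it closes. For JSONL, the check would presumably be applied per line so that one bad record can be flagged without rejecting the others.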