rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] cuDF.read_json fails with cudaErrorInvalidValue invalid argument #17068

Open ayushdg opened 1 week ago

ayushdg commented 1 week ago

Describe the bug

cudf.read_json fails on a specific file in my dataset.

Steps/Code to reproduce bug

import cudf

cudf.read_json("/path/to/file.json.gz", lines=True)

RuntimeError: CUDA error encountered at: /__w/cudf/cudf/cpp/src/io/json/read_json.cu:318: 1 cudaErrorInvalidValue invalid argument

Expected behavior

import pandas as pd
pd.read_json("/path/to/file.json.gz", lines=True) # works

Environment overview (please complete the following information)

Environment details

cudf 24.08 and 24.12 (nightly). I haven't checked 24.10, but since both 24.08 and 24.12 fail, I suspect the issue applies there too.

Additional context

Data here: 2022-33_1303_en_all.json.gz

shrshi commented 6 days ago

On further investigation, this bug occurs because the size of the device buffer required to store the uncompressed data is underestimated. Proposed solution: (i) get an estimate of the uncompressed buffer size (falling back to a heuristic if computing such an estimate is expensive), and (ii) use the realloc-and-retry logic from #16687 if the estimate falls short. We can extend this logic to multi-source compressed inputs as well.
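To make the proposal concrete, here is a minimal host-side sketch of the estimate-then-retry idea in plain Python. It is not libcudf code and does not reproduce the logic from #16687. The function names (`estimate_uncompressed_size`, `decompress_with_retry`), the fallback ratio of 4, and the doubling growth policy are all assumptions chosen for illustration. The estimate reads the gzip trailer (ISIZE), which is cheap but only describes the last member modulo 2**32, so it can fall short for multi-member or very large files. That shortfall is the case the retry step covers.

```python
import gzip
import struct
import zlib


def estimate_uncompressed_size(compressed: bytes, fallback_ratio: int = 4) -> int:
    """Cheap estimate of the uncompressed size of a gzip buffer.

    Reads ISIZE from the gzip trailer. ISIZE covers only the last member,
    modulo 2**32, so treat the result as a guess. fallback_ratio is an
    assumed heuristic, not a value taken from cuDF.
    """
    if len(compressed) >= 18:  # 10-byte header + 8-byte trailer
        (isize,) = struct.unpack("<I", compressed[-4:])
        if isize > 0:
            return isize
    return max(1, len(compressed) * fallback_ratio)


def _try_decompress_into(compressed: bytes, buffer: bytearray):
    """Decompress into a fixed-size buffer; return bytes written, or None on overflow."""
    written = 0
    data = compressed
    while data:  # one iteration per gzip member
        member = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        while not member.eof:
            room = len(buffer) - written
            # Request one byte more than fits so an overflow is detectable.
            chunk = member.decompress(data, room + 1)
            if len(chunk) > room:
                return None
            buffer[written:written + len(chunk)] = chunk
            written += len(chunk)
            data = member.unconsumed_tail
            if not chunk and not data and not member.eof:
                raise ValueError("truncated gzip stream")
        data = member.unused_data  # bytes belonging to the next member, if any
    return written


def decompress_with_retry(compressed: bytes, initial_capacity: int, max_retries: int = 32):
    """Allocate a buffer of the estimated size; double it and retry if too small."""
    capacity = max(1, initial_capacity)
    for attempt in range(max_retries):
        buffer = bytearray(capacity)
        written = _try_decompress_into(compressed, buffer)
        if written is not None:
            return bytes(buffer[:written]), attempt
        capacity *= 2  # estimate fell short: reallocate and retry
    raise MemoryError("output buffer still too small after retries")


if __name__ == "__main__":
    payload = b'{"a": 1}\n' * 1000
    blob = gzip.compress(payload) * 2  # two members, so ISIZE under-reports
    data, retries = decompress_with_retry(blob, estimate_uncompressed_size(blob))
    print(len(data), "bytes after", retries, "retries")
```

Doubling keeps the number of retries logarithmic in the shortfall, at the cost of redoing the decompression on each attempt. A real implementation could instead resume from the point of overflow or grow by a measured amount.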