rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] libcudf JSON reader crash with compressed data #16248

Open lithomas1 opened 4 months ago

lithomas1 commented 4 months ago

Describe the bug

The libcudf JSON reader is "crashing" (not sure if it's technically a crash, but I'm getting a CUDA error):

RuntimeError: CUDA error encountered at: /home/coder/cudf/cpp/src/io/json/read_json.cu:142: 1 cudaErrorInvalidValue invalid argument

Steps/Code to reproduce bug

import cudf
import pandas as pd

cudf.read_json("baddf.json.gz", orient="records", lines=True, engine="cudf")  # Doesn't work :(
pd.read_json("baddf.json.gz", orient="records", lines=True)  # OK

Expected behavior

Successful read, like with pandas.

Environment details

My cudf is the latest cudf (from main).

Additional context

I think the issue might be with the specific data values (they are all integers, even the string/floating columns). I'm pretty sure libcudf can write all the data types (even the nested struct/list ones).

baddf.json.gz

Also, if you decompress the file by hand, you are able to read it with cudf.
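For anyone hitting the same error, the manual-decompression observation above can be sketched as a host-side workaround: gunzip the file with Python's standard library and hand the decompressed bytes to cudf. This is a minimal sketch, not a fix for the reader itself. The helper names `gunzip_to_buffer` and `read_json_lines_via_host_gunzip` are hypothetical, and it assumes `cudf.read_json` accepts a file-like buffer and that the whole file fits in host memory. It has not been checked against the attached file.

```python
import gzip
import io


def gunzip_to_buffer(path):
    """Decompress a gzip file on the host and return its bytes in a BytesIO."""
    with gzip.open(path, "rb") as f:
        return io.BytesIO(f.read())


def read_json_lines_via_host_gunzip(path):
    """Hypothetical helper: read a gzipped JSON Lines file with cudf
    after decompressing it on the host, bypassing libcudf's own
    decompression path."""
    # Imported lazily so gunzip_to_buffer can be used without a GPU.
    import cudf

    buf = gunzip_to_buffer(path)
    # Assumption: cudf.read_json accepts a file-like object here.
    return cudf.read_json(buf, orient="records", lines=True, engine="cudf")
```

Calling `read_json_lines_via_host_gunzip("baddf.json.gz")` would then take the place of the failing `cudf.read_json` call in the reproducer.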

wence- commented 4 months ago

Compute-sanitizer:

========= COMPUTE-SANITIZER
========= Program hit cudaErrorInvalidValue (error 1) due to "invalid argument" on CUDA API call to cudaMemcpyAsync.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x445b06]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaMemcpyAsync [0x6dabf]
=========                in /home/coder/.conda/envs/rapids/lib/libcudart.so.12
=========     Host Frame:cudf::io::json::detail::ingest_raw_input(cudf::device_span<char, 18446744073709551615ul>, cudf::host_span<std::unique_ptr<cudf::io::datasource, std::default_delete<cudf::io::datasource> >, 18446744073709551615ul>, cudf::io::compression_type, unsigned long, unsigned long, rmm::cuda_stream_view) [0x1decfdf]
=========                in /home/coder/cudf/cpp/build/conda/cuda-12.2/release/libcudf.so
=========     Host Frame:cudf::io::json::detail::get_record_range_raw_input(cudf::host_span<std::unique_ptr<cudf::io::datasource, std::default_delete<cudf::io::datasource> >, 18446744073709551615ul>, cudf::io::json_reader_options const&, rmm::cuda_stream_view) [0x1dee514]
=========                in /home/coder/cudf/cpp/build/conda/cuda-12.2/release/libcudf.so
=========     Host Frame:cudf::io::json::detail::read_batch(cudf::host_span<std::unique_ptr<cudf::io::datasource, std::default_delete<cudf::io::datasource> >, 18446744073709551615ul>, cudf::io::json_reader_options const&, rmm::cuda_stream_view, cuda::mr::__4::basic_resource_ref<(cuda::mr::__4::_AllocType)1, cuda::mr::__4::device_accessible>) [0x1deeb75]
=========                in /home/coder/cudf/cpp/build/conda/cuda-12.2/release/libcudf.so
=========     Host Frame:cudf::io::json::detail::read_json(cudf::host_span<std::unique_ptr<cudf::io::datasource, std::default_delete<cudf::io::datasource> >, 18446744073709551615ul>, cudf::io::json_reader_options const&, rmm::cuda_stream_view, cuda::mr::__4::basic_resource_ref<(cuda::mr::__4::_AllocType)1, cuda::mr::__4::device_accessible>) [0x1df030a]
=========                in /home/coder/cudf/cpp/build/conda/cuda-12.2/release/libcudf.so
=========     Host Frame:cudf::io::read_json(cudf::io::json_reader_options, rmm::cuda_stream_view, cuda::mr::__4::basic_resource_ref<(cuda::mr::__4::_AllocType)1, cuda::mr::__4::device_accessible>) [0x1d2de18]
=========                in /home/coder/cudf/cpp/build/conda/cuda-12.2/release/libcudf.so

GregoryKimball commented 3 months ago

Thank you @lithomas1 for sharing this issue. We haven't done much testing with compressed JSON inputs. There could be a straightforward solution here, and we will take a closer look as soon as we can.