
[BUG] double free or memory corruption when parsing some JSON #15750

Closed: revans2 closed this issue 4 months ago

revans2 commented 4 months ago

Describe the bug

We recently saw a SIGSEGV when trying to parse some customer data as JSON. As I stripped down and obfuscated the data, the failure switched to the process aborting with the error:

double free or corruption (!prev)
Aborted (core dumped)

The code to reproduce this is:

TEST_F(JsonReaderTest, JsonLinesCrash)
{
  std::string const fname = "from_json_1_10_simplified.json";

  // JSON Lines reader with error recovery and normalization options enabled.
  cudf::io::json_reader_options options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{fname})
    .lines(true)
    .recovery_mode(cudf::io::json_recovery_mode_t::RECOVER_WITH_NULL)
    .normalize_single_quotes(true)
    .normalize_whitespace(true)
    .mixed_types_as_string(true)
    .keep_quotes(true);

  auto result = cudf::io::read_json(options);
}

and the file is from_json_1_10_simplified.json

revans2 commented 4 months ago

Just be aware that I have hit other issues with this data too: IllegalMemoryAccess errors, a hang where it looks like we are in an infinite loop, and exceptions saying that the data looks to be incorrectly formatted (top-level arrays mixed with structs appear to make this very unhappy).

I am happy to share more data/failures as needed. It just takes a lot of work to obfuscate it properly and still have an appropriate repro case. I suspect that they are almost all related to each other in some form.

shrshi commented 4 months ago

Thank you for the repro and the dataset, @revans2. I'm investigating the segfault and will get back to you soon.

karthikeyann commented 4 months ago

I ran compute-sanitizer with the repro code, but there are no CUDA errors. Running under gdb, I got the line of the segfault:

#16 operator()<int, cudf::io::json::device_json_column> (col=..., i=16, __closure=<synthetic pointer>) at /home/coder/cudf/cpp/src/io/json/json_column.cu:597
597         col.child_columns.clear();  // their references should be deleted too.

If this line is commented out, no segfault occurs.

karthikeyann commented 4 months ago

Minimal repro with valgrind errors:

{ "Root": { "Key": [ { "EE": "A" } ] } }
{ "Root": { "Key": {  } } }
{ "Root": { "Key": [{ "YY": 1}] } }

The issue is not in the clear() call itself, but in accessing a deleted reference to col. When clear() is called, the child columns are deleted, but references to them are still stored in columns; the segfault happens when one of those destroyed objects is accessed through a stale reference. This algorithm could be rewritten as a graph algorithm to avoid the issue.
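
To make the failure mode concrete, here is a minimal self-contained C++ sketch of the same pattern (the types are hypothetical stand-ins, not cuDF's actual classes): a side table holds std::reference_wrapper entries pointing into a tree of columns, the subtree is destroyed via clear(), and any later access through the stale reference is a use-after-free.

#include <functional>
#include <iostream>
#include <map>
#include <string>

// Hypothetical stand-in for cudf::io::json::device_json_column.
struct column {
  std::string name;
  std::map<std::string, column> child_columns;
};

int main()
{
  column root{"Root", {}};
  root.child_columns.emplace("Key", column{"Key", {}});

  // Side table of references into the tree, similar in spirit to how the
  // column-creation code tracks columns it has already visited.
  std::map<std::string, std::reference_wrapper<column>> columns;
  columns.emplace("Root.Key", root.child_columns.at("Key"));

  // Mixed-type handling clears the children ...
  root.child_columns.clear();

  // ... but the side table still refers to the destroyed object.
  // Reading (or worse, mutating) through it is a use-after-free, which
  // valgrind flags and which can surface as "double free or corruption".
  std::cout << columns.at("Root.Key").get().name << '\n';  // UB!
}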

karthikeyann commented 4 months ago

Created a fix for the memory errors in #15798. The long-term fix should be to rewrite the device tree creation algorithm.

karthikeyann commented 4 months ago

Discussed the bug offline with @shrshi, along with the need to refactor make_device_json_column() in json_column.cu#L477.

The complexity of make_device_json_column increased as more features were added, and the current algorithm does not use any graph data structure. It uses vectors to keep track of ignored columns, mixed types, and pruned columns, and it relies on references into the column tree. This code could be refactored so that new features are easier to add.
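
As one possible shape for that refactor (my own sketch, not a design the team has agreed on), the tree could own all nodes in a single flat vector and use indices for edges, so removing or ignoring a subtree unlinks indices rather than leaving dangling references:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical index-based column tree: nodes live in one owning vector,
// edges are indices, so "deleting" a subtree can never dangle.
struct column_tree {
  struct node {
    std::string name;
    std::map<std::string, std::size_t> children;  // child name -> node index
    bool ignored = false;  // pruned/mixed-type columns are marked, not destroyed
  };

  std::vector<node> nodes;  // nodes[0] is the root

  std::size_t add_child(std::size_t parent, std::string name)
  {
    nodes.push_back(node{name, {}, false});
    std::size_t const id = nodes.size() - 1;
    nodes[parent].children.emplace(std::move(name), id);
    return id;
  }

  // Unlink and mark an entire subtree; the storage stays valid, so a stale
  // index is detectable instead of corrupting memory.
  void ignore_subtree(std::size_t parent)
  {
    std::vector<std::size_t> stack;
    for (auto const& [name, id] : nodes[parent].children) stack.push_back(id);
    nodes[parent].children.clear();
    while (!stack.empty()) {
      auto const id = stack.back();
      stack.pop_back();
      nodes[id].ignored = true;
      for (auto const& [name, child] : nodes[id].children) stack.push_back(child);
      nodes[id].children.clear();
    }
  }
};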

Here is the list of features currently supported by make_device_json_column, plus a few other features that still need to be supported.

  1. Constructs the nested tree from the parent_col_id vector (see the sketch after this list).
  2. Handles arrays of arrays too (pandas feature), e.g. [[1, 2, 3], [4, 5, 6]].
  3. Null literals are ignored (pandas).
  4. Mixed-type support (Spark).
    • The current implementation returns mixed types as strings. @revans2 clarified that mixed types need to return the requested type and ignore all other types (e.g. if a list is requested, all struct, string, and value types need to be ignored for the same column path).
  5. Any column can be requested as string type, and the corresponding child columns should then be ignored (no extra memory, no extra processing) (Spark).
  6. Column pruning.
  7. TODO: optimization to batch the init_to_zero calls with cub::DeviceMemcpy::Batched (consider batching the scan of offsets too).
  8. OPTIONAL: support for map type (need more details about requirements from the Spark team).
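
For item 1, here is a simplified host-side illustration (my own sketch; the real code would do this on the device with thrust/cub primitives) of turning the parent_col_id vector into explicit child adjacency lists, which is the natural starting point for a graph-based rewrite. It assumes -1 marks the root, matching cuDF's parent_node_sentinel.

#include <cstddef>
#include <cstdint>
#include <vector>

// Build child adjacency lists from parent ids: parent_col_id[i] is the
// parent of column i, with -1 as the root sentinel.
std::vector<std::vector<int32_t>> build_children(std::vector<int32_t> const& parent_col_id)
{
  std::vector<std::vector<int32_t>> children(parent_col_id.size());
  for (std::size_t i = 0; i < parent_col_id.size(); ++i) {
    if (parent_col_id[i] >= 0) children[parent_col_id[i]].push_back(static_cast<int32_t>(i));
  }
  return children;
}

// Example: parent_col_id = {-1, 0, 1, 1} describes
// col 0 (root) -> col 1 -> cols 2 and 3.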