rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] mixed_types_as_string throws exception for nested data with nested STRING schema request #15260

Open revans2 opened 6 months ago

revans2 commented 6 months ago

Describe the bug

This is very similar to https://github.com/rapidsai/cudf/issues/14239. Because that issue is not done yet, it is fine for this one to be treated as a duplicate of it.

In Spark we are handed a read schema and some JSON data. Our goal is to pull out the parts of the JSON data that match the read schema. Strings complicate this, because any type can be coerced into a string. If the data is an array, it is coerced into a string by converting its tokens to a JSON-formatted string; if the data is a dict, it is coerced the same way.
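As a rough illustration of the coercion described above, here is a minimal sketch in plain C++ with no cuDF dependency. `raw_json_value` is a hypothetical helper invented for this example, not a cuDF or Spark API. It only copies the raw text of the value under a key, whereas the real coercion rebuilds the string from tokens. It also assumes one object per line and that the key does not also appear earlier inside a nested value.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper (not part of cuDF): returns the raw JSON text of the
// value stored under `key` in a single-line JSON object, or std::nullopt if
// the key is not found. Simplification: the first occurrence of "key"
// followed by ':' is taken to be the field we want.
std::optional<std::string> raw_json_value(std::string const& line, std::string const& key)
{
  std::string const quoted = "\"" + key + "\"";
  std::size_t pos          = line.find(quoted);
  if (pos == std::string::npos) { return std::nullopt; }
  pos = line.find(':', pos + quoted.size());
  if (pos == std::string::npos) { return std::nullopt; }
  ++pos;
  while (pos < line.size() && line[pos] == ' ') { ++pos; }
  if (pos >= line.size()) { return std::nullopt; }

  std::size_t const begin = pos;
  int depth               = 0;      // nesting level of {} and [] inside the value
  bool in_string          = false;  // true while scanning a quoted string
  for (; pos < line.size(); ++pos) {
    char const c = line[pos];
    if (in_string) {
      if (c == '\\') {
        ++pos;  // skip the escaped character
      } else if (c == '"') {
        in_string = false;
        if (depth == 0) { ++pos; break; }  // the value itself was a string
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      ++depth;
    } else if (c == '}' || c == ']') {
      if (depth == 0) { break; }           // closing brace of the enclosing object
      if (--depth == 0) { ++pos; break; }  // end of the nested value
    } else if (c == ',' && depth == 0) {
      break;  // end of a scalar value
    }
  }
  return line.substr(begin, pos - begin);
}
```

For the two rows used in this issue's reproducer, the helper returns `{"A": 0, "B": 1}` for the first line and `[1,0]` for the second, which is roughly the string content a STRING read schema is meant to receive for a dict and an array.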

mixed_types_as_string was added in part to help make this happen, especially for nested types. But it appears to work only for top-level columns.

  std::string data = "{\"data\": {\"A\": 0, \"B\": 1}}\n{\"data\": [1,0]}\n";

  std::map<std::string, cudf::io::schema_element> data_types;
  std::map<std::string, cudf::io::schema_element> child_types;
  child_types.insert(std::pair{"LIST", cudf::io::schema_element{cudf::data_type{cudf::type_id::STRING, 0}, {}}});
  data_types.insert(std::pair{"data", cudf::io::schema_element{cudf::data_type{cudf::type_id::LIST, 0}, child_types}});

  cudf::io::json_reader_options in_options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{data.data(), data.size()})
      .dtypes(data_types)
      .recovery_mode(cudf::io::json_recovery_mode_t::RECOVER_WITH_NULL)
      .normalize_single_quotes(true)
      .normalize_whitespace(true)
      .mixed_types_as_string(true)
      .keep_quotes(true)
      .lines(true);
  cudf::io::table_with_metadata result = cudf::io::read_json(in_options);

This throws an exception about trying to create a nested column using a fixed-width column factory.

C++ exception with description "CUDF failure at: .../cpp/include/cudf/column/column_factories.hpp:342: Invalid, non-fixed-width type." thrown in the test body.

revans2 commented 6 months ago

Just to be clear, this also happens for schemas that have a STRUCT with strings in it, not just LISTs.

karthikeyann commented 2 weeks ago

Created PR #16731 as a fix.

karthikeyann commented 2 days ago

In the repro's child_types, the list child should be keyed "element" instead of "LIST".

Besides that, PR #16545 will fix this issue (the repro is added as a unit test in that PR).