rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] `read_json` should output all-nulls columns for the schema columns that do not match with the input JSON #17341

Open ttnghia opened 4 hours ago

ttnghia commented 4 hours ago

This is similar to https://github.com/rapidsai/cudf/issues/17091, but not the same. Currently, when the input JSON data contains a column with the same name as a column in the input schema, `read_json` outputs that column without checking whether its data type matches the schema. For example, with the following input:

JSON data: {"a" : [1]}
Schema: STRUCT<a: LIST<STRUCT<INT>>>

Then `read_json` will output a LIST<INT8> column. The correct output should be an all-null column instead, because the list elements in the data are integers, not structs as the schema requires.
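To make the requested behavior concrete, here is a minimal pure-Python sketch. It does not use cuDF and is not how `read_json` is implemented. The tuple-based schema encoding, the helper names `matches` and `read_columns`, and the struct field name `x` are all hypothetical, introduced only for illustration. It also assumes that a single mismatching value nulls out the whole column.

```python
import json

# Hypothetical schema encoding (NOT a cuDF type):
#   ("int",), ("list", element_schema), ("struct", {field_name: field_schema})

def matches(value, schema):
    """Return True if a parsed JSON value is null or has the shape the schema describes."""
    if value is None:
        return True
    kind = schema[0]
    if kind == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "list":
        return isinstance(value, list) and all(matches(v, schema[1]) for v in value)
    if kind == "struct":
        return isinstance(value, dict) and all(
            matches(value.get(name), child) for name, child in schema[1].items()
        )
    raise ValueError(f"unknown schema kind: {kind}")

def read_columns(json_lines, columns):
    """Parse JSON Lines input and return {column_name: values}.

    Assumption: a column whose data does not match the schema type
    becomes all-null instead of being returned with its actual type.
    """
    rows = [json.loads(line) for line in json_lines]
    out = {}
    for name, schema in columns.items():
        values = [row.get(name) for row in rows]
        if all(matches(v, schema) for v in values):
            out[name] = values
        else:
            out[name] = [None] * len(rows)
    return out

# Data holds a list of ints, but the schema asks for a list of structs.
mismatched = read_columns(['{"a" : [1]}'], {"a": ("list", ("struct", {"x": ("int",)}))})
# Same data with a schema that does match.
matched = read_columns(['{"a" : [1]}'], {"a": ("list", ("int",))})
```

In this sketch `mismatched["a"]` contains only nulls, while `matched["a"]` keeps the parsed list. Whether the real reader should null out the whole column or only the offending rows is a design decision for the actual fix; the sketch follows the all-null column wording above.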

ttnghia commented 4 hours ago

Addressing this will also be the long-term fix for https://github.com/NVIDIA/spark-rapids/issues/10901.