rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] JSON reader has no option to return the columns only for the requested schema #13473

Closed revans2 closed 17 hours ago

revans2 commented 1 year ago

Describe the bug

Pandas and Spark differ substantially in what they return when reading a JSON file. In pandas you provide what are essentially type hints, but it returns data for every column in the file, and only for columns in the file.

Spark has a schema that it wants, and when given that schema it reads only the columns that match it and nothing else.

This causes two problems for Spark. The first one is a performance issue. If we want to read only one column out of 100, CUDF is going to materialize 100 columns and then we are going to throw away 99 of them. It would be great if we didn't need the memory for all of the extra columns or the computation needed to create them.

~The second problem is really an odd corner case. CUDF does not support a Table with just rows, but Spark does. So if there is a JSON file with no columns, but rows.~ ~{}~ ~{}~ ~We have no way to actually read that without some help.~ (see https://github.com/rapidsai/cudf/issues/5712)

It really would be nice to have a mode similar to how parquet, ORC, or CSV work where only the columns that are requested are returned and all of the requested columns are returned, even if the values are all nulls.
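The difference between the two behaviors can be sketched in plain Python. This is only an illustration of the semantics described above, not cuDF, pandas, or Spark code; the function names `read_all_columns` and `read_with_schema` and the sample data are made up for the example.

```python
import json

def read_all_columns(lines):
    """pandas-like: return every column found in the data, and only those."""
    rows = [json.loads(line) for line in lines if line.strip()]
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return {name: [row.get(name) for row in rows] for name in names}

def read_with_schema(lines, schema):
    """Spark-like: return exactly the requested columns, null-filled if absent."""
    rows = [json.loads(line) for line in lines if line.strip()]
    return {name: [row.get(name) for row in rows] for name in schema}

# Hypothetical sample data: column "d" is requested but never appears.
data = ['{"a": 1, "b": "x"}', '{"a": 2, "c": 3.5}']
everything = read_all_columns(data)            # columns a, b, c
projected = read_with_schema(data, ["a", "d"]) # columns a, d (d is all nulls)
```

With a schema, the output shape depends only on the request: `b` and `c` are dropped, and `d` comes back as a column of nulls.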

GregoryKimball commented 1 year ago

Thank you @revans2 for posting this. Would you please share whether you were exploring JSON file reads or JSON strings column parsing when you encountered these issues?

The two topics about column projection and empty rows seem like they could be different issues. Would you please let me know if there is a reason to combine them?

karthikeyann commented 1 year ago

The peak memory usage of the JSON reader will not be reduced if we add this feature. The JSON tokenizer, tree construction, and traversal will still be the same. Only datatype parsing will be eliminated for non-selected columns (this saves some time, but not that much).

A quick analysis of the JSON reader benchmark: in the attached screenshot of a benchmark run with 64 columns and a 1.28 GB file, datatype parsing takes ~21% of the time (green). Also note that get_token_stream and get_tree_representation account for the peak memory usage, not datatype parsing (in green). [image] The speedup could be <20%.

revans2 commented 1 year ago

@GregoryKimball this was when we were reading a file, but it is the same for both code paths in this case.

@karthikeyann

Thanks for the info. I understand that this might not have a huge impact on memory usage or computation time. The majority of the memory usage and computation would go into tokenizing and parsing the data. But even then, consider a case like the following:

{"B": true, "A": [100, 200, 300, 400, 500.... hundreds of values ...]}
... thousands of rows ...
{"B": false, "A": [1, 2, 3, 4, 5...]}

If all I save is not needing to materialize the output for A, that is a win. It may not be something that you want to prioritize very highly from a performance standpoint.
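As a rough illustration of that saving, the sketch below parses every line in full but builds output columns only for the selected keys, then counts how many values end up in the output. It is plain Python with made-up sizes (1000 rows, 500 list elements per row), the helper names `materialize` and `leaf_count` are hypothetical, and it says nothing about actual cuDF timings or memory use.

```python
import json

def materialize(lines, selected=None):
    """Parse every line, but build output columns only for `selected` keys.

    Each line is still fully parsed, so the tokenizing cost stays the same;
    only the construction of unselected output columns is skipped.
    """
    columns = {}
    n_rows = 0
    for line in lines:
        row = json.loads(line)
        for key, value in row.items():
            if selected is not None and key not in selected:
                continue
            # New columns are back-filled with nulls for earlier rows.
            columns.setdefault(key, [None] * n_rows).append(value)
        n_rows += 1
        # Pad columns that had no value in this row.
        for column in columns.values():
            if len(column) < n_rows:
                column.append(None)
    return columns

def leaf_count(value):
    """Count scalar values, descending into lists."""
    if isinstance(value, list):
        return sum(leaf_count(v) for v in value)
    return 1

# Assumed shape: a small boolean column B next to a wide list column A.
lines = [json.dumps({"B": i % 2 == 0, "A": list(range(500))}) for i in range(1000)]
full = materialize(lines)
only_b = materialize(lines, selected={"B"})
full_values = sum(leaf_count(v) for col in full.values() for v in col)
b_values = sum(leaf_count(v) for col in only_b.values() for v in col)
```

Here the full read materializes 501,000 values while the projected read materializes 1,000, even though both passes parse the same input.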

The big problem for us is a really odd corner case, which we did run into in one of our integration tests and which I am nervous we will run into with a customer at some point. When Spark writes out JSON data, by default it removes entries that are null. It is a space-saving optimization. So if we end up with a row that is all nulls, we end up with an output row of {}. This is not that uncommon as a JSON optimization. The problem shows up if all of the rows in a batch look this way.

{}
{}
{}

I don't think it is likely to happen for large runs of rows, but with Spark it is very possible to have a few rows at the end of a file that look like this, and because of splits they would end up in a single batch. CUDF is unable to parse that batch and hand us back the number of rows. The best that we could do as a workaround is to count the number of input rows before sending the data to CUDF and catch the exception after it happens. But I am not 100% sure that it will work in all cases, especially if CUDF supports comments, because Spark strips out lines in files that are entirely comments. Then we would not know how many rows there are without some help.
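A minimal sketch of that workaround, in plain Python, might look like the following. The parser here is a stand-in: `strict_parse` and `NoColumnsError` are hypothetical and only mimic a reader that fails on a batch with no columns; the real cuDF exception type is not stated in this thread. The row count assumes one record per line and no comment lines, which is exactly the assumption that may not hold.

```python
import json

class NoColumnsError(Exception):
    """Hypothetical stand-in for the failure on a batch with no columns."""

def count_rows(batch_text):
    """Count records as non-blank lines (assumes one record per line, no comments)."""
    return sum(1 for line in batch_text.splitlines() if line.strip())

def strict_parse(batch_text):
    """Stand-in reader that fails when no row contributes any column."""
    rows = [json.loads(line) for line in batch_text.splitlines() if line.strip()]
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    if not names:
        raise NoColumnsError("no columns found in input")
    return {name: [row.get(name) for row in rows] for name in names}

def read_with_fallback(batch_text, schema):
    """Count rows up front, then null-fill the schema if the parse fails."""
    expected_rows = count_rows(batch_text)
    try:
        parsed = strict_parse(batch_text)
    except NoColumnsError:
        parsed = {}
    return {name: parsed.get(name, [None] * expected_rows) for name in schema}

result = read_with_fallback("{}\n{}\n{}\n", ["a", "b"])
```

For the three-row batch of `{}` above, `result` holds two columns of three nulls each, which is the shape Spark expects.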

Again probably not the highest priority, but it does end up being rather hacky to work around a lack of this type of feature.

GregoryKimball commented 1 year ago

Linking this issue to #5712 - which seems to capture the empty row issue well.

Let's leave this issue to focus on the column projection optimization.

karthikeyann commented 17 hours ago

Do PRs https://github.com/rapidsai/cudf/pull/14996 and https://github.com/rapidsai/cudf/pull/17029 fix this issue?

GregoryKimball commented 17 hours ago

Closed by #14996