Open nvdbaranec opened 1 year ago
@nvdbaranec I'm curious if #13302 had any impact on the file used to generate the profile above.
@vuule recently conducted some experiments using an internal stream pool to hide latencies during column buffer allocation. Perhaps evaluating string types with larger column counts would show a bigger signal.
For Parquet files that contain very large schemas with strings (either large numbers of columns or large numbers of nested columns), we pay a very heavy price postprocessing the string data after the core decode kernels run.
Essentially, the "decode" process for strings just emits a large array of pointer/size pairs that are then passed to other cudf functions to reconstruct the actual columns. The problem is that we do this with no batching: each output string column results in an entire cudf function call (`make_strings_column`) with multiple internal kernel calls. In situations with thousands of columns, this gets very expensive.

In the image above, the green span represents the time spent in the decode kernel plus the time spent in all of the `make_strings_column` calls afterwards. The time is totally dominated by the many, many calls to `make_strings_column` (the red span). Ideally, we would have some kind of batched interface to `make_strings_column` (`make_strings_columns`?) that can do the work for the thousands of output columns coalesced into fewer kernels.

On a related note, the area under the blue line represents a similar problem involving preprocessing the file (thousands of calls to `thrust::reduce` and `thrust::exclusive_scan_by_key`). This has been largely addressed by https://github.com/rapidsai/cudf/pull/12931
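For readers unfamiliar with the keyed scan named above, here is a plain host-side C++ sketch of what `thrust::exclusive_scan_by_key` computes (an exclusive prefix sum that restarts at zero whenever the key changes); this is only an illustration of the semantics, not the device implementation, and the function name here is just a stand-in:

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch of thrust::exclusive_scan_by_key semantics: an
// exclusive prefix sum over `vals` that resets to 0 each time the
// corresponding entry in `keys` differs from the previous one. In the
// Parquet preprocessing path, each per-column scan was a separate call;
// keying the scan lets many columns share a single launch.
std::vector<int> exclusive_scan_by_key(std::vector<int> const& keys,
                                       std::vector<int> const& vals) {
    std::vector<int> out(vals.size());
    int running = 0;
    for (std::size_t i = 0; i < vals.size(); ++i) {
        if (i == 0 || keys[i] != keys[i - 1]) running = 0;  // new segment
        out[i] = running;
        running += vals[i];
    }
    return out;
}
```

For example, keys `{0,0,0,1,1}` with values `{1,2,3,4,5}` produce `{0,1,3,0,4}`: the running sum restarts when the key switches from 0 to 1.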
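To make the batching idea concrete, here is a host-side sketch of the shape such an interface could take. Everything here is illustrative: `str_view`, `strings_col`, and the loop-based `make_strings_columns` are stand-ins, not libcudf APIs, and a real device version would fuse the size scan and character gather across all columns rather than looping:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A decoded string as the decode kernel emits it: a pointer/size pair
// into the decoded page buffer (illustrative type, not libcudf's).
struct str_view {
    char const* ptr;
    std::size_t size;
};

// Per-column result mirroring a strings column: offsets plus a
// contiguous character buffer.
struct strings_col {
    std::vector<std::size_t> offsets;  // row count + 1 entries
    std::string chars;                 // all characters, concatenated
};

// Roughly what one make_strings_column call amounts to: a scan over the
// string sizes to produce offsets, then a gather of the characters. On
// the GPU each call costs several kernel launches, so thousands of
// columns mean thousands of launches.
strings_col make_strings_column(std::vector<str_view> const& views) {
    strings_col col;
    col.offsets.push_back(0);
    for (auto const& v : views) {
        col.chars.append(v.ptr, v.size);
        col.offsets.push_back(col.chars.size());
    }
    return col;
}

// Hypothetical batched entry point taking all columns at once. Here it
// is just a loop; the point of a real batched implementation would be
// to run the offset scan and character gather for every column in a
// fixed number of kernels, independent of column count.
std::vector<strings_col> make_strings_columns(
    std::vector<std::vector<str_view>> const& columns) {
    std::vector<strings_col> out;
    out.reserve(columns.size());
    for (auto const& views : columns) out.push_back(make_strings_column(views));
    return out;
}
```

The batched signature is the key design point: once the caller hands over every column's pointer/size pairs in one call, the implementation is free to flatten them, run one keyed scan for the offsets, and one gather for the characters.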