Open nvdbaranec opened 1 year ago
@nvdbaranec I'm curious if #13302 had any impact on the file used to generate the profile above.
@vuule recently conducted some experiments using an internal stream pool to hide latencies during column buffer allocation. Perhaps evaluating string types with larger column counts would show a bigger signal.
For Parquet files that contain very large schemas with strings (either large numbers of columns or large numbers of nested columns), we pay a very heavy price postprocessing the string data after the core decode kernels run.
Essentially, the "decode" process for strings just emits a large array of pointer/size pairs that are then passed to other cudf functions to reconstruct the actual columns. The problem is that we do this with no batching: each output string column results in an entire cudf function call (`make_strings_column`) with multiple internal kernel calls. In situations with thousands of columns, this gets very expensive.

In the image above, the green span represents the time spent in the decode kernel plus the time spent in all of the `make_strings_column` calls afterwards. The time is totally dominated by the many, many calls to `make_strings_column` (the red span). Ideally, we would have some kind of batched interface to `make_strings_column` (`make_strings_columns`?) that can do the work for the thousands of output columns coalesced into fewer kernels.

On a related note, the area under the blue line represents a similar problem involving preprocessing the file (thousands of calls to `thrust::reduce` and `thrust::exclusive_scan_by_key`). This has been largely addressed by https://github.com/rapidsai/cudf/pull/12931
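For readers unfamiliar with the keyed scan named above, here is a plain host-side C++ sketch of what `thrust::exclusive_scan_by_key` computes (an exclusive prefix sum that restarts at zero whenever the key changes); this is only an illustration of the semantics, not the device implementation, and the function name here is just a stand-in:

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch of thrust::exclusive_scan_by_key semantics: an
// exclusive prefix sum over `vals` that resets to 0 each time the
// corresponding entry in `keys` differs from the previous one. In the
// Parquet preprocessing path, each per-column scan was a separate call;
// keying the scan lets many columns share a single launch.
std::vector<int> exclusive_scan_by_key(std::vector<int> const& keys,
                                       std::vector<int> const& vals) {
    std::vector<int> out(vals.size());
    int running = 0;
    for (std::size_t i = 0; i < vals.size(); ++i) {
        if (i == 0 || keys[i] != keys[i - 1]) running = 0;  // new segment
        out[i] = running;
        running += vals[i];
    }
    return out;
}
```

For example, keys `{0,0,0,1,1}` with values `{1,2,3,4,5}` produce `{0,1,3,0,4}`: the running sum restarts when the key switches from 0 to 1.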
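To make the batching idea concrete, here is a host-side sketch of the shape such an interface could take. Everything here is illustrative: `str_view`, `strings_col`, and the loop-based `make_strings_columns` are stand-ins, not libcudf APIs, and a real device version would fuse the size scan and character gather across all columns rather than looping:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A decoded string as the decode kernel emits it: a pointer/size pair
// into the decoded page buffer (illustrative type, not libcudf's).
struct str_view {
    char const* ptr;
    std::size_t size;
};

// Per-column result mirroring a strings column: offsets plus a
// contiguous character buffer.
struct strings_col {
    std::vector<std::size_t> offsets;  // row count + 1 entries
    std::string chars;                 // all characters, concatenated
};

// Roughly what one make_strings_column call amounts to: a scan over the
// string sizes to produce offsets, then a gather of the characters. On
// the GPU each call costs several kernel launches, so thousands of
// columns mean thousands of launches.
strings_col make_strings_column(std::vector<str_view> const& views) {
    strings_col col;
    col.offsets.push_back(0);
    for (auto const& v : views) {
        col.chars.append(v.ptr, v.size);
        col.offsets.push_back(col.chars.size());
    }
    return col;
}

// Hypothetical batched entry point taking all columns at once. Here it
// is just a loop; the point of a real batched implementation would be
// to run the offset scan and character gather for every column in a
// fixed number of kernels, independent of column count.
std::vector<strings_col> make_strings_columns(
    std::vector<std::vector<str_view>> const& columns) {
    std::vector<strings_col> out;
    out.reserve(columns.size());
    for (auto const& views : columns) out.push_back(make_strings_column(views));
    return out;
}
```

The batched signature is the key design point: once the caller hands over every column's pointer/size pairs in one call, the implementation is free to flatten them, run one keyed scan for the offsets, and one gather for the characters.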