Open nvdbaranec opened 9 months ago
@mattahrens @abellina
@etseidl
This makes sense to me. Sounds beneficial even if we don't apply the pattern to all kernels. Could you help me understand the following?
The rle_stream class uses shared memory, so it is a big advantage to be able to have kernels that don't need a given feature (say, list decoding) to be able to use less.
Is this explaining the benefit over rle_stream
?
As part of the drive towards implementing the micro-kernel parquet decoding strategy, we would like to start centralizing the core parquet decoding loop into a generic templated implementation that can be reused. At the high level, all of various parquet kernels are structured similar to this:
The various *_stream.decode() functions are the key bottleneck in decoding parquet data. At the moment, the kernels we have mostly utilize the older/slower way of decoding these streams. The
rle_stream
class was developed to do this in a more parallel (and more confiurable) way, but only a few kernels use it at the moment because it does not currently handle dictionaries. The work for that is underway and very close to completion (https://github.com/rapidsai/cudf/issues/14950)decode_general_outputs
is a function that produces validity, list offset information and the mapping of source data (location in the parquet data page) to destination data (location in the final cudf column). The amount of work this function has to do varies greatly based on the characteristics of the input data - nullability, presence of lists, etc.PROCESS is something that varies from kernel-to-kernel. Essentially, the user-provided function that actually does the final data decoding.
We would like to implement this high level loop as a templated function that can be tailored to produce multiple, more optimal kernels based on they key data characteristics. For example:
There are several reasons to do this:
rle_stream
class uses shared memory, so it is a big advantage to be able to have kernels that don't need a given feature (say, list decoding) to be able to use less.The first candidates for using this would be two new micro-kernels: Fixed-width and fixed-width-with-dictionaries (the non-list case for both of them). We would like to get these in for 24.04 and then later on we can start refactoring the larger mass of existing kernels (especially the general-purpose
gpuDecodePageData
andgpuDecodeStringPageData