Open abellina opened 10 months ago
It would be interesting to compare the `rle_stream` approach to dictionary decoding with the approach in `totalDictEntriesSize`. The latter uses all warps for decoding work and doesn't suffer from load-balancing problems between warps, but it might be harder to save state and resume in a batch-processing application.
I've been looking at the `rle_stream` class in order to decode dictionary streams in addition to repetition streams in the parquet decoder. This is a component of the work that @nvdbaranec has done here https://github.com/rapidsai/cudf/pull/13622, where we'd like to separate out at least a "fixed width" and a "fixed width dictionary encoded" pair of kernels.

With the changes in `rle_stream`, the core of the logic is able to use more threads for the RLE stream decoder. Specifically, a first warp is in charge of generating a set of runs, and the other warps each take one of the runs and decode them in parallel. As part of the micro kernel work, we feel that focusing on the `rle_stream` decoder and its effects on `gpuComputeStringPageBounds`, `gpuComputePageSizes`, and its use in the new fixed kernels is a good first step to get the micro kernel work merged.

This issue, then, is to get a new `rle_stream` into cuDF that can handle both repetition AND dictionary streams, and show that the performance is the same or better than what we have now. We hope that having this decoder will help centralize code and clean up the parquet code base.
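
To make the run-generation / run-decoding split described above concrete, here is a minimal host-side C++ sketch. It is not the cuDF implementation: `Run`, `generate_runs`, `decode_run` and `decode_all` are hypothetical names made up for illustration, and it assumes the Parquet RLE/bit-packed hybrid encoding with a bit width between 1 and 32. The serial header walk plays the role of the first warp, and each `decode_run` call touches only its own run, so it is the piece that could be handed to a separate warp.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical run descriptor; not the layout used by cuDF's rle_stream.
struct Run {
  bool is_literal;    // true: bit-packed values, false: one repeated value
  std::size_t count;  // number of values in this run
  std::size_t data;   // byte offset of the run payload in the input
  std::size_t out;    // index of the run's first value in the output
};

// Stage 1 (the "first warp" role): walk the run headers serially.
// Assumes 1 <= bit_width <= 32.
std::vector<Run> generate_runs(const std::vector<std::uint8_t>& in,
                               int bit_width, std::size_t num_values)
{
  std::vector<Run> runs;
  const std::size_t value_bytes = (bit_width + 7) / 8;
  std::size_t pos = 0, produced = 0;
  while (pos < in.size() && produced < num_values) {
    // The run header is a ULEB128 varint.
    std::uint32_t header = 0;
    int shift = 0;
    while (pos < in.size() && shift < 32) {
      const std::uint8_t b = in[pos++];
      header |= static_cast<std::uint32_t>(b & 0x7f) << shift;
      shift += 7;
      if ((b & 0x80) == 0) break;
    }
    // Low bit set: bit-packed groups of 8 values. Low bit clear: repeated value.
    const bool literal = (header & 1) != 0;
    std::size_t count = literal ? std::size_t(header >> 1) * 8 : std::size_t(header >> 1);
    const std::size_t payload = literal ? std::size_t(header >> 1) * bit_width : value_bytes;
    if (count == 0 || pos + payload > in.size()) break;  // malformed or truncated input
    count = std::min(count, num_values - produced);
    runs.push_back({literal, count, pos, produced});
    pos += payload;
    produced += count;
  }
  return runs;
}

// Stage 2 (the "other warps" role): decode one run, independent of all others.
void decode_run(const std::vector<std::uint8_t>& in, int bit_width,
                const Run& r, std::vector<std::uint32_t>& out)
{
  if (!r.is_literal) {
    // Repeated value stored little-endian in ceil(bit_width / 8) bytes.
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < std::size_t((bit_width + 7) / 8); ++i)
      v |= static_cast<std::uint32_t>(in[r.data + i]) << (8 * i);
    for (std::size_t i = 0; i < r.count; ++i) out[r.out + i] = v;
    return;
  }
  const std::uint64_t mask = (std::uint64_t(1) << bit_width) - 1;
  for (std::size_t i = 0; i < r.count; ++i) {
    const std::size_t bit = i * bit_width;
    // Gather up to 5 bytes so a 32-bit value at any bit offset fits the window.
    std::uint64_t window = 0;
    for (std::size_t b = 0; b < 5; ++b) {
      const std::size_t idx = r.data + bit / 8 + b;
      if (idx < in.size()) window |= static_cast<std::uint64_t>(in[idx]) << (8 * b);
    }
    out[r.out + i] = static_cast<std::uint32_t>((window >> (bit % 8)) & mask);
  }
}

// Driver: every loop iteration is independent, which is what allows the
// per-run work to be spread across warps.
std::vector<std::uint32_t> decode_all(const std::vector<std::uint8_t>& in,
                                      int bit_width, std::size_t num_values)
{
  const std::vector<Run> runs = generate_runs(in, bit_width, num_values);
  const std::size_t total = runs.empty() ? 0 : runs.back().out + runs.back().count;
  std::vector<std::uint32_t> out(total);
  for (const Run& r : runs) decode_run(in, bit_width, r, out);
  return out;
}
```

The same two functions cover both uses mentioned in this issue, since repetition levels and dictionary indices differ only in the bit width passed in. The sketch also hints at the load-balancing concern raised in the comment above: runs can have very different lengths, so a one-run-per-warp assignment can leave some warps idle while others are still working.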