rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Implement a templated parquet decoding kernel suitable for reuse in micro-kernel optimization approach. #14953

Open nvdbaranec opened 9 months ago

nvdbaranec commented 9 months ago

As part of the drive towards implementing the micro-kernel parquet decoding strategy, we would like to start centralizing the core parquet decoding loop into a generic templated implementation that can be reused. At a high level, all of the various parquet kernels are structured similarly to this:

kernel(PageInfo p)
{
    // page setup, bounds checking (for skip_rows/num_rows), etc.
    setup_code();

    while (there are still values to decode in p) {
        def_levels   = def_stream.decode(def_levels);
        rep_levels   = p.has_lists ? rep_stream.decode(rep_levels)    : none;
        dict_indices = p.has_dict  ? dict_stream.decode(dict_indices) : none;
        decode_general_outputs(def_levels, rep_levels, dict_indices);

        PROCESS(p, def_levels, rep_levels, dict_indices);
    }
}

The various *_stream.decode() functions are the key bottleneck in decoding parquet data. At the moment, most of our kernels use the older/slower way of decoding these streams. The rle_stream class was developed to do this in a more parallel (and more configurable) way, but only a few kernels use it at the moment because it does not currently handle dictionaries. The work for that is underway and very close to completion (https://github.com/rapidsai/cudf/issues/14950).
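For context, here is a minimal sketch of the wire format these streams decode (the Parquet RLE/bit-packed hybrid encoding; this is illustrative, not the cudf rle_stream API). Each run starts with a ULEB128 header whose low bit selects the run type; the parallelism rle_stream adds comes from processing these runs with multiple warps rather than serially.

#include <cstdint>

struct rle_run {
  bool is_bitpacked;    // header LSB == 1
  uint32_t count;       // number of values in this run
  uint8_t const* data;  // start of the run payload
};

// Read one ULEB128-encoded run header and describe the run that follows.
inline rle_run next_run(uint8_t const*& cur, uint8_t const* end)
{
  uint32_t header = 0;
  int shift       = 0;
  while (cur < end) {
    uint8_t const b = *cur++;
    header |= uint32_t(b & 0x7f) << shift;
    shift += 7;
    if ((b & 0x80) == 0) { break; }
  }
  bool const bitpacked = (header & 1) != 0;
  // Bit-packed runs encode groups of 8 values; RLE runs encode a repeat count.
  uint32_t const count = bitpacked ? (header >> 1) * 8 : (header >> 1);
  return rle_run{bitpacked, count, cur};
}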

decode_general_outputs is a function that produces validity, list offset information, and the mapping of source data (location in the parquet data page) to destination data (location in the final cudf column). The amount of work this function has to do varies greatly with the characteristics of the input data: nullability, presence of lists, etc.
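As a minimal illustration of the simplest case this covers (a flat, non-list column): a value is valid exactly when its definition level equals the column's maximum definition level, while lists additionally need repetition levels to derive offsets and the row mapping. The helper below is hypothetical, not cudf code.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: per-value validity for a flat (non-list) column.
// A value is null when its definition level is below the max level.
std::vector<bool> compute_validity(std::vector<uint16_t> const& def_levels,
                                   uint16_t max_def_level)
{
  std::vector<bool> valid(def_levels.size());
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    valid[i] = (def_levels[i] == max_def_level);
  }
  return valid;
}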

PROCESS varies from kernel to kernel; it is essentially the user-provided function that performs the final data decoding.
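For instance, a PROCESS functor for plain fixed-width data could be little more than a strided copy through the source-to-destination mapping. The signature below is an assumption for illustration, not cudf's actual interface.

// Illustrative PROCESS functor for fixed-width data: copy each decoded value
// to the destination slot computed by decode_general_outputs.
struct fixed_width_process {
  template <typename T>
  __device__ void operator()(T const* page_values,
                             int const* src_to_dst,  // source-to-destination map
                             T* out,
                             int count) const
  {
    // one block cooperatively copies the batch
    for (int i = threadIdx.x; i < count; i += blockDim.x) {
      out[src_to_dst[i]] = page_values[i];
    }
  }
};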

We would like to implement this high-level loop as a templated function that can be tailored to produce multiple, more optimal kernels based on the key data characteristics. For example:

template <// page data characteristics
          bool nullable,
          bool has_lists,
          bool has_dictionary,
          // etc.

          // parameters which can be tuned
          int decode_buffer_size,
          int decode_warp_count,
          // etc.

          // user-provided PROCESS functor
          ProcessFunc Proc>
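
A hedged sketch of how such a specialization could work in practice (all names illustrative, not the actual cudf implementation): feature branches become if constexpr, so each trait combination compiles to a kernel that pays, in code and in shared memory, only for the features it uses.

#include <cuda_runtime.h>

struct PageInfo;  // stand-in for cudf's page descriptor

template <bool nullable,
          bool has_lists,
          bool has_dictionary,
          int  decode_buffer_size,
          int  decode_warp_count,
          typename ProcessFunc>
__global__ void decode_page(PageInfo const* pages, ProcessFunc proc)
{
  // Buffers for optional features collapse to size 1 when unused.
  __shared__ int def_levels[decode_buffer_size];
  __shared__ int rep_levels[has_lists ? decode_buffer_size : 1];
  __shared__ int dict_indices[has_dictionary ? decode_buffer_size : 1];

  // The shared decode loop (elided) would look like:
  //   while (values remain) {
  //     decode def_levels;
  //     if constexpr (has_lists)      decode rep_levels;
  //     if constexpr (has_dictionary) decode dict_indices;
  //     decode_general_outputs(...);
  //     proc(...);
  //   }
}

Host code would then dispatch to the specialization matching the page's characteristics at launch time.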

There are several reasons to do this:

The rle_stream class uses shared memory, so it is a big advantage to be able to have kernels that don't need a given feature (say, list decoding) to be able to use less.

The first candidates for using this would be two new micro-kernels: fixed-width and fixed-width-with-dictionaries (the non-list case for both of them). We would like to get these in for 24.04, and later on we can start refactoring the larger mass of existing kernels (especially the general-purpose gpuDecodePageData and gpuDecodeStringPageData kernels).

nvdbaranec commented 9 months ago

@mattahrens @abellina

nvdbaranec commented 9 months ago

@etseidl

vuule commented 9 months ago

This makes sense to me. Sounds beneficial even if we don't apply the pattern to all kernels. Could you help me understand the following?

"The rle_stream class uses shared memory, so it is a big advantage to be able to have kernels that don't need a given feature (say, list decoding) to be able to use less."

Is this explaining the benefit over rle_stream?