rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/

[FEA] Improved performance for strings finder_warp_parallel_fn / contains_warp_parallel_fn kernels #15405

Open jlowe opened 7 months ago

jlowe commented 7 months ago

Is your feature request related to a problem? Please describe.
A customer has a query that performs many string find/contains operations, often on long strings. Nsight traces show most of the GPU time being spent in finder_warp_parallel_fn or contains_warp_parallel_fn, significantly more than in Parquet decompress and decode, which are typically the top GPU kernels.

Describe the solution you'd like
Improved performance for these kernels.

mythrocks commented 7 months ago

This is definitely more in @davidwendt's wheelhouse than in mine. I'm trying to familiarize myself with finder_warp_parallel_fn<>.

GregoryKimball commented 7 months ago

Perhaps we need to continue the long strings improvements for find that were started in #13226 (also see story issue #13048). Would you please share any information you might have on the distribution of string lengths?

mythrocks commented 7 months ago

I'll explore the data and post more info here.

I'm still looking at the call stack, etc. There are certainly wins to be had by switching the query from using strings::find() to using strings::contains().

The user query is of the form:

    CASE WHEN instr(lower(my_str), 'my_sub_str') > 0 THEN ...

I'd like to check the feasibility of translating instr(..., ...) > 0 to use strings::contains() in the Spark Plugin.
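
For illustration, a minimal sketch of that rewrite on the libcudf side (the helper name case_when_predicate and the hard-coded literal are mine, purely illustrative): rather than materializing instr() positions with strings::find() and comparing them against 0, call strings::contains() directly and get the BOOL8 predicate column in one step.

    #include <cudf/column/column.hpp>
    #include <cudf/scalar/scalar.hpp>
    #include <cudf/strings/find.hpp>
    #include <cudf/strings/strings_column_view.hpp>

    #include <memory>

    // Hypothetical helper: computes the predicate for one CASE WHEN clause.
    // 'lowered_input' is the string column after lower() has been applied.
    std::unique_ptr<cudf::column> case_when_predicate(
      cudf::strings_column_view const& lowered_input)
    {
      cudf::string_scalar target{"my_sub_str"};  // the literal from the query
      // One call, one BOOL8 column: true where the row contains the target.
      return cudf::strings::contains(lowered_input, target);
    }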

mythrocks commented 7 months ago

@jlowe mentioned this: @revans2 already has a fix/workaround in place in Spark-RAPIDS to detect queries of the form instr(..., ...) > 0 and convert them into a call to strings::contains(). It appears that this isn't performing much faster than strings::find() does. :/

mythrocks commented 7 months ago

I have done some exploration of the data in question and of the query.

  1. The query has about 25 invocations of instr, which amounts to 25 calls to strings::contains() per Spark input-row batch.
  2. Exploring the input column for strings::contains(): out of a sample of 102M rows, 99.98% were "_", i.e. nominally blank.
  3. The input column had only a handful of unique values (in my sample). Here's a breakdown of their counts and lengths. (I have redacted the actual values.)
    +--------------------+-----------------------+---------+
    |          input_str | char_length(input_str)|   count |
    +--------------------+-----------------------+---------+
    |                  _ |                      1|102542310|
    | AAAAAAAAAAAAAAAAAA |                     54|    13611|
    | BBBBBBBBBBBBBBBBBB |                     22|     1404|
    | XXXXXXXXXXXXXXXXXX |                     17|      702|
    | YYYYYYYYYYYYYYYYYY |                    149|      390|
    | ZZZZZZZZZZZZZZZZZZ |                    104|       39|
    +--------------------+-----------------------+---------+
  4. The column seems to have only ASCII characters.
  5. The average string length is 1.008, vastly skewed by the 99.98% "_" value. Ignoring the "_" rows, the average length is 52.02.

I'm not sure why or how the finder_warp_parallel_fn or contains_warp_parallel_fn kernels ended up on the slow path, given that the average string length does not seem to exceed the warp-parallelism threshold (64).

I think @revans2's changes ensure that strings::contains is called correctly. Below that threshold, it should dispatch to the string-per-thread version.
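
For context, this is roughly the shape of the average-length dispatch under discussion (a simplified sketch reconstructed from this thread; the constant name AVG_CHAR_BYTES_THRESHOLD and the helper below are not copied from libcudf's source, and API details such as chars_size(stream) may differ across versions):

    #include <cudf/strings/strings_column_view.hpp>

    #include <rmm/cuda_stream_view.hpp>

    #include <cstdint>

    // Average bytes per non-null string decides which kernel runs.
    constexpr int64_t AVG_CHAR_BYTES_THRESHOLD = 64;

    bool use_warp_parallel(cudf::strings_column_view const& input,
                           rmm::cuda_stream_view stream)
    {
      auto const non_null_rows = input.size() - input.null_count();
      if (non_null_rows <= 0) { return false; }
      // chars_size() is the total bytes across all strings in the column.
      return (input.chars_size(stream) / non_null_rows) >=
             AVG_CHAR_BYTES_THRESHOLD;
    }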

I haven't run a profile on the sample yet. I'll try to get that going tomorrow.

mythrocks commented 7 months ago

I'm working on a block-parallel version of contains() that looks a lot like the warp-parallel one. Testing it out now.
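
The shape of the idea, as a self-contained CUDA sketch (deliberately simplified: raw offsets/chars arrays instead of cuDF's column machinery, no null handling, and not the actual patch): one thread block per string, with threads striding over candidate match positions.

    #include <cuda_runtime.h>
    #include <cstdio>

    // One thread block per string; threads stride over candidate start
    // positions. A shared-memory flag lets the block stop early.
    __global__ void contains_block_parallel(char const* chars,
                                            int const* offsets,
                                            int num_strings,
                                            char const* target,
                                            int target_len,
                                            bool* results)
    {
      int const str_idx = blockIdx.x;
      if (str_idx >= num_strings) return;

      __shared__ int found;
      if (threadIdx.x == 0) found = 0;
      __syncthreads();

      int const begin = offsets[str_idx];
      int const len   = offsets[str_idx + 1] - begin;

      // Thread i tests positions i, i + blockDim.x, i + 2*blockDim.x, ...
      for (int pos = threadIdx.x; pos + target_len <= len; pos += blockDim.x) {
        if (found) break;  // best-effort early exit (benign race)
        bool match = true;
        for (int k = 0; k < target_len; ++k) {
          if (chars[begin + pos + k] != target[k]) { match = false; break; }
        }
        if (match) atomicExch(&found, 1);
      }
      __syncthreads();
      if (threadIdx.x == 0) results[str_idx] = (found != 0);
    }

    int main()
    {
      // Two toy strings packed in cuDF's offsets/chars style.
      char const h_chars[]  = "hello worldgoodbye";
      int const h_offsets[] = {0, 11, 18};  // "hello world", "goodbye"
      int const num_strings = 2;
      char const h_target[] = "lo w";
      int const target_len  = 4;

      char *d_chars, *d_target;
      int* d_offsets;
      bool* d_results;
      cudaMalloc(&d_chars, sizeof(h_chars));
      cudaMalloc(&d_offsets, sizeof(h_offsets));
      cudaMalloc(&d_target, target_len);
      cudaMalloc(&d_results, num_strings * sizeof(bool));
      cudaMemcpy(d_chars, h_chars, sizeof(h_chars), cudaMemcpyHostToDevice);
      cudaMemcpy(d_offsets, h_offsets, sizeof(h_offsets), cudaMemcpyHostToDevice);
      cudaMemcpy(d_target, h_target, target_len, cudaMemcpyHostToDevice);

      contains_block_parallel<<<num_strings, 256>>>(
        d_chars, d_offsets, num_strings, d_target, target_len, d_results);

      bool h_results[num_strings];
      cudaMemcpy(h_results, d_results, sizeof(h_results), cudaMemcpyDeviceToHost);
      printf("%d %d\n", h_results[0], h_results[1]);  // expect: 1 0
      return 0;
    }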

mythrocks commented 7 months ago

As an aside, I should mention that the data distributions I mentioned above can be ignored, for the moment. The sample is not representative of the user's data.

mythrocks commented 7 months ago

I have a naive block-parallel implementation here.

This change switches to block-parallel if the average string length reaches 256 or 512. (I've tried both.)

Here are some results from running STRINGS_BENCH with and without block-parallel: STRINGS_BENCH.block-parallel.256.log, STRINGS_BENCH.warp-parallel.log

It appears that benefits aren't apparent unless the average string sizes reach around 4K-8K. And that gets slightly worse at higher row counts:

    +-----------+-------------+-------------------------+--------------------------+
    | Row Count | String Size | Warp Parallel Time (ms) | Block Parallel Time (ms) |
    +-----------+-------------+-------------------------+--------------------------+
    |      4096 |         256 |                   0.034 |                    0.035 |
    |      4096 |         512 |                   0.039 |                    0.040 |
    |      4096 |        1024 |                   0.048 |                    0.056 |
    |      4096 |        2048 |                   0.070 |                    0.069 |
    |      4096 |        4096 |                   0.110 |                    0.094 |
    |      4096 |        8192 |                   0.183 |                    0.137 |
    |    262144 |         256 |                   0.313 |                    0.325 |
    |    262144 |         512 |                   0.503 |                    0.515 |
    |    262144 |        1024 |                   0.902 |                    1.6   |
    |    262144 |        2048 |                    1.74 |                     2.25 |
    |    262144 |        4096 |                    3.47 |                     3.53 |
    +-----------+-------------+-------------------------+--------------------------+

mythrocks commented 7 months ago

From exploring the customer's data, it appears that the majority of the search input strings are under 256 bytes long, although there are outliers (some being 64K long). The average length amounts to 145. I don't think going wider than 256 threads per block is reasonable.

I've gotten some nsys traces for the block-parallel implementation and compared them to the warp-parallel version. It appears that block-parallel lags warp-per-string, even for short search strings. My test searched 4 million input strings with an average length of 256.

I've also run ncu (the Nsight Compute CLI). I will attach the findings here; it appears that the block-parallel kernel takes longer than warp-per-string (2.76 ms vs ~1 ms), despite showing higher Compute (SM) Throughput (52.8% vs 39%).

For block-parallel:

    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         6.48
    SM Frequency            cycle/nsecond         1.45
    Elapsed Cycles                  cycle    3,991,213
    Memory Throughput                   %        50.92
    DRAM Throughput                     %         7.40
    Duration                      msecond         2.75
    L1/TEX Cache Throughput             %        54.65
    L2 Cache Throughput                 %         2.78
    SM Active Cycles                cycle 3,978,530.51
    Compute (SM) Throughput             %        52.83
    ----------------------- ------------- ------------

    OPT   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
          of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
          latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   256
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                485,000
    Registers Per Thread             register/thread              18
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block        byte/block               0
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block              17
    Threads                                   thread     124,160,000
    Waves Per SM                                            1,684.03
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           10
    Block Limit Shared Mem                block          128
    Block Limit Warps                     block            4
    Theoretical Active Warps per SM        warp           32
    Theoretical Occupancy                     %          100
    Achieved Occupancy                        %        85.35
    Achieved Active Warps Per SM           warp        27.31
    ------------------------------- ----------- ------------

    OPT   Estimated Speedup: 14.65%
          This kernel's theoretical occupancy is not impacted by any block limit. The difference between calculated
          theoretical (100.0%) and measured achieved occupancy (85.4%) can be the result of warp scheduling overheads
          or workload imbalances during the kernel execution. Load imbalances can occur between warps within a block
          as well as across blocks of the same kernel. See the CUDA Best Practices Guide
          (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
          optimizing occupancy.

For warp-per-string:

    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         6.46
    SM Frequency            cycle/nsecond         1.44
    Elapsed Cycles                  cycle    1,445,593
    Memory Throughput                   %        35.02
    DRAM Throughput                     %        20.44
    Duration                      usecond       998.78
    L1/TEX Cache Throughput             %        35.99
    L2 Cache Throughput                 %         7.33
    SM Active Cycles                cycle 1,438,799.26
    Compute (SM) Throughput             %        39.00
    ----------------------- ------------- ------------

    OPT   This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance
          of this device. Achieved compute throughput and/or memory bandwidth below 60.0% of peak typically indicate
          latency issues. Look at Scheduler Statistics and Warp State Statistics for potential reasons.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   256
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                 60,625
    Registers Per Thread             register/thread              18
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block        byte/block               0
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    Threads                                   thread      15,520,000
    Waves Per SM                                              210.50
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           10
    Block Limit Shared Mem                block           16
    Block Limit Warps                     block            4
    Theoretical Active Warps per SM        warp           32
    Theoretical Occupancy                     %          100
    Achieved Occupancy                        %        91.36
    Achieved Active Warps Per SM           warp        29.24
    ------------------------------- ----------- ------------

    INF   This kernel's theoretical occupancy is not impacted by any block limit.

I wonder if I might be missing a trick here, with the block-parallel implementation.

cc @davidwendt, @nvdbaranec, @hyperbolic2346, whom I've been consulting on this.

mythrocks commented 7 months ago

I've generated a local dataset with search-key distributions much like those of the reported slow case. (This includes the order of the if-else clauses, with a similar match rate.)

At 4M rows, with an average string length of 256 and search keys averaging 12 characters, the total runtimes are a near match. It doesn't look like the kernel runtimes have an appreciable effect on the total runtime. If there's anything afflicting the block-per-string approach, it might also be affecting the existing warp-per-string.

Nsight Compute analysis did flag the following warning regarding Long Scoreboard Stalls:

On average, each warp of this kernel spends 11.7 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. Find the instruction producing the data being waited upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data accesses verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit rates by increasing data locality (coalescing), or by changing the cache configuration. Consider moving frequently used data to shared memory. This stall type represents about 63.5% of the total average of 18.4 cycles between issuing two instructions.

I'm trying to understand what can be changed here, but I'm wondering if we should be considering an algorithmic fix:

  1. We might consider what @jlowe has mentioned vis-a-vis employing short-circuit evaluation of CASE WHEN (see the sketch after this list).
  2. A longer-term change might involve pre-processing the input strings for searches.
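
As a rough illustration of the short-circuit idea in point 1 (the helper name short_circuit_step is mine, and this is a sketch of the concept against the public libcudf API, not a proposed patch): each CASE WHEN clause searches only the rows that no earlier clause matched, shrinking the input to every subsequent strings::contains() call.

    #include <cudf/column/column.hpp>
    #include <cudf/scalar/scalar.hpp>
    #include <cudf/stream_compaction.hpp>
    #include <cudf/strings/find.hpp>
    #include <cudf/strings/strings_column_view.hpp>
    #include <cudf/table/table.hpp>
    #include <cudf/table/table_view.hpp>
    #include <cudf/unary.hpp>

    #include <memory>

    // Evaluate one CASE WHEN clause, then drop the rows it matched so the
    // next clause has fewer rows to search.
    std::unique_ptr<cudf::table> short_circuit_step(
      cudf::table_view const& remaining,  // rows unmatched by earlier clauses
      cudf::size_type strings_col_idx,    // which column holds the strings
      cudf::string_scalar const& target)  // this clause's sub-string
    {
      auto const strings =
        cudf::strings_column_view{remaining.column(strings_col_idx)};
      auto const matched     = cudf::strings::contains(strings, target);
      auto const not_matched =
        cudf::unary_operation(matched->view(), cudf::unary_operator::NOT);
      return cudf::apply_boolean_mask(remaining, not_matched->view());
    }

A real implementation would also need to carry the original row indices of the remaining rows, so each clause's results can be scattered back into the final CASE output.
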
mythrocks commented 7 months ago

P.S. I think I left the impression that the strings::contains kernel is "slow" in absolute terms. That isn't accurate.

More precisely, the profiles of the Spark tasks indicate that the contains kernel is being called 30K times per task (as a consequence of how the user query is written), presumably around 25 times per input batch (once per CASE WHEN clause). Each call runs in milliseconds, but the sheer number of calls pushes the total time spent in the kernel to several seconds.

Even a small improvement to the kernel is likely to amplify, at that scale.
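
For a sense of scale (the per-call time here is an assumed illustrative figure, not a measurement): at 0.2 ms per call, 30,000 calls amount to 30,000 x 0.2 ms = 6 seconds of kernel time per task, so even a 10% kernel speedup would return roughly 0.6 seconds per task.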

mythrocks commented 6 months ago

The first approach (processing 1 string/threadblock instead of 1 string/warp) was a bust. At the user's average string size of 144 characters, it appeared that too many threads in the block had too little to do.

The second approach (searching for N target sub-strings in the same kernel, instead of launching the single contains_warp_parallel_fn kernel N times) should have reduced the processing time by amortizing kernel-launch costs. This also seems to have been a bust: tests at the user site indicated it ran slightly slower as well. (A sketch of the idea follows.)
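
For reference, the shape of that multi-target idea as a compilable CUDA sketch (simplified raw offsets/chars layout, no null handling; the kernel name and signature here are illustrative, not the actual implementation): each warp still owns one input string, but loops over all N targets within a single launch.

    #include <cuda_runtime.h>

    // One warp per input string; all 32 lanes of a warp share the same
    // str_idx, so the warp enters and exits together. Each lane strides
    // over candidate positions; __any_sync reduces across the warp.
    // 'results' is laid out as num_targets x num_strings, row-major.
    __global__ void multi_contains_warp_parallel(
      char const* chars, int const* offsets, int num_strings,
      char const* const* targets, int const* target_lens, int num_targets,
      bool* results)
    {
      int const lane    = threadIdx.x % 32;
      int const str_idx = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
      if (str_idx >= num_strings) return;

      int const begin = offsets[str_idx];
      int const len   = offsets[str_idx + 1] - begin;

      for (int t = 0; t < num_targets; ++t) {
        int const tlen = target_lens[t];
        bool found     = false;
        for (int pos = lane; pos + tlen <= len; pos += 32) {
          bool match = true;
          for (int k = 0; k < tlen; ++k) {
            if (chars[begin + pos + k] != targets[t][k]) { match = false; break; }
          }
          if (match) { found = true; break; }
        }
        // Any lane with a match marks the whole warp's result for target t.
        found = __any_sync(0xffffffff, found);
        if (lane == 0) results[t * num_strings + str_idx] = found;
      }
    }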

The current thought (hat tip to @nvdbaranec) is that this might have something to do with null string inputs. It's possible that GPU occupancy drops when there are more null input rows than non-null: the threads assigned to null rows exit early, but their blocks' resources are held until the non-null threads complete.
It might be possible to increase occupancy by reducing the threads-per-block for the kernel launch.
I'm awaiting clarification on the user's actual null counts. In the meantime, I'm testing the behaviour with nulls separately.

mythrocks commented 6 months ago

    +-----------------------------------------------+----------+-----------+-----------+------------------+------------------+-----------+
    | Null distribution (times in ms)                | 0% nulls | 30% nulls | 50% nulls | 3 nulls out of 4 | 7 nulls out of 8 | 90% nulls |
    +-----------------------------------------------+----------+-----------+-----------+------------------+------------------+-----------+
    | multi_contains_warp_parallel, 256 threads/blk  |    11.13 |     10.04 |      8.27 |             4.38 |             3.4  |      3.18 |
    | 5 x contains_warp_parallel, 256 threads/blk    |    10.64 |      9.6  |      7.87 |             4.18 |              ??? |       ??? |
    | multi_contains_warp_parallel, 32 threads/blk   |    18.68 |     13.76 |     10.51 |             6.74 |             4.69 |      4.29 |
    | 5 x contains_warp_parallel, 32 threads/blk     |    18.08 |     13.24 |      9.04 |             6.3  |              ??? |       ??? |
    +-----------------------------------------------+----------+-----------+-----------+------------------+------------------+-----------+

Across a variety of null distributions, the fastest way to find 5 sub-strings in 1M input rows seems to be calling contains_warp_parallel_fn 5 times.

It appears that the null-row theory isn't completely accurate. :/