inducer opened this issue 3 years ago
Still experimenting with this, but so far I'm getting the sense that it's something other than the indirection that's causing `resample_by_picking` to be slow.
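For context, the kernel in question does roughly the following (paraphrased from memory of the meshmode/grudge source; take the exact names here as assumptions rather than the verbatim code):

```python
# Paraphrased sketch of the resampling kernel *with* indirection:
# element and node indices are gathered through index arrays.
knl = make_loopy_program(
    """{[iel, idof]:
        0<=iel<nelements and
        0<=idof<n_to_nodes}""",
    "result[to_element_indices[iel], idof] = "
    "ary[from_element_indices[iel], pick_list[idof]]",
    name="resample_by_picking")
```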
First I tried pre-sorting by `batch.to_element_indices`, and that didn't seem to have any effect. Next I tried progressively stripping down the loop (ignoring correctness) to see whether I could simplify it enough to make it disappear from the profile results. I stripped it down to this:
```python
knl = make_loopy_program(
    """{[iel, idof]:
        0<=iel<nelements and
        0<=idof<n_to_nodes}""",
    "result[iel, idof] = ary[iel, idof]",
    [
        lp.GlobalArg("result", None,
            shape="nelements_result, n_to_nodes",
            offset=lp.auto),
        lp.GlobalArg("ary", None,
            shape="nelements_vec, n_from_nodes",
            offset=lp.auto),
        lp.ValueArg("nelements_result", np.int32),
        lp.ValueArg("nelements_vec", np.int32),
        lp.ValueArg("n_from_nodes", np.int32),
        "...",
    ],
    name="resample_by_picking")
```
(i.e., no indirection), and it still looks about the same in the profiling results.

(Note: `nelements` here should be the number of entries in `batch.to_element_indices`, but looking at the generated code I can't tell if it is. More on this below.)
Before:
```
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   16.16%  149.58ms    908  164.73us  17.120us  498.11us  diff
                   15.78%  146.09ms   2560  57.065us  4.6710us  123.14us  resample_by_picking
                   12.37%  114.46ms   2499  45.802us  5.5670us  92.512us  multiply
                   11.44%  105.84ms    493  214.68us  106.69us  482.94us  grudge_assign_0
                   10.60%  98.111ms   2001  49.031us  5.5670us  92.192us  axpbyz
                    7.81%  72.307ms    960  75.319us  55.200us  124.96us  resample_by_mat
                    7.76%  71.836ms   2228  32.242us  3.3280us  66.976us  axpb
                    7.44%  68.819ms     72  955.81us  1.3440us  3.0474ms  [CUDA memcpy DtoH]
                    7.09%  65.586ms    160  409.91us  395.33us  465.12us  face_mass
                    1.46%  13.493ms    235  57.417us  3.8720us  68.735us  [CUDA memcpy DtoD]
                    0.60%  5.5327ms     50  110.65us  106.78us  124.29us  grudge_assign_1
                    0.48%  4.4395ms     40  110.99us  106.98us  124.26us  grudge_assign_2
                    0.41%  3.8396ms    129  29.764us  1.0240us  518.59us  [CUDA memcpy HtoD]
                    0.35%  3.2791ms     12  273.25us  14.016us  482.05us  nodes_0
                    0.14%  1.3395ms      6  223.25us  14.368us  481.44us  actx_special_sqrt
                    0.03%  292.25us    323     904ns     832ns  1.6640us  [CUDA memset]
                    0.03%  276.19us      6  46.031us  6.4000us  85.567us  divide
                    0.02%  213.18us     10  21.318us  20.960us  21.792us  reduce_kernel_stage1
                    0.01%  124.99us      1  124.99us  124.99us  124.99us  actx_special_exp
                    0.01%  72.831us     20  3.6410us  3.2960us  4.4160us  reduce_kernel_stage2
```
After:
```
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   15.15%  164.78ms    908  181.47us  17.087us  500.35us  diff
                   14.83%  161.36ms   2560  63.032us  4.7990us  122.34us  resample_by_picking
                   10.91%  118.63ms    493  240.63us  123.46us  484.73us  grudge_assign_0
                   10.60%  115.31ms   2499  46.142us  6.3670us  92.160us  multiply
                    9.09%  98.829ms   2001  49.389us  6.3670us  93.216us  axpbyz
                    9.03%  98.180ms   3592  27.333us  1.4720us  3.1643ms  [CUDA memcpy DtoH]
                    7.38%  80.274ms    960  83.618us  63.136us  124.64us  resample_by_mat
                    6.77%  73.647ms    160  460.30us  459.17us  463.71us  face_mass
                    6.68%  72.666ms   2228  32.614us  3.3280us  66.400us  axpb
                    3.57%  38.881ms   7040  5.5220us  3.2640us  8.4480us  take
                    3.21%  34.891ms   3649  9.5610us  1.0240us  494.24us  [CUDA memcpy HtoD]
                    1.25%  13.577ms    235  57.775us  3.8400us  68.383us  [CUDA memcpy DtoD]
                    0.57%  6.1862ms     50  123.72us  123.49us  124.48us  grudge_assign_1
                    0.45%  4.9483ms     40  123.71us  123.52us  124.22us  grudge_assign_2
                    0.30%  3.2806ms     12  273.39us  14.112us  482.46us  nodes_0
                    0.12%  1.3401ms      6  223.34us  14.304us  481.57us  actx_special_sqrt
                    0.03%  312.64us    323     967ns     928ns  1.6640us  [CUDA memset]
                    0.03%  277.79us      6  46.298us  6.4640us  86.272us  divide
                    0.02%  215.10us     10  21.510us  21.152us  22.048us  reduce_kernel_stage1
                    0.01%  125.09us      1  125.09us  125.09us  125.09us  actx_special_exp
                    0.01%  81.535us     20  4.0760us  3.7120us  4.6080us  reduce_kernel_stage2
```
(This was for 10 timesteps with `nel_1d = 24`, 3D, order 3.)
I checked the loop sizes (`nelements * n_to_nodes`):

```
min_size = 10580
max_size = 719440
avg_size = 365010.0
```
compared to 1.46M dofs in the volume discretization.
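(For reference, a sketch of how these sizes can be tabulated; the `groups`/`batches` attribute names follow meshmode's connection data structures, but the details here are assumptions, not the code I actually ran:)

```python
# Sketch: per-batch loop sizes (nelements * n_to_nodes), gathered over
# all groups and batches of a discretization connection.
sizes = [
    batch.to_element_indices.shape[0] * batch.result_unit_nodes.shape[-1]
    for grp in connection.groups
    for batch in grp.batches
]
print("min_size =", min(sizes))
print("max_size =", max(sizes))
print("avg_size =", sum(sizes) / len(sizes))
```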
Here is the code generated for the above loop:
```c
#define lid(N) ((int) get_local_id(N))
#define gid(N) ((int) get_group_id(N))
#if __OPENCL_C_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#endif
__kernel void __attribute__ ((reqd_work_group_size(16, 1, 1))) resample_by_picking(__global double *__restrict__ result, int const result_offset, __global double const *__restrict__ ary, int const ary_offset, int const nelements_result, int const nelements_vec, int const n_from_nodes, int const n_to_nodes, int const nelements)
{
  if (-1 + -1 * lid(0) + n_to_nodes >= 0)
    for (int idof_outer = 0; idof_outer <= -1 + -1 * lid(0) + (15 + n_to_nodes + 15 * lid(0)) / 16; ++idof_outer)
      result[result_offset + n_to_nodes * gid(0) + 16 * idof_outer + lid(0)] = ary[ary_offset + n_from_nodes * gid(0) + 16 * idof_outer + lid(0)];
}
```
I'm a little bit confused that it's not using `nelements` (I'm passing it in explicitly to `actx.call_loopy`; I get an error saying it can't be deduced if I omit it). Any ideas?
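For reference, the call looks something like this (paraphrasing; the kwarg names mirror the kernel arguments above, and the surrounding details are from memory):

```python
# Paraphrased call site: nelements is passed explicitly, since no input
# array shape mentions it and loopy therefore cannot deduce it.
result = actx.call_loopy(
    knl,
    ary=ary,
    nelements=nelements,
    nelements_result=nelements_result,
    nelements_vec=nelements_vec,
    n_from_nodes=n_from_nodes,
    n_to_nodes=n_to_nodes,
)["result"]
```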
Thanks for experimenting with this!
> I'm a little bit confused that it's not using `nelements` (I'm passing it in explicitly to `actx.call_loopy`; I get an error saying it can't be deduced if I omit it). Any ideas?
It's OK that that doesn't appear in the kernel. It gets used in the "wrapper" code that computes the OpenCL "grid" bounds. Right now, the (simple) work decomposition is one group per element.
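In loopy terms, the default decomposition amounts to something like the following (a sketch consistent with the generated code above, not the verbatim transform code):

```python
import loopy as lp

# One work-group per element (iel -> g.0); dofs are handled by 16
# work-items (l.0), with a sequential outer loop once idof >= 16.
knl = lp.split_iname(knl, "idof", 16, inner_tag="l.0")
knl = lp.tag_inames(knl, {"iel": "g.0"})
```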
As for removing the indirections, it's super confusing to me that that's not having an effect. It has to... that's a whole bunch of extra memory access that's now not happening (the index reads), and somehow that's not measurable? Wha?
The only explanation I can think of is that there's something else very wrong with this kernel. Speaking of work decomposition, one element per group is stupid (i.e. much too small), especially for surface kernels. (cc @nchristensen) Was this 2D? Nvidia's visual profiler can produce a nice report of "what's wrong with your kernel". It might be worth looking at what it has to say.
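One cheap experiment on the decomposition front might look like this (untested sketch; the chunk sizes are guesses):

```python
# Untested sketch: put several elements in each work-group so that
# groups are no longer tiny for surface kernels.
knl = lp.split_iname(knl, "iel", 8, outer_tag="g.0", inner_tag="l.1")
knl = lp.split_iname(knl, "idof", 16, inner_tag="l.0")
```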
> It's OK that that doesn't appear in the kernel. It gets used in the "wrapper" code that computes the OpenCL "grid" bounds. Right now, the (simple) work decomposition is one group per element.
Ah, ok.
> The only explanation I can think of is that there's something else very wrong with this kernel. Speaking of work decomposition, one element per group is stupid (i.e. much too small), especially for surface kernels. (cc @nchristensen) Was this 2D? Nvidia's visual profiler can produce a nice report of "what's wrong with your kernel". It might be worth looking at what it has to say.
It's 3D. I'll take a look into the visual profiler thing.
Note that the visual profiler (`nvvp`) isn't very usable from Lassen; the network latency for graphical apps is too high. It is usually better to generate a profile with `nvprof`, `scp` it to your local machine, and import it into your local `nvvp`.
Note that for the deep analysis, `nvvp`/`nvprof` can be flaky for Python applications. I've had luck with the following command:

```
$ export PYOPENCL_CTX=':1'
$ nvprof -f --analysis-metrics -o foo.nvvp --replay-mode application python examples/wave-eager.py
```

This will create (or overwrite) the output profile in `foo.nvvp`.
Note that the deep analysis (the `--analysis-metrics` argument) is extremely slow, so you need to use small input parameters.
EDIT: You will need an updated CUDA version:

```
$ ml load cuda/11
```
`resample_by_picking` routinely shows up at the top of our GPU profiles. Here's an example from a run (mirgecom wave-eager, `nel_1d = 24`, 3D, order 3):

It's especially striking that it's at the top of the list because it touches lower-dimensional data (surface vs. volume). `multiply` has a similar number of calls, but it touches volume data, and it completes more quickly.

I think there are two opportunities here that we could try:
IMO it's likely that this will have a benefit (but I obviously can't guarantee it). I think it's worth trying.
cc @lukeolson