tlc-pack / relax

[VM] Add per-op profiling support #422

Closed masahi closed 1 year ago

masahi commented 1 year ago

Adds per-op profiling support to Relax VM, in a way similar to how Relay VM is instrumented via the common profiling infra in the runtime.
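As a rough illustration of the idea (not the code in this PR), per-op profiling amounts to timing each call the VM dispatches and accumulating the results by op name. The sketch below is a toy interpreter loop in plain Python; `OpProfiler`, `run`, and the op names are hypothetical and do not correspond to TVM APIs.

```python
import time
from collections import defaultdict


class OpProfiler:
    """Hypothetical collector: accumulates wall-clock time and call counts per op name."""

    def __init__(self):
        self.durations_us = defaultdict(float)
        self.counts = defaultdict(int)

    def call(self, name, func, *args):
        # Time a single call and attribute it to `name`, even if the call raises.
        start = time.perf_counter()
        try:
            return func(*args)
        finally:
            self.durations_us[name] += (time.perf_counter() - start) * 1e6
            self.counts[name] += 1

    def report(self):
        # One (name, duration_us, count) row per op, most expensive first.
        rows = [(n, d, self.counts[n]) for n, d in self.durations_us.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


def run(instructions, value, profiler=None):
    """Toy stand-in for a VM loop: each instruction is a (name, callable) pair."""
    for name, func in instructions:
        if profiler is None:
            value = func(value)  # uninstrumented path
        else:
            value = profiler.call(name, func, value)  # instrumented path
    return value


program = [
    ("double", lambda x: x * 2),
    ("add_one", lambda x: x + 1),
    ("double", lambda x: x * 2),
]

profiler = OpProfiler()
result = run(program, 3, profiler)
for name, duration_us, count in profiler.report():
    print(f"{name:10s} {duration_us:10.2f} us  count={count}")
```

Keeping the uninstrumented path separate means runs without a profiler pay no timing overhead, which is the same motivation as making profiling opt-in in a real runtime.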

Example output using DNNL BYOC, showing per-op timing:

Name                                       Duration (us)  Percent  Device  Count                                                        Argument Shapes  
fused_relax_nn_conv2d_relax_nn_relu_dnnl          454.59    52.14    cpu0      1  float32[1, 64, 56, 56], float32[64, 64, 3, 3], float32[1, 64, 56, 56]  
fused_relax_nn_conv2d_relax_nn_relu1_dnnl         368.99    42.32    cpu0      1  float32[1, 64, 56, 56], float32[64, 64, 3, 3], float32[1, 64, 54, 54]  
vm.builtin.check_tensor_info                        1.69     0.19    cpu0      1                                                 float32[1, 64, 56, 56]  
vm.builtin.match_shape                              1.49     0.17    cpu0      2                                                  float32[64, 64, 3, 3]  
vm.builtin.check_tensor_info                        1.23     0.14    cpu0      2                                                  float32[64, 64, 3, 3]  
vm.builtin.match_shape                              0.98     0.11    cpu0      1                                                 float32[1, 64, 56, 56]  
----------                                                                                                                                               
Sum                                               828.97    95.07              8                                                                         
Total                                             871.93             cpu0      1                                                                         

Configuration
-------------
Number of threads: 6
Executor: VM
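
For reading the table above: the numbers are consistent with each Percent value being that row's duration divided by the device Total, and with Sum being the sum of the listed rows. The following check reproduces the columns under that assumption; the variable names are illustrative only.

```python
# (name, duration in us) copied from the DNNL BYOC report above
rows = [
    ("fused_relax_nn_conv2d_relax_nn_relu_dnnl", 454.59),
    ("fused_relax_nn_conv2d_relax_nn_relu1_dnnl", 368.99),
    ("vm.builtin.check_tensor_info", 1.69),
    ("vm.builtin.match_shape", 1.49),
    ("vm.builtin.check_tensor_info", 1.23),
    ("vm.builtin.match_shape", 0.98),
]
total_us = 871.93  # "Total" row for cpu0


def percent(duration_us, total=total_us):
    # Assumed formula: share of the end-to-end time, rounded to two decimals.
    return round(100.0 * duration_us / total, 2)


percents = [percent(d) for _, d in rows]
sum_us = round(sum(d for _, d in rows), 2)
unattributed_us = round(total_us - sum_us, 2)  # time not attributed to any listed call

print(percents)
print(sum_us, percent(sum_us), unattributed_us)
```

The gap between Total and Sum is the time spent outside the instrumented calls.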

Profiling also works over RPC, as demonstrated below. This run uses the Relay translator and runs the translated module without FuseOps.

Name                          Duration (us)  Percent    Device  Count                                 Argument Shapes  
conv2d1                          705,779.00    51.22  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 54, 54]  
conv2d                           669,589.00    48.60  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]  
relu                                 683.00     0.05  hexagon0      1  float32[1, 64, 56, 56], float32[1, 64, 56, 56]  
relu1                                679.00     0.05  hexagon0      1  float32[1, 64, 54, 54], float32[1, 64, 54, 54]  
vm.builtin.check_tensor_info          28.00     0.00  hexagon0      1                          float32[1, 64, 56, 56]  
vm.builtin.match_shape                25.00     0.00  hexagon0      1                          float32[1, 64, 56, 56]  
----------                                                                                                             
Sum                            1,376,783.00    99.93                6                                                  
Total                                  0.00               cpu0      1                                                  
Total                          1,377,809.00           hexagon0      1                                                  

Configuration
-------------
Number of threads: 4
Executor: VM

In addition to packed function calls, Relay VM also instruments costly VM-specific ops such as AllocTensor and DeviceCopy. Equivalent instrumentation can be added to Relax VM if needed.

@YuchenJin @tqchen @tkonolige @csullivan

slyubomirsky commented 1 year ago

Generally looks good. Does the profiler work with the stateful API? It would be nice to have a test case using the stateful API over RPC with a tuple input/output to make sure everything we expect will be supported.

tqchen commented 1 year ago

A follow-up note: for runtime minimization, we might want an extra option to disable the profiler part. A macro guard is likely sufficient, given that the code is sufficiently isolated.

MasterJH5574 commented 1 year ago

Hi @masahi, could you send the VM profiler to the unity branch? As this is a left-behind feature. Tracking issue #453

masahi commented 1 year ago

> Hi @masahi, could you send the VM profiler to the unity branch? As this is a left-behind feature. Tracking issue #453

Sure, I'll do that today.