pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

SDK + Inspector output time format is inconsistent with delegates #4504

Open Aeranthes opened 1 month ago

Aeranthes commented 1 month ago

Hi again - I think this is probably setup-agnostic, but as a quick rundown: I'm running on an Android device with the SDK enabled. I've got a binary application that generates an etdump file while registering the XNNPack delegate, and I'm running a version of mobilenet_v2 that has been partitioned for the same delegate.

When I print out the resulting event blocks to find the operator timings, I'm seeing the following pattern:

| event_block_name | event_name | raw | p10 (ms) |
| --- | --- | --- | --- |
| Default | Method::init | [18.762083] | 18.76208 |
| Default | Program::load_method | [18.774635] | 18.77464 |
| Execute | Transpose (ND, X32) #1 | [1172000.0] | 1172000 |
| Execute | Convolution (NHWC, F32) IGEMM #1 | [802000.0] | 802000 |
| Execute | Convolution (NHWC, F32) DWConv #1 | [439000.0] | 439000 |
| Execute | Convolution (NHWC, F32) GEMM #1 | [357000.0] | 357000 |
| Execute | Convolution (NHWC, F32) GEMM #2 | [911000.0] | 911000 |

The subject of the issue is specifically that the time format seems inconsistent: the Executorch calls and column headers are recorded in milliseconds, while, from what I gather from XNNPACKProfiler.cpp, the delegate calls are recorded in "PAL ticks" / system ticks?

Is this a bug, and/or is there a way to get the output in a more comparable time unit? Thanks for any assistance.

mcr229 commented 1 month ago

From what I understand, the recorded timings in the xnnpack events are ticks, and the values above can be interpreted as nanoseconds. But I think @Olivia-liu @GregoryComer can speak more on this, and on any plans to unify the timing units.

DzAvril commented 1 month ago

You can write a custom time converter like this

from typing import Union

def convert_xnnpack_delegate_time(
    event_name: Union[str, int], input_time: Union[int, float]
) -> Union[int, float]:
    # XNNPACK delegate events report nanoseconds; convert to milliseconds
    return input_time / (1000 * 1000)

And pass the converter to the Inspector like this

inspector = Inspector(
    etdump_path=args.etdump_path,
    etrecord=args.etrecord_path,
    debug_buffer_path=args.debug_buffer_path,
    source_time_scale=TimeScale(args.source_time_scale),
    target_time_scale=TimeScale(args.target_time_scale),
    delegate_time_scale_converter=convert_xnnpack_delegate_time,
)

This should work.

Aeranthes commented 1 month ago

Thanks for the prompt response.

From the structure of the profiling output, I see that a block of delegate calls is generally followed by a "DELEGATE_CALL" event. Maybe I'm mistaken that this represents the time spent in the delegate from Executorch's side.

If I add up the delegate profiling times for the inference and interpret them as nanoseconds, the preceding block of XNNPack operators' overall time is quite different from the time given by DELEGATE_CALL (20.25ms vs 28.1ms).

This ~40% of missing inference time relative to the overall Method::execute time was one of the reasons I thought I couldn't interpret the results as nanoseconds.

Full table below:

| | event_block_name | event_name | raw | p10 (ms) |
| --- | --- | --- | --- | --- |
| 0 | Default | Method::init | [18.762083] | 18.76208 |
| 1 | Default | Program::load_method | [18.774635] | 18.77464 |
| 2 | Execute | Transpose (ND, X32) #1 | [1172000.0] | 1172000 |
| 3 | Execute | Convolution (NHWC, F32) IGEMM #1 | [802000.0] | 802000 |
| 4 | Execute | Convolution (NHWC, F32) DWConv #1 | [439000.0] | 439000 |
| 5 | Execute | Convolution (NHWC, F32) GEMM #1 | [357000.0] | 357000 |
| 6 | Execute | Convolution (NHWC, F32) GEMM #2 | [911000.0] | 911000 |
| 7 | Execute | Convolution (NHWC, F32) DWConv #2 | [471000.0] | 471000 |
| 8 | Execute | Convolution (NHWC, F32) GEMM #3 | [306000.0] | 306000 |
| 9 | Execute | Convolution (NHWC, F32) GEMM #4 | [646000.0] | 646000 |
| 10 | Execute | Convolution (NHWC, F32) DWConv #3 | [521000.0] | 521000 |
| 11 | Execute | Convolution (NHWC, F32) GEMM #5 | [445000.0] | 445000 |
| 12 | Execute | Add (ND, F32) #1 | [98000.0] | 98000 |
| 13 | Execute | Convolution (NHWC, F32) GEMM #6 | [491000.0] | 491000 |
| 14 | Execute | Convolution (NHWC, F32) DWConv #4 | [194000.0] | 194000 |
| 15 | Execute | Convolution (NHWC, F32) GEMM #7 | [154000.0] | 154000 |
| 16 | Execute | Convolution (NHWC, F32) GEMM #8 | [218000.0] | 218000 |
| 17 | Execute | Convolution (NHWC, F32) DWConv #5 | [155000.0] | 155000 |
| 18 | Execute | Convolution (NHWC, F32) GEMM #9 | [203000.0] | 203000 |
| 19 | Execute | Add (ND, F32) #2 | [37000.0] | 37000 |
| 20 | Execute | Convolution (NHWC, F32) GEMM #10 | [218000.0] | 218000 |
| 21 | Execute | Convolution (NHWC, F32) DWConv #6 | [152000.0] | 152000 |
| 22 | Execute | Convolution (NHWC, F32) GEMM #11 | [202000.0] | 202000 |
| 23 | Execute | Add (ND, F32) #3 | [34000.0] | 34000 |
| 24 | Execute | Convolution (NHWC, F32) GEMM #12 | [216000.0] | 216000 |
| 25 | Execute | Convolution (NHWC, F32) DWConv #7 | [111000.0] | 111000 |
| 26 | Execute | Convolution (NHWC, F32) GEMM #13 | [113000.0] | 113000 |
| 27 | Execute | Convolution (NHWC, F32) GEMM #14 | [578000.0] | 578000 |
| 28 | Execute | Convolution (NHWC, F32) DWConv #8 | [190000.0] | 190000 |
| 29 | Execute | Convolution (NHWC, F32) GEMM #15 | [209000.0] | 209000 |
| 30 | Execute | Add (ND, F32) #4 | [26000.0] | 26000 |
| 31 | Execute | Convolution (NHWC, F32) GEMM #16 | [245000.0] | 245000 |
| 32 | Execute | Convolution (NHWC, F32) DWConv #9 | [181000.0] | 181000 |
| 33 | Execute | Convolution (NHWC, F32) GEMM #17 | [209000.0] | 209000 |
| 34 | Execute | Add (ND, F32) #5 | [25000.0] | 25000 |
| 35 | Execute | Convolution (NHWC, F32) GEMM #18 | [236000.0] | 236000 |
| 36 | Execute | Convolution (NHWC, F32) DWConv #10 | [185000.0] | 185000 |
| 37 | Execute | Convolution (NHWC, F32) GEMM #19 | [213000.0] | 213000 |
| 38 | Execute | Add (ND, F32) #6 | [23000.0] | 23000 |
| 39 | Execute | Convolution (NHWC, F32) GEMM #20 | [236000.0] | 236000 |
| 40 | Execute | Convolution (NHWC, F32) DWConv #11 | [183000.0] | 183000 |
| 41 | Execute | Convolution (NHWC, F32) GEMM #21 | [306000.0] | 306000 |
| 42 | Execute | Convolution (NHWC, F32) GEMM #22 | [478000.0] | 478000 |
| 43 | Execute | Convolution (NHWC, F32) DWConv #12 | [283000.0] | 283000 |
| 44 | Execute | Convolution (NHWC, F32) GEMM #23 | [522000.0] | 522000 |
| 45 | Execute | Add (ND, F32) #7 | [33000.0] | 33000 |
| 46 | Execute | Convolution (NHWC, F32) GEMM #24 | [482000.0] | 482000 |
| 47 | Execute | Convolution (NHWC, F32) DWConv #13 | [291000.0] | 291000 |
| 48 | Execute | Convolution (NHWC, F32) GEMM #25 | [452000.0] | 452000 |
| 49 | Execute | Add (ND, F32) #8 | [30000.0] | 30000 |
| 50 | Execute | Convolution (NHWC, F32) GEMM #26 | [477000.0] | 477000 |
| 51 | Execute | Convolution (NHWC, F32) DWConv #14 | [151000.0] | 151000 |
| 52 | Execute | Convolution (NHWC, F32) GEMM #27 | [224000.0] | 224000 |
| 53 | Execute | Convolution (NHWC, F32) GEMM #28 | [370000.0] | 370000 |
| 54 | Execute | Convolution (NHWC, F32) DWConv #15 | [244000.0] | 244000 |
| 55 | Execute | Convolution (NHWC, F32) GEMM #29 | [365000.0] | 365000 |
| 56 | Execute | Add (ND, F32) #9 | [18000.0] | 18000 |
| 57 | Execute | Convolution (NHWC, F32) GEMM #30 | [369000.0] | 369000 |
| 58 | Execute | Convolution (NHWC, F32) DWConv #16 | [1001000.0] | 1001000 |
| 59 | Execute | Convolution (NHWC, F32) GEMM #31 | [375000.0] | 375000 |
| 60 | Execute | Add (ND, F32) #10 | [17000.0] | 17000 |
| 61 | Execute | Convolution (NHWC, F32) GEMM #32 | [382000.0] | 382000 |
| 62 | Execute | Convolution (NHWC, F32) DWConv #17 | [281000.0] | 281000 |
| 63 | Execute | Convolution (NHWC, F32) GEMM #33 | [717000.0] | 717000 |
| 64 | Execute | Convolution (NHWC, F32) GEMM #34 | [935000.0] | 935000 |
| 65 | Execute | Global Average Pooling (NWC, F32) #1 | [43000.0] | 43000 |
| 66 | Execute | Transpose (ND, X32) #2 | [3000.0] | 3000 |
| 67 | Execute | DELEGATE_CALL | [28.097344] | 28.09734 |
| 68 | Execute | OPERATOR_CALL | [0.003646] | 0.003646 |
| 69 | Execute | native_call_clone.out | [0.122396] | 0.122396 |
| 70 | Execute | OPERATOR_CALL | [0.126719] | 0.126719 |
| 71 | Execute | Fully Connected (NC, F32) GEMM #1 | [889000.0] | 889000 |
| 72 | Execute | DELEGATE_CALL | [0.92] | 0.92 |
| 73 | Execute | Method::execute | [29.22375] | 29.22375 |
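For reference, the 20.25 ms figure can be reproduced with a quick standalone arithmetic sketch (values copied from the XNNPack rows of the table above, interpreted as nanoseconds; this does not use the Inspector API):

```python
# Sum the XNNPack delegate op timings (rows 2-66 of the table above),
# treating the raw values as nanoseconds, and compare against the
# DELEGATE_CALL time that Executorch reports in milliseconds.
xnnpack_op_times_ns = [
    1172000, 802000, 439000, 357000, 911000, 471000, 306000, 646000,
    521000, 445000, 98000, 491000, 194000, 154000, 218000, 155000,
    203000, 37000, 218000, 152000, 202000, 34000, 216000, 111000,
    113000, 578000, 190000, 209000, 26000, 245000, 181000, 209000,
    25000, 236000, 185000, 213000, 23000, 236000, 183000, 306000,
    478000, 283000, 522000, 33000, 482000, 291000, 452000, 30000,
    477000, 151000, 224000, 370000, 244000, 365000, 18000, 369000,
    1001000, 375000, 17000, 382000, 281000, 717000, 935000, 43000, 3000,
]
total_ms = sum(xnnpack_op_times_ns) / 1e6  # ns -> ms
delegate_call_ms = 28.097344               # first DELEGATE_CALL row
print(f"sum of ops: {total_ms:.2f} ms vs DELEGATE_CALL: {delegate_call_ms:.2f} ms")
# sum of ops: 20.25 ms vs DELEGATE_CALL: 28.10 ms
```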
Olivia-liu commented 1 month ago

Hey @Aeranthes, thanks again for your question! Always glad to see someone trying out the tool :). I think there's some overhead in making the delegate call, so that's why the total time is greater than the sum of the operators. @cccclai do you have any knowledge on this?

cccclai commented 1 month ago

Method::execute is the total time for all DELEGATE_CALL and OPERATOR_CALL events. I don't think the delegate call itself would have 8 ms of overhead.
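For what it's worth, the table seems consistent with this: a quick standalone arithmetic sketch (values copied from the milliseconds rows of the table above) shows that the top-level DELEGATE_CALL and OPERATOR_CALL events account for almost all of Method::execute, so the 8 ms gap sits between the summed XNNPack ops and DELEGATE_CALL, not in delegate-call overhead:

```python
# Method::execute minus the top-level DELEGATE_CALL and OPERATOR_CALL
# events (all in milliseconds) leaves only a tiny unaccounted remainder.
top_level_ms = [28.097344, 0.003646, 0.126719, 0.92]
method_execute_ms = 29.22375
overhead_ms = method_execute_ms - sum(top_level_ms)
print(f"unaccounted: {overhead_ms:.3f} ms")
# unaccounted: 0.076 ms
```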

Olivia-liu commented 1 month ago

It should be correct to interpret those values of the delegated operators as nanoseconds for xnnpack. We're working on making the time units consistent for delegates in the Inspector output. The 20.25ms vs 28.1ms gap is indeed weird; there might be some bug unrelated to the time scale that's causing it. I don't think you did anything wrong, since I can reproduce this on my side as well.