google-coral / libcoral

C++ API for ML inferencing and transfer-learning on Coral devices
https://coral.ai
Apache License 2.0
79 stars 43 forks

profiling_based_partitioner doesn't divide segment times evenly. #23

Open HanChangHun opened 2 years ago

HanChangHun commented 2 years ago

Description

The diff_threshold_ns option of the profiling_based_partitioner is not working as expected.

It does not compare the difference (in ns) between the slowest segment (upper bound) and the fastest segment (lower bound); instead, it compares last_segment_latency against target_latency.

As a result, I was able to get partitions where the slowest segment is very slow while the fastest segment is very fast.

Perhaps the source code (last_segment_latency - target_latency) should be changed.
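
To make the mismatch concrete, here is a minimal sketch of the two stopping conditions (hypothetical helper names, not the actual libcoral code):

```cpp
// Illustrative sketch only: these helpers are hypothetical, not the
// actual libcoral symbols.
#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Stopping condition as reported: only the last segment's latency is
// compared against the target latency, so the other segments can still
// be far apart from each other.
bool StopAsImplemented(int64_t last_segment_latency_ns,
                       int64_t target_latency_ns,
                       int64_t diff_threshold_ns) {
  return std::abs(last_segment_latency_ns - target_latency_ns) <
         diff_threshold_ns;
}

// Stopping condition the issue suggests: bound the spread between the
// slowest and fastest segments, so no segment can drift far from the rest.
bool StopAsSuggested(const std::vector<int64_t>& segment_latencies_ns,
                     int64_t diff_threshold_ns) {
  const auto minmax = std::minmax_element(segment_latencies_ns.begin(),
                                          segment_latencies_ns.end());
  return (*minmax.second - *minmax.first) < diff_threshold_ns;
}
```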

### Issue Type
Bug

### Operating System
Mendel Linux, Linux

### Coral Device
M.2 Accelerator with dual Edge TPU

### Other Devices
_No response_

### Programming Language
C++

### Relevant Log Output
```shell
segment latencies
[1.4403, 1.3686, 0.354, 0.161] # difference is bigger than 1ms
[1.2178, 1.3376, 0.9702, 0.0683, 0.1601]
[1.1891, 1.3306, 0.8717, 0.1092, 0.0511, 0.1604]
[2.9966, 6.0864] # difference is so big!
[2.6992, 1.9771, 2.9165]
[2.5653, 1.9029, 1.7592, 1.1261]
[2.3772, 1.5753, 1.7227, 1.4968, 0.6235]
```
hjonnala commented 2 years ago

Hello @HanChangHun, it could be due to input/output latency. Can you please share the latency results from the single model benchmark for each segment file in a txt file? Thanks! https://github.com/google-coral/edgetpu/issues/593#issuecomment-1137929277

HanChangHun commented 2 years ago

I changed the profiling-based partitioner to perform the partitioning on a single Edge TPU and to share that single Edge TPU's SRAM. So the latency differs from the usual profiling-based partitioning of the Inception v2 example.

However, a time gap this large is hard to attribute to input/output data transfer time.

The logs are as follows:

```
2022-06-03 23:43:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.14, 0.88, 0.86
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         2.94 ms        0.268 ms         1000 inception_v2_224_quant_segment_0_of_2_edgetpu.tflite

2022-06-03 23:43:53
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.11, 0.85, 0.85
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         6.04 ms        0.223 ms         1000 inception_v2_224_quant_segment_1_of_2_edgetpu.tflite
```

Another example is Inception v2 split into 4 segments. The gap between the slowest latency and the fastest latency is greater than 1 ms. (I set diff_threshold_ns to 1000000.)

```
2022-06-03 23:53:31
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.41, 0.27, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         2.51 ms        0.269 ms         2698 inception_v2_224_quant_segment_0_of_4_edgetpu.tflite

2022-06-03 23:53:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.35, 0.26, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         1.89 ms        0.299 ms         2507 inception_v2_224_quant_segment_1_of_4_edgetpu.tflite

2022-06-03 23:53:48
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.32, 0.25, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         1.72 ms        0.239 ms         2811 inception_v2_224_quant_segment_2_of_4_edgetpu.tflite

2022-06-03 23:53:55
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48K (x8)
  L1 Instruction 32K (x8)
  L2 Unified 512K (x8)
  L3 Unified 16384K (x1)
Load Average: 0.27, 0.24, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Model         1.09 ms        0.168 ms         3925 inception_v2_224_quant_segment_3_of_4_edgetpu.tflite
```

Thank you for the fast response.

hjonnala commented 2 years ago

Hmm.. the profiling-based partitioner is not intended to perform the partitioning on a single Edge TPU. Please check this page for details and the requirements for using this tool: https://github.com/google-coral/libcoral/blob/master/coral/tools/partitioner/README.md. Thanks!

HanChangHun commented 2 years ago

Thank you for your response.

I aimed to use the existing code with only one Edge TPU, so I changed the code that uses multiple Edge TPUs into code that uses only one. There were no other modifications, which is why I thought the partitioning logic in the existing code was not considering the latency of the slowest segment versus the latency of the fastest segment.

Since I modified the existing code, it may be difficult for you to answer. Thank you for your kind reply!

hjonnala commented 2 years ago

Can you please try this code with two TPUs and two segments on the Inception v3 model, and share the logs and single model benchmark results for the output models? Thanks!

HanChangHun commented 2 years ago

This code doesn't contain the lower-bound and upper-bound update logic, so I changed it in a few places and ran it with co-compilation.
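
For context, a rough sketch of what such bound updates might look like (hypothetical names and simplified logic; not the actual libcoral code or my exact change):

```cpp
// Illustrative sketch only: bisecting the target latency between a lower
// and an upper bound until they are within diff_threshold_ns.
#include <cstdint>
#include <functional>
#include <vector>

// partition_and_profile(target_ns) is assumed to partition the model so
// that every segment except the last stays under target_ns, then return
// the profiled per-segment latencies.
int64_t BisectTargetLatency(
    int64_t lower_bound_ns, int64_t upper_bound_ns,
    int64_t diff_threshold_ns,
    const std::function<std::vector<int64_t>(int64_t)>&
        partition_and_profile) {
  while (upper_bound_ns - lower_bound_ns > diff_threshold_ns) {
    const int64_t target_ns = (lower_bound_ns + upper_bound_ns) / 2;
    const std::vector<int64_t> latencies = partition_and_profile(target_ns);
    if (latencies.back() > target_ns) {
      // The leftover last segment is still too slow: the target was too
      // aggressive, so raise the lower bound.
      lower_bound_ns = target_ns;
    } else {
      // Everything fits under the target: try a tighter target.
      upper_bound_ns = target_ns;
    }
  }
  return upper_bound_ns;
}
```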

The results are shown below for Inception v3 with 2, 3, and 4 segments. The model looks evenly split, and the gap between the slowest and fastest segments does not exceed diff_threshold_ns (=1000000).

```
# Inception V3 with 2 segments
# 24.1ms and 24.9ms
target_latency:  24704940.8, num_ops: [84 48], latencies: [24120441 24918119]

# Inception V3 with 3 segments
# 16ms, 16.4ms, 16.7ms
target_latency:  16749807.1125, num_ops: [65 38 29], latencies: [16061107 16405574 16782598]

# Inception V3 with 4 segments
# 12.8ms, 13.1ms, 13.3ms and 13.7ms
target_latency:  13378520.2, num_ops: [56 27 25 24], latencies: [12896665 13151985 13315528 13781784]
```

Your code is very helpful! Thank you!

hjonnala commented 2 years ago

> The diff_threshold_ns option of the profiling_based_partitioner is not working as expected.
>
> It does not compare the difference (in ns) between the slowest segment (upper bound) and the fastest segment (lower bound); instead, it compares last_segment_latency against target_latency.
>
> As a result, I was able to get partitions where the slowest segment is very slow while the fastest segment is very fast.
>
> Perhaps the source code (last_segment_latency - target_latency) should be changed.

Awesome. Feel free to submit a Pull Request for this bug for the developers' review. Thanks!