Open HanChangHun opened 2 years ago
Hello @HanChangHun It could be due to input/output latency. Can you please share the latency results from the single model benchmark for each segment file in a txt file? Thanks! https://github.com/google-coral/edgetpu/issues/593#issuecomment-1137929277
I changed the profile-based partitioner to perform the partitioning on a single Edge TPU so that the segments share the single Edge TPU's SRAM. So the latency is different from the usual profile-based partitioning of the Inception v2 example.
However, it is hard to attribute such a large time gap to input/output data transfer time.
The logs are as follows:
2022-06-03 23:43:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.14, 0.88, 0.86
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 2.94 ms 0.268 ms 1000 inception_v2_224_quant_segment_0_of_2_edgetpu.tflite
2022-06-03 23:43:53
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.11, 0.85, 0.85
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 6.04 ms 0.223 ms 1000 inception_v2_224_quant_segment_1_of_2_edgetpu.tflite
Another example is Inception v2 split into 4 segments. The gap between the slowest segment (2.51 ms) and the fastest segment (1.09 ms) is about 1.42 ms, which is greater than 1 ms (I set diff_threshold_ns to 1000000); see the sketch after the logs below.
2022-06-03 23:53:31
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.41, 0.27, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 2.51 ms 0.269 ms 2698 inception_v2_224_quant_segment_0_of_4_edgetpu.tflite
2022-06-03 23:53:41
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.35, 0.26, 0.50
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 1.89 ms 0.299 ms 2507 inception_v2_224_quant_segment_1_of_4_edgetpu.tflite
2022-06-03 23:53:48
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.32, 0.25, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 1.72 ms 0.239 ms 2811 inception_v2_224_quant_segment_2_of_4_edgetpu.tflite
2022-06-03 23:53:55
Running ./single_model_benchmark
Run on (16 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48K (x8)
L1 Instruction 32K (x8)
L2 Unified 512K (x8)
L3 Unified 16384K (x1)
Load Average: 0.27, 0.24, 0.49
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Model 1.09 ms 0.168 ms 3925 inception_v2_224_quant_segment_3_of_4_edgetpu.tflite
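For reference, here is a minimal standalone sketch of the check I expected diff_threshold_ns to enforce, using the per-segment benchmark times measured above. This is only my own illustration, not code from the partitioner:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Per-segment times (ms) from the 4-segment benchmark above.
  const std::vector<double> latencies_ms = {2.51, 1.89, 1.72, 1.09};
  const double diff_threshold_ms = 1.0;  // diff_threshold_ns = 1000000

  // Gap between slowest and fastest segment: 2.51 - 1.09 = 1.42 ms.
  const auto [fastest, slowest] =
      std::minmax_element(latencies_ms.begin(), latencies_ms.end());
  const double gap = *slowest - *fastest;

  std::printf("gap = %.2f ms (threshold = %.2f ms): %s\n", gap, diff_threshold_ms,
              gap > diff_threshold_ms ? "exceeds threshold" : "within threshold");
}
```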
Thank you for the fast response.
Hmm, the profile-based partitioner is not intended to perform partitioning on a single Edge TPU. Please check this page for details and the requirements for using this tool: https://github.com/google-coral/libcoral/blob/master/coral/tools/partitioner/README.md. Thanks!
Thank you for your response.
I aimed to use the existing code with only one Edge TPU, so I changed the code that uses multiple Edge TPUs into code that uses only one. There were no other modifications, which is why I suspected that the partitioning logic in the existing code was not comparing the latency of the slowest segment against the latency of the fastest segment.
Since I modified the existing code, I understand it may be difficult for you to answer. Thank you for your kind reply!
Can you please try this code with two TPUs and two segments on the Inception v3 model, and share the logs and the single model benchmark results for the output models? Thanks!
This code doesn't contain the lower-bound and upper-bound update logic, so I changed it in a few places and ran it with co-compilation.
The results are shown below for Inception v3 with 2 segments, 3 segments, and 4 segments. The model looks evenly split, and the gap between the slowest and fastest segments does not exceed diff_threshold_ns (=1000000): roughly 0.80 ms, 0.72 ms, and 0.89 ms, respectively.
# Inception V3 with 2 segments
# 24.1ms and 24.9ms
target_latency: 24704940.8, num_ops: [84 48], latencies: [24120441 24918119]
# Inception V3 with 3 segments
# 16ms, 16.4ms, 16.7ms
target_latency: 16749807.1125, num_ops: [65 38 29], latencies: [16061107 16405574 16782598]
# Inception V3 with 4 segments
# 12.8ms, 13.1ms, 13.3ms and 13.7ms
target_latency: 13378520.2, num_ops: [56 27 25 24], latencies: [12896665 13151985 13315528 13781784]
Your code is very helpful! Thank you!
The `diff_threshold_ns` option of the profiling_based_partitioner is not working as expected. It does not compare the difference (in ns) between the slowest segment (upper bound) and the fastest segment (lower bound); instead, it compares `last_segment_latency` and `target_latency`. So I was able to get a result where the slowest segment is very slow while the fastest segment is very fast.
Maybe the source code (`last_segment_latency - target_latency`) should be changed; a rough sketch of what I mean follows.
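To make the idea concrete, here is a rough sketch of the termination condition I have in mind. This is not the actual libcoral implementation; the function and variable names (e.g. `SegmentsBalanced`) are hypothetical placeholders. The point is only that the search should stop when the gap between the slowest and fastest segments drops below `diff_threshold_ns`, rather than when the last segment alone is close to `target_latency`:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch (not the actual libcoral code): stop when the slowest
// and fastest segments are close enough, instead of checking
// |last_segment_latency - target_latency|.
bool SegmentsBalanced(const std::vector<int64_t>& segment_latencies_ns,
                      int64_t diff_threshold_ns) {
  const auto [fastest, slowest] = std::minmax_element(
      segment_latencies_ns.begin(), segment_latencies_ns.end());
  return (*slowest - *fastest) <= diff_threshold_ns;
}

int main() {
  // Example: the 2-segment Inception v3 latencies reported above (in ns).
  const std::vector<int64_t> latencies = {24120441, 24918119};
  return SegmentsBalanced(latencies, 1000000) ? 0 : 1;
}
```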
Awesome, feel free to submit a Pull Request for this bug for the developers' review. Thanks!
Description
The `diff_threshold_ns` option of the profiling_based_partitioner is not working as expected. It does not compare the difference (in ns) between the slowest segment (upper bound) and the fastest segment (lower bound); instead, it compares `last_segment_latency` and `target_latency`. So I was able to get a result where the slowest segment is very slow while the fastest segment is very fast.
Maybe the source code (`last_segment_latency - target_latency`) should be changed.
### Issue Type
Bug
### Operating System
Mendel Linux, Linux
### Coral Device
M.2 Accelerator with dual Edge TPU
### Other Devices
_No response_
### Programming Language
C++
### Relevant Log Output
```shell
segment latencies
[1.4403, 1.3686, 0.354, 0.161]  # difference is bigger than 1ms
[1.2178, 1.3376, 0.9702, 0.0683, 0.1601]
[1.1891, 1.3306, 0.8717, 0.1092, 0.0511, 0.1604]
[2.9966, 6.0864]  # difference is so big!
[2.6992, 1.9771, 2.9165]
[2.5653, 1.9029, 1.7592, 1.1261]
[2.3772, 1.5753, 1.7227, 1.4968, 0.6235]
```