openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Performance]: Why is there a Reorder op after VariadicSplit op? #24412

Closed · sitabulaixizawaluduo closed this 2 weeks ago

sitabulaixizawaluduo commented 5 months ago

OpenVINO Version

2024.0.0

Operating System

Ubuntu 22.04 (LTS)

Device used for inference

CPU

OpenVINO installation

Build from source

Programming Language

Python

Hardware Architecture

x86 (64 bits)

Model used

recommend

Model quantization

No

Target Platform

```
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
Address sizes:         52 bits physical, 57 bits virtual
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 143
Model name:            Intel(R) Xeon(R) Gold 6442Y
Stepping:              8
Frequency boost:       enabled
CPU MHz:               2601.000
CPU max MHz:           2601.0000
CPU min MHz:           800.0000
BogoMIPS:              5200.00
Virtualization:        VT-x
L1d cache:             2.3 MiB
L1i cache:             1.5 MiB
L2 cache:              96 MiB
L3 cache:              120 MiB
NUMA node0 CPU(s):     0-23,48-71
NUMA node1 CPU(s):     24-47,72-95
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable: eIBRS with unprivileged eBPF
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr avx512_fp16 amx_tile flush_l1d arch_capabilities
```

Performance issue description

I found that the VariadicSplit op is slower in version 2024.0.0 than in 2023.0.0. I converted the same ONNX file with both versions of mo (2023 and 2024), and at that point the two IRs have the same structure when viewed in Netron (screenshots: split2024, split2023). But when I ran them through benchmark_app,

```
taskset -c 0-23 benchmark_app -m split.xml -report_type detailed_counters -nstreams 24 -nthreads 24 -hint none -exec_graph_path benchmark_new.xml
```

and inspected the dumped execution graph (benchmark_new.xml), I found that the 2023 version has an extra Reorder node after the Split op (screenshots: benchmark2024, benchmark2023).

The detailed_counters reports (detailed.csv) for 2024 and 2023 are attached as screenshots.

1. Which operation causes the difference between the two versions?
2. Why does the VariadicSplit operation show as "not run" in version 2023 but as executed in version 2024?
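To check for Reorder nodes in the dumped execution graph without opening it in Netron, the XML can be parsed directly. A minimal sketch, assuming the IR-style serialization where executed layers appear as `<layer type="...">` elements:

```python
# Minimal sketch: count Reorder layers in the execution graph dumped via
# -exec_graph_path, using only the standard library.
import xml.etree.ElementTree as ET

root = ET.parse("benchmark_new.xml").getroot()
reorders = [layer.get("name")
            for layer in root.iter("layer")
            if layer.get("type") == "Reorder"]
print(len(reorders), "Reorder layers:", reorders)
```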

Step-by-step reproduction

No response


rkazants commented 5 months ago

Hi @sitabulaixizawaluduo, did you check whether the latest 2024.1 release has the same issue?

@dmitry-gorokhov, @mg-intel, please pay attention to performance degradation in 2024.0 release

Best regards, Roman

sitabulaixizawaluduo commented 5 months ago

> Hi @sitabulaixizawaluduo, did you check whether the latest 2024.1 release has the same issue?
>
> @dmitry-gorokhov, @mg-intel, please pay attention to performance degradation in 2024.0 release
>
> Best regards, Roman

Thanks for the reply! I have not tried 2024.1. Do you mean that 2024.0 does have a performance drop?

dmitry-gorokhov commented 5 months ago

@sitabulaixizawaluduo The Reorder op is responsible for the memory copy in this context. In 2023 the Split operation does nothing and the real memory copy (from the Split input to the model outputs) is performed by the Reorder ops. In 2024 the behavior was changed (not sure why) and the Split operation itself performs the copy, so Reorders are not needed. Based on your benchmark, the 2024 behavior provides worse performance, which we should obviously fix. Could you attach the IR files so we can analyze the issue and propose a solution?

sitabulaixizawaluduo commented 5 months ago

> @sitabulaixizawaluduo The Reorder op is responsible for the memory copy in this context. In 2023 the Split operation does nothing and the real memory copy (from the Split input to the model outputs) is performed by the Reorder ops. In 2024 the behavior was changed (not sure why) and the Split operation itself performs the copy, so Reorders are not needed. Based on your benchmark, the 2024 behavior provides worse performance, which we should obviously fix. Could you attach the IR files so we can analyze the issue and propose a solution?

Thanks! You can use the code above to get the ONNX file, and then build the IR file with the ovc command:

```
ovc model.onnx --compress_to_fp16 False --output_model /data
```

dmitry-gorokhov commented 5 months ago

> Thanks! You can use the code above to get the ONNX file, and then build the IR file with the ovc command: `ovc model.onnx --compress_to_fp16 False --output_model /data`

Which code? Maybe I am missing something.

YuChern-Intel commented 5 months ago

The code is shared in #24288. Reposting it here:

```python
import numpy as np
import onnx
from onnx import helper
from onnx import TensorProto

index = [1, 1, 1, 1, 1, 1, 1, 1, 10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         30, 30, 30, 1, 1, 1, 1, 1, 1, 1, 1, 30, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 30, 1, 30, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
split = np.array(index).astype(np.int32)

input_1 = helper.make_tensor_value_info('input_1', TensorProto.FLOAT, [256, 279, 81])
initializers = [onnx.helper.make_tensor(
    name='split',
    data_type=TensorProto.INT32,
    dims=[96],
    vals=split.flatten().tolist())]

outputs_list = []
for i in range(96):
    outputs_list.append(
        helper.make_tensor_value_info('output' + str(i + 1), TensorProto.FLOAT,
                                      [256, index[i], 81]))

node_def = onnx.helper.make_node(
    "Split",
    inputs=["input_1", "split"],
    outputs=["output" + str(i + 1) for i in range(96)],
    axis=np.int32(1),
)
graph_def = helper.make_graph(
    [node_def],
    'test-model',
    [input_1],
    outputs_list,
    initializer=initializers,
)
model_def = helper.make_model(graph_def, producer_name='onnx-example',
                              opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model_def)
onnx.save(model_def, "signal_split_13_new.onnx")
```
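To see which ops actually execute (including any plugin-inserted Reorder nodes) without running benchmark_app, the compiled model's runtime graph can be inspected from Python. A minimal sketch, assuming the `openvino` Python package (2023.1 or newer) and the `layerType` runtime-info key carried by execution-graph nodes:

```python
# Minimal sketch: compile the generated model on CPU and list the executed ops,
# including any plugin-inserted Reorder nodes.
import openvino as ov

core = ov.Core()
model = core.read_model("signal_split_13_new.onnx")
compiled = core.compile_model(model, "CPU")

# get_runtime_model() returns the plugin's execution graph; each node carries
# its executed layer type in the "layerType" runtime-info entry.
for node in compiled.get_runtime_model().get_ordered_ops():
    print(node.get_friendly_name(), node.get_rt_info()["layerType"])
```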

sitabulaixizawaluduo commented 5 months ago

I found that this Reorder op doesn't always appear: if one of the branches after the Split continues with further operations (instead of going directly to a model output), then the Reorder op doesn't appear. See the sketch below.
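A small variant of the generator above can reproduce this observation (a hedged sketch; the [2, 9, 4] shape, the `Relu` tail, and all names are illustrative choices, not from the original model):

```python
# Hedged sketch: one Split output feeds a Relu before becoming a model output,
# so that branch "continues" past the Split; the other branches end directly
# at model outputs.
import onnx
from onnx import helper, TensorProto

index = [3, 3, 3]  # split a [2, 9, 4] input along axis 1 into three [2, 3, 4] chunks
split_init = helper.make_tensor('split', TensorProto.INT64, [3], index)

inp = helper.make_tensor_value_info('input_1', TensorProto.FLOAT, [2, 9, 4])
split_outs = ['split_out' + str(i) for i in range(3)]
split_node = helper.make_node('Split', ['input_1', 'split'], split_outs, axis=1)

# Branch 0 continues with a Relu; branches 1 and 2 go straight to model outputs.
relu_node = helper.make_node('Relu', [split_outs[0]], ['relu_out'])

outputs = [helper.make_tensor_value_info('relu_out', TensorProto.FLOAT, [2, 3, 4])]
outputs += [helper.make_tensor_value_info(name, TensorProto.FLOAT, [2, 3, 4])
            for name in split_outs[1:]]

graph = helper.make_graph([split_node, relu_node], 'split-branch', [inp], outputs,
                          initializer=[split_init])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, 'split_with_branch.onnx')
```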

LinGeLin commented 1 month ago

(profiling screenshots attached)

I don't know why the Reorder appears and takes up a large part of the time. @dmitry-gorokhov, has it been fixed?

luweizhou2016 commented 3 weeks ago

@sitabulaixizawaluduo, thanks! I think you are doing an op-level performance test on Split: splitting the [256,279,81] input along the "279" dimension into 96 outputs with planar layout.

1. About the perf degradation: I can't reproduce it. I ran benchmark_app on NUMA node 0 of my Intel(R) Xeon(R) Gold 6346 CPU with the dumped ONNX model:

```
numactl -m 0 -C 0-15 ./benchmark_app -m ~/signal_split_13_new.onnx -nstreams 16 -nthreads 16 -hint none -t 10
```

| Branch | Build | FPS |
|---|---|---|
| master | 2024.5.0-16666-a87851d56c9 | 1283.93 |
| releases/2023/0 | 98a33ba7704ee62fe688175e9c7f53ed239ec4c7 | 693.43 |
| releases/2024/0 | 2024.0.0-14585-5d852eb14c4 | 727.24 |

The perf data shows master > 2024.0 > 2023.0, so there is no perf issue in this single-op test.

2. For your question about the Reorder: it is related to the memory descriptor and layout. In general, splitting [256,279,81] in planar (nchw) layout along the channel axis cannot be implemented without a memory copy, because the batch is not 1. In 2023.0, Split did not do the copy; it just set its output memory descriptors to strided (non-dense) descriptors over the input memory. That requires the output consumers to support strided memory, and the Result nodes do not, so a Reorder is needed to convert the strided memory to dense memory even though the layout (nchw) is the same. That is why the Reorders are inserted: the Reorders do the real splitting work by copying data, while the Split node merely exposes strided memory descriptors to its consumers. That is also why Split shows as "not run".

On the master branch, Split outputs dense memory to its consumers so that they can reuse it, which means the Split node itself does the memory copy. That is why Split consumes CPU time and no extra Reorder is needed. I think we also added some planar-layout copy optimizations to the Split node in a recent release to speed up the copy, which is why the performance has improved. The numpy sketch below illustrates why the copy is unavoidable in the first place.
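As a rough analogy (numpy, not the plugin's actual code): slicing along a non-outermost axis yields a strided view, and materializing it as dense memory forces a copy, which is exactly the work done by the Reorder (2023.0) or by the Split node itself (2024.x):

```python
import numpy as np

x = np.zeros((256, 279, 81), dtype=np.float32)

chunk = x[:, 0:30, :]                 # one split output as a strided view, no copy yet
print(chunk.flags['C_CONTIGUOUS'])    # False: each batch element's 30*81 block
                                      # sits inside a 279*81-element stride

dense = np.ascontiguousarray(chunk)   # producing dense memory requires a real copy
print(dense.flags['C_CONTIGUOUS'])    # True
```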

Hope this resolves your question.

luweizhou2016 commented 3 weeks ago

@yuxu42 @mg-intel, I don't think this is a bug. It is just a change in the internally supported memory-descriptor behavior inside the CPU plugin's Split node. Thanks!