Open chentong319 opened 3 years ago
@chentong319 Thanks for this very detailed investigation.
Just to be clear, the slowdowns above are using the current implementation with dynamic vs. static vector shapes, right?
It would also be interesting to learn why the later ops get increasingly slower. Are they increasingly larger?
From having looked at the assembly, even for the simplest loop:
for (int i = 0; i < 64; i++) a[i] = b[i] + c[i];
if the loop is unrolled by the backend, there is literally no scheduling of the reads (b and c) past any write (a), because memory disambiguation is quite broken once we are out of MLIR. For this particular example, the only way to move read operations past write operations was to have MLIR unroll the loop.
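As a rough C-level illustration (mine, not from this issue), this is the aliasing problem in miniature: without alias information the backend must assume the output may overlap the inputs, whereas restrict (or MLIR-level unrolling over memrefs known to be distinct) removes that constraint:

```c
/* Sketch only. Without aliasing info, the compiler must assume `a` may
 * overlap `b` or `c`, so even after unrolling it cannot move the loads of
 * b[i+1]/c[i+1] above the store to a[i]. */
void add_may_alias(float *a, const float *b, const float *c) {
  for (int i = 0; i < 64; i++)
    a[i] = b[i] + c[i];
}

/* With restrict, the arrays are declared non-overlapping, so the backend is
 * free to schedule reads from later iterations past earlier writes. */
void add_no_alias(float *restrict a, const float *restrict b,
                  const float *restrict c) {
  for (int i = 0; i < 64; i++)
    a[i] = b[i] + c[i];
}
```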
With static sizes, we agree that memory bundling can reuse the memory used for temp tensors quite well. Even if this is hard for arbitrary dynamic sizes, having malloc/free just before/after every op should technically give nearly the same footprint as the static sizes (at least the same order of magnitude), and I don't believe the overhead of malloc/free should be too high if the computations (such as matmul/conv) are nontrivial, as computations are at least O(N^3) compared to the O(1) cost of malloc/free.
Right now, our policy is to place malloc/free at the beginning/end of the function. That could be parameterized to before/after each op in the presence of dynamic tensors, or malloc/free could be moved toward the begin/end of the function vs. the begin/end of the operation depending on static/dynamic tensors.
Hi Tong,
For the operations, I focused on Mul for now, which is quite simple.
For unknown dims, Mul needs more operations to check broadcasting.
In the next step, we need to check the bc file to verify what the final code is. How do we generate the bc file?
There are two ways:
1. Use --preserveBitcode:
onnx-mlir test.onnx --preserveBitcode
You will get bitcode files with and without optimizations.
2. Use --EmitLLVMIR, then:
$ mlir-translate test.mlir -mlir-to-llvmir -o test.ll
$ llvm-as test.ll -o test.bc
But you cannot read test.bc since it is not a human-readable file. To get human-readable output from the bitcode, use llvm-dis test.bc -o test.dis.
mlir-translate, llvm-as, and llvm-dis are tools in LLVM (in the same folder as FileCheck).
@tungld Thanks. I used the --preserveBitcode flag and llvm-dis to get the .ll code.
My previous concern about the load that gets the dim size in the dynamic version is not an issue; LLVM moved that load out of the loop.
The check for broadcasting is there in the dynamic version, but the load-and-multiply part is the same in both the dynamic and static versions. For this model, if MLIR provided support for symbolic dim sizes, the broadcast check could be eliminated, since the shapes of the two inputs come from the same tensor.
One concern for dynamic tensors is that their space cannot be reused. If we bundle them, the total space might be huge.
When a memref is generated, memref.alloc is inserted. However, the dealloc is not added unless the output is not used. In general, the dealloc of a memref should be added at the end of its live range, which cannot be done locally within one op. Currently, we only have one free at the end of the program for the bundled space. We need to enhance the memory management code for dynamic shapes.
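A minimal C sketch (hypothetical, not generated code) of the difference being described: freeing each temp at the end of its live range bounds peak memory, whereas a single free at the end keeps every dynamic temp live until the function returns:

```c
#include <stdlib.h>

/* Hypothetical 3-op chain with dynamic-sized temps. */
void free_at_end_of_live_range(size_t n) {
  float *t1 = malloc(n * sizeof(float)); /* op1 output */
  float *t2 = malloc(n * sizeof(float)); /* op2 output, reads t1 */
  free(t1);                              /* op2 is t1's last use */
  float *t3 = malloc(n * sizeof(float)); /* op3 output, reads t2 */
  free(t2);
  /* at most two temps are live at any point */
  free(t3);
}

void free_at_function_end(size_t n) {
  float *t1 = malloc(n * sizeof(float));
  float *t2 = malloc(n * sizeof(float));
  float *t3 = malloc(n * sizeof(float));
  /* all temps stay live until here, so peak memory grows with the number
   * of ops when dynamic buffers cannot be bundled/reused */
  free(t1); free(t2); free(t3);
}
```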
For unknown dims, Mul needs more operations to check broadcasting.
@tungld Interesting: that is, even if we have the ranks, we still need to test whether a dim is != 1, as that would still force broadcasting, doesn't it? This speaks in favor of an N>1 dynamic number, as, for example, a model would typically know that the "sizes" of an image are runtime values but would not invoke "broadcasting" of pixels among these runtime image sizes, if you see what I mean...
@tungld Interesting: that is, even if we have the ranks, we still need to test whether a dim is != 1, as that would still force broadcasting, doesn't it?
Right.
This speaks in favor of an N>1 dynamic number, as, for example, a model would typically know that the "sizes" of an image are runtime values but would not invoke "broadcasting" of pixels among these runtime image sizes, if you see what I mean...
I see. We can make the handling of broadcasting a bit smarter here.
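A minimal C sketch (mine, not the actual onnx-mlir lowering) of the runtime check being discussed: with an unknown dim, each access carries a select on dim == 1; if analysis proves the two shapes are identical, the select disappears:

```c
/* With an unknown dim, the index of a possibly-broadcast input collapses
 * to 0 when that dim happens to be 1; this is a per-access runtime check. */
float load_maybe_broadcast(const float *x, long dim, long i, long stride) {
  long idx = (dim == 1) ? 0 : i;  /* runtime broadcast check */
  return x[idx * stride];
}

/* If the two input shapes are known to be equal (even only symbolically),
 * broadcasting cannot occur and the plain index can be used. */
float load_no_broadcast(const float *x, long i, long stride) {
  return x[i * stride];
}
```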
In general, the dealloc of a memref should be added at the end of its live range, which cannot be done locally within one op.
@chentong319 if you want to quickly test this to see any performance change, you can use an MLIR pass, buffer-deallocation. It automatically inserts dealloc just at the end of the live range.
@tungld thanks for the suggestion; let's try the existing code first before modifying our own.
@chentong319 if you want to quickly test this to see any performance change, you can use an MLIR pass, buffer-deallocation. It automatically inserts dealloc just at the end of the live range.
@tungld Excellent suggestion. I did the experiment in the following steps:
# v6.mlir is a dynamic version with 25 Conv
Debug/bin/onnx-mlir --EmitONNXIR v6.mlir
cp v6.onnx.mlir v6-dealloc-1.mlir
# Skip our own buffer-related transformations
Debug/bin/onnx-mlir-opt --convert-onnx-to-krnl v6-dealloc-1.mlir > v6-dealloc-2.mlir
Debug/bin/onnx-mlir-opt -canonicalize --convert-krnl-to-affine v6-dealloc-2.mlir > v6-dealloc-3.mlir
# delete our own dealloc before return
vi v6-dealloc-3.mlir
~/llvm-project/build/bin/mlir-opt -allow-unregistered-dialect --buffer-deallocation v6-dealloc-3.mlir > v6-dealloc-4.mlir
Debug/bin/onnx-mlir --preserveBitcode v6-dealloc-4.mlir
The result: there is no significant change in the overall execution time, so the slowdown comes from other factors. I will check the impact of generating code with a no-broadcast assumption.
@chentong319 great, looking forward to hearing what we get now. This option can be run in all cases, after bundling terminates, in case there are un-bundled malloc/free.
Some corrections on the result of using the MLIR buffer-deallocation pass. Even though there is no significant change in overall execution time, there is a notable change in execution time for the onnx.Mul op. Below is a comparison of execution times for the first 22 Mul operations; the baseline is the dynamic version compiled with onnx-mlir.
onnx.Mul | vs. use dealloc | vs. static |
---|---|---|
1 | 1.118986 | 1.40364468 |
2 | 1.14244059 | 7.32108377 |
3 | 0.81698702 | 5.14030296 |
4 | 0.91209709 | 6.43561899 |
5 | 1.27468225 | 6.96805211 |
6 | 1.34997306 | 9.07578612 |
7 | 1.74610699 | 11.1936703 |
8 | 1.36749926 | 42.7274698 |
9 | 1.49653903 | 28.9776616 |
10 | 1.66238222 | 39.7137453 |
11 | 1.45076816 | 39.4212524 |
12 | 1.48723424 | 40.8446663 |
13 | 1.88158898 | 42.286802 |
14 | 1.38741428 | 37.8878981 |
15 | 1.51278562 | 40.3782091 |
16 | 1.53740106 | 44.6491948 |
17 | 1.45878552 | 43.0196265 |
18 | 1.51503751 | 174.397863 |
19 | 1.55616374 | 181.753731 |
20 | 1.57064024 | 172.387887 |
21 | 1.50640728 | 180.047742 |
22 | 1.72477148 | 128.940577 |
Inspired by this issue, I created PR #758 to allow passing static shapes to a dynamic model to avoid the performance overhead. It may be used to test static/dynamic models.
I further investigated the impact of broadcasting and found that the code for possible broadcasting is NOT the major source of slowdown: execution time for onnx.Mul is reduced only by about 15% on average. With dynamic shapes, whether broadcasting occurs or not cannot be determined at compile time, so onnx-mlir generates code with select operations. However, no broadcasting occurs in this model. We could analyze whether the shapes of the two tensors are the same through data flow; for this investigation, I simply changed the code.
To me, the differences in the .ll files between the dynamic version without broadcasting and the static-shape version are in the loop bounds only. The static shape is 1x32x208x208 while the dynamic shape is ?x32x?x?. The loop over the dim of size 1 is removed, and for the constant upper bound the bound check is moved to the end of the loop body. It seems to me that we have to look into what further optimization happens from the bitcode to the binary.
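As a rough C paraphrase of the two loop nests (my simplification; d0/d2/d3 are hypothetical names for the runtime dims, and the real code indexes through memref strides):

```c
/* Static 1x32x208x208: the loop over the size-1 dim is removed and all trip
 * counts are compile-time constants. */
void mul_static(float *out, const float *a, const float *b) {
  for (long c = 0; c < 32; c++)
    for (long h = 0; h < 208; h++)
      for (long w = 0; w < 208; w++) {
        long i = c * 208 * 208 + h * 208 + w;
        out[i] = a[i] * b[i];
      }
}

/* Dynamic ?x32x?x?: d0, d2, d3 come from the memref descriptor at runtime,
 * so the trip counts are unknown to the later optimizations. */
void mul_dynamic(float *out, const float *a, const float *b,
                 long d0, long d2, long d3) {
  for (long n = 0; n < d0; n++)
    for (long c = 0; c < 32; c++)
      for (long h = 0; h < d2; h++)
        for (long w = 0; w < d3; w++) {
          long i = ((n * 32 + c) * d2 + h) * d3 + w;
          out[i] = a[i] * b[i];
        }
}
```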
The .ll file for static shape is given below
br label %.preheader27, !dbg !176
.preheader27: ; preds = %.preheader28, %465
%446 = phi i64 [ 0, %.preheader28 ], [ %466, %465 ]
%447 = mul nuw nsw i64 %446, 43264
br label %.preheader, !dbg !177
.preheader: ; preds = %.preheader27, %462
%448 = phi i64 [ 0, %.preheader27 ], [ %463, %462 ]
%449 = mul nuw nsw i64 %448, 208
%450 = add nuw nsw i64 %447, %449
br label %451, !dbg !178
451: ; preds = %.preheader, %451
%452 = phi i64 [ 0, %.preheader ], [ %460, %451 ]
%453 = add nuw nsw i64 %450, %452, !dbg !179
%454 = getelementptr float, float* %53, i64 %453, !dbg !179
%455 = load float, float* %454, align 4, !dbg !179
%456 = getelementptr float, float* %51, i64 %453, !dbg !180
%457 = load float, float* %456, align 4, !dbg !180
%458 = fmul float %455, %457, !dbg !181
%459 = getelementptr float, float* %50, i64 %453, !dbg !182
store float %458, float* %459, align 4, !dbg !182
%460 = add nuw nsw i64 %452, 1, !dbg !178
%461 = icmp ult i64 %452, 207, !dbg !178
br i1 %461, label %451, label %462, !dbg !178
462: ; preds = %451
%463 = add nuw nsw i64 %448, 1, !dbg !177
%464 = icmp ult i64 %448, 207, !dbg !177
br i1 %464, label %.preheader, label %465, !dbg !177
465: ; preds = %462
%466 = add nuw nsw i64 %446, 1, !dbg !176
%467 = icmp ult i64 %446, 31, !dbg !176
br i1 %467, label %.preheader27, label %468, !dbg !176
468:
The .ll file for dynamic shape without broadcast is given below
%548 = tail call i8* @malloc(i64 %57), !dbg !188
%549 = bitcast i8* %548 to float*, !dbg !188
br i1 %20, label %.preheader30, label %._crit_edge73, !dbg !189
.preheader30: ; preds = %._crit_edge83, %.us-lcssa.us
%550 = phi i64 [ %573, %.us-lcssa.us ], [ 0, %._crit_edge83 ]
%551 = mul i64 %550, %54
br i1 %60, label %.preheader29.us, label %.us-lcssa.us, !dbg !190
.preheader29.us: ; preds = %.preheader30, %._crit_edge67.us
%552 = phi i64 [ %555, %._crit_edge67.us ], [ 0, %.preheader30 ]
%553 = mul i64 %552, %12
%554 = add i64 %553, %551
br i1 %19, label %.preheader.us.us, label %._crit_edge67.us, !dbg !191
._crit_edge67.us: ; preds = %._crit_edge.us.us, %.preheader29.us
%555 = add nuw nsw i64 %552, 1, !dbg !190
%556 = icmp ult i64 %552, 31, !dbg !190
br i1 %556, label %.preheader29.us, label %.us-lcssa.us, !dbg !190
.preheader.us.us: ; preds = %.preheader29.us, %._crit_edge.us.us
%557 = phi i64 [ %560, %._crit_edge.us.us ], [ 0, %.preheader29.us ]
%558 = mul i64 %557, %5
%559 = add i64 %554, %558
br label %562, !dbg !192
._crit_edge.us.us: ; preds = %562
%560 = add nuw nsw i64 %557, 1, !dbg !191
%561 = icmp slt i64 %560, %4, !dbg !191
br i1 %561, label %.preheader.us.us, label %._crit_edge67.us, !dbg !191
562: ; preds = %562, %.preheader.us.us
%563 = phi i64 [ 0, %.preheader.us.us ], [ %571, %562 ]
%564 = add i64 %559, %563, !dbg !193
%565 = getelementptr float, float* %430, i64 %564, !dbg !193
%566 = load float, float* %565, align 4, !dbg !193
%567 = getelementptr float, float* %546, i64 %564, !dbg !194
%568 = load float, float* %567, align 4, !dbg !194
%569 = fmul float %566, %568, !dbg !195
%570 = getelementptr float, float* %549, i64 %564, !dbg !196
store float %569, float* %570, align 4, !dbg !196
%571 = add nuw nsw i64 %563, 1, !dbg !192
%572 = icmp slt i64 %571, %5, !dbg !192
br i1 %572, label %562, label %._crit_edge.us.us, !dbg !192
.us-lcssa.us: ; preds = %._crit_edge67.us, %.preheader30
%573 = add nuw nsw i64 %550, 1, !dbg !189
%574 = icmp slt i64 %573, %3, !dbg !189
br i1 %574, label %.preheader30, label %._crit_edge73, !dbg !189
._crit_edge73: ; preds = %.us-lcssa.us, %._crit_edge83
I checked the binary with objdump. Similar to the .ll file, the innermost loop body is the same while the loop structure is slightly different. I cannot figure out why the performance can differ by 4x. I used this link as an assembly reference.
The binary for the third onnx.Mul is listed here. The offset of main_graph is 6eb0. I annotated some labels in the objdump result to show the loop structure.
For static tensor:
7b6e: c0 e5 ff ff f2 15 brasl %r14,5f98 <InstrumentEntryPoint@plt>
7b74: a7 19 00 00 lghi %r1,0
7b78: a7 29 00 00 lghi %r2,0
7b7c: e3 d0 f0 a0 00 04 lg %r13,160(%r15)
7b82: e3 c0 f1 18 00 04 lg %r12,280(%r15)
7b88: e3 b0 f1 48 00 04 lg %r11,328(%r15)
7b8e: b9 04 00 31 lgr %r3,%r1
7b92: a7 49 00 00 lghi %r4,0
7b96: a7 59 ff ff lghi %r5,-1 //<main_graph+0xce6>
7b9a: b9 04 00 e3 lgr %r14,%r3
<main_graph+0xcee>: // innermost loop
7b9e: 78 0e b0 00 le %f0,0(%r14,%r11)
7ba2: ed 0e c0 00 00 17 meeb %f0,0(%r14,%r12)
7ba8: 70 0e d0 00 ste %f0,0(%r14,%r13)
7bac: 41 50 50 01 la %r5,1(%r5)
7bb0: 41 e0 e0 04 la %r14,4(%r14)
7bb4: ec 54 ff f5 cf 7d clgijl %r5,207,7b9e <main_graph+0xcee>
7bba: c2 4e 00 00 00 cf clgfi %r4,207
7bc0: 41 40 40 01 la %r4,1(%r4)
7bc4: 41 30 33 40 la %r3,832(%r3)
7bc8: a7 44 ff e7 jl 7b96 <main_graph+0xce6>
7bcc: c2 2e 00 00 00 1f clgfi %r2,31
7bd2: 41 20 20 01 la %r2,1(%r2)
7bd6: e3 10 14 00 2a 71 lay %r1,173056(%r1)
7bdc: a7 44 ff d9 jl 7b8e <main_graph+0xcde>
7be0: b9 04 00 2b lgr %r2,%r11
7be4: c0 e5 ff ff f8 3a brasl %r14,6c58 <free@plt>
7bea: b9 04 00 2c lgr %r2,%r12
7bee: c0 e5 ff ff f8 35 brasl %r14,6c58 <free@plt>
7bf4: a7 29 00 02 lghi %r2,2
7bf8: a7 39 00 00 lghi %r3,0
7bfc: c0 e5 ff ff f1 ce brasl %r14,5f98 <InstrumentEntryPoint@plt>
For dynamic tensors:
86f4: c0 e5 ff ff ec 52 brasl %r14,5f98 <InstrumentEntryPoint@plt>
86fa: b9 04 00 2d lgr %r2,%r13
86fe: c0 e5 ff ff f0 3d brasl %r14,6778 <malloc@plt>
8704: e3 20 f1 58 00 24 stg %r2,344(%r15)
870a: e5 58 f0 c8 00 00 cghsi 200(%r15),0
8710: a7 c4 00 90 jle 8830 <main_graph+0x1980>
8714: e3 10 f0 d8 00 04 lg %r1,216(%r15)
871a: eb 01 00 07 00 0d sllg %r0,%r1,7
8720: e3 00 f1 60 00 24 stg %r0,352(%r15)
8726: eb 11 00 02 00 0d sllg %r1,%r1,2
872c: e3 20 f1 80 00 04 lg %r2,384(%r15)
8732: eb 22 00 02 00 0d sllg %r2,%r2,2
8738: a7 59 00 00 lghi %r5,0
873c: e3 80 f1 70 00 24 stg %r8,368(%r15)
8742: e3 00 f1 58 00 04 lg %r0,344(%r15)
8748: a7 f4 00 24 j 8790 <main_graph+0x18e0>
874c: e3 50 f1 68 00 04 lg %r5,360(%r15)
8752: 41 50 50 01 la %r5,1(%r5)
8756: e3 40 f1 60 00 04 lg %r4,352(%r15)
875c: b9 08 00 04 agr %r0,%r4
8760: e3 30 f1 78 00 04 lg %r3,376(%r15)
8766: b9 08 00 34 agr %r3,%r4
876a: e3 30 f1 78 00 24 stg %r3,376(%r15)
8770: e3 30 f1 70 00 04 lg %r3,368(%r15)
8776: b9 08 00 34 agr %r3,%r4
877a: e3 30 f1 70 00 24 stg %r3,368(%r15)
8780: e3 50 f0 c8 00 20 cg %r5,200(%r15)
8786: e3 70 f1 40 00 04 lg %r7,320(%r15)
878c: a7 a4 00 52 jhe 8830 <main_graph+0x1980>
8790: e3 50 f1 68 00 24 stg %r5,360(%r15)
8796: ec 7c ff db 00 7c cgijnh %r7,0,874c <main_graph+0x189c>
879c: a7 99 00 00 lghi %r9,0
87a0: e3 e0 f1 70 00 04 lg %r14,368(%r15)
87a6: e3 d0 f1 78 00 04 lg %r13,376(%r15)
87ac: b9 04 00 50 lgr %r5,%r0
87b0: a7 f4 00 0f j 87ce <main_graph+0x191e>
87b4: b9 08 00 51 agr %r5,%r1
87b8: b9 08 00 d1 agr %r13,%r1
87bc: b9 08 00 e1 agr %r14,%r1
87c0: c2 9e 00 00 00 1f clgfi %r9,31
87c6: 41 90 90 01 la %r9,1(%r9)
87ca: a7 a4 ff c1 jhe 874c <main_graph+0x189c>
87ce: e5 58 f1 80 00 00 cghsi 384(%r15),0
87d4: a7 c4 ff f0 jle 87b4 <main_graph+0x1904>
87d8: a7 79 00 00 lghi %r7,0
87dc: b9 04 00 6e lgr %r6,%r14
87e0: b9 04 00 bd lgr %r11,%r13
87e4: b9 04 00 a5 lgr %r10,%r5
87e8: a7 c9 00 00 lghi %r12,0
87ec: a7 49 00 00 lghi %r4,0
87f0: e3 30 f1 80 00 04 lg %r3,384(%r15)
<main_graph+0x1946>: // innermost loop
87f6: 78 0c 60 00 le %f0,0(%r12,%r6)
87fa: ed 0c b0 00 00 17 meeb %f0,0(%r12,%r11)
8800: 70 0c a0 00 ste %f0,0(%r12,%r10)
8804: 41 40 40 01 la %r4,1(%r4)
8808: 41 c0 c0 04 la %r12,4(%r12)
880c: ec 43 ff f5 40 64 cgrjl %r4,%r3,87f6 <main_graph+0x1946>
8812: 41 70 70 01 la %r7,1(%r7)
8816: b9 08 00 a2 agr %r10,%r2
881a: b9 08 00 b2 agr %r11,%r2
881e: b9 08 00 62 agr %r6,%r2
8822: e3 70 f1 40 00 20 cg %r7,320(%r15)
8828: a7 44 ff e0 jl 87e8 <main_graph+0x1938>
882c: a7 f4 ff c4 j 87b4 <main_graph+0x1904>
8830: e3 20 f1 38 00 04 lg %r2,312(%r15)
8836: c0 e5 ff ff f2 11 brasl %r14,6c58 <free@plt>
883c: b9 04 00 28 lgr %r2,%r8
8840: c0 e5 ff ff f2 0c brasl %r14,6c58 <free@plt>
8846: a7 29 00 02 lghi %r2,2
884a: a7 39 00 00 lghi %r3,0
884e: c0 e5 ff ff eb a5 brasl %r14,5f98 <InstrumentEntryPoint@plt>
Tong, I checked with ResNet50 and saw the same issue for dynamic dimensions, but we had some good news at least.
onnx-mlir resnet50.onnx --shapeInformation='0:=-1,-1,224,224' => worked well
onnx-mlir resnet50.onnx --shapeInformation='0:1,3,-1,224' => same issue as Yolo
onnx-mlir resnet50.onnx --shapeInformation='0:1,3,224,-1' => same issue as Yolo
So I suppose that there is something wrong with the lowering for Conv, in particular in handling dynamic dimensions for the image height and width. Perhaps it wrongly computes loop bounds (too-big values would cause an infinite run, or invalid values would cause a segfault).
Our end-to-end test, make check-onnx-backend-dynamic, only checks a dynamic dimension for the batch size, not for height or width, so it did not catch this Yolo situation.
Given that assumption, I edited the numerical test for Conv to check a dynamic image height, and it seems we do have problems there. See PR #764. I believe we should fix the issue with Conv before debugging Yolo further.
I think I found at least one issue: the shape inference for Conv is wrong when the shape is not static. First, the code that inserts the allocation for Conv is wrong: when the shape is not constant, the input shape is always copied. We need to use IndexExpr to calculate the output shape. @tungld, could you fix this issue since you wrote the shape inference for Conv? @AlexandreEichenberger, could you help Tung in using IndexExpr if needed?
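For reference, the output spatial size cannot be a copy of the input size; it follows the standard ONNX Conv formula, which with dynamic dims has to be evaluated at runtime (this is what IndexExpr is used for). A hedged C sketch, assuming auto_pad=NOTSET:

```c
/* Standard ONNX Conv output size for one spatial axis (auto_pad = NOTSET);
 * with a dynamic in_dim this must be computed at runtime, not copied from
 * the input shape. */
long conv_out_dim(long in_dim, long kernel, long stride, long dilation,
                  long pad_begin, long pad_end) {
  return (in_dim + pad_begin + pad_end - dilation * (kernel - 1) - 1) / stride + 1;
}
```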
@chentong319 @AlexandreEichenberger I used IndexExpr to infer shape for Conv. PR #764 is ready to review now.
I tested with Yolo. Inference in the dynamic-dims case took 79 seconds on my local machine, and memory consumption was about 1.7 GB. We still need better buffer management for dynamic tensors in the future.
@tungld so you verified the output in the dyn case?
@tungld so you verified the output in the dyn case?
I only measured the inference time and memory consumption in the dynamic case, but did not verify the output result since I didn't have the reference output for the model.
@tungld Where should I put the expected output? Is your script good enough to be put into the git repository? I ran Yolo with the current master branch and got the same execution time as the static version.
@chentong319 how about the same folder where you put the Yolo model?
Is your script good enough to be put into the git repository?
It's a small python script. I will polish it a bit then put it into the git repos.
Compiling with --mtriple=s390x-ibm-linux --mcpu=z14 helps reduce the inference time from 80s to 54s.
@tungld Cool! Do we know what kind of optimization is enabled by these flags? Simdization?
It's worse than no SIMD and no unrolling; it results in performance that is akin to no optimizations. See the GEMM example below.
@chentong319 maybe just post the latest time with static vs. dynamic, and if it is satisfactory, let's close this one. Working on making conv go faster is a separate issue.
The argument tensor of the Yolo model has shape tensor<?x?x?x3xf32> (called the dynamic version here). If we change the shape of the argument to the shape of the input tensor used for inference, tensor<1x416x416x3xf32>, the performance (called the static version) is much better (more than 100x). I did some investigation into this issue. First, I added instrumentation for the popular ops in the Yolo model: Conv and Mul. Here are the observations:
The first difference between the dynamic and static versions is memory usage. In the static version, the virtual memory size is around 400 MB for different versions of the model (I tried removing some operations). For the dynamic model, memory usage increases with the number of ops; with only 1/3 of the operations of the original model, the memory size is 22 GB. I am not clear about the impact on performance: cold cache misses, unnecessary cache write-backs, and even extra page faults. Can we run some tool to get the details?
For the operations, I focused on Mul for now, which is quite simple. In the EmitLLVMIR output, the major difference is the extra get-dim-size operations inside the loop body. If the compiler does not know the alias info, it is possible that these operations are not moved outside of the loop, introducing extra load operations. In the next step, we need to check the bc file to verify what the final code is. How do we generate the bc file?
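A small C sketch (hypothetical names, not the actual lowering) of this get-dim-size concern: if the compiler cannot prove that the output buffer never overlaps the shape descriptor, the dim load may stay inside the loop; hoisting it into a local removes the reload regardless of aliasing:

```c
struct memref_desc { float *data; long dims[4]; };

/* The bound desc->dims[3] may be reloaded every iteration if `out` could
 * alias the descriptor memory. */
void mul_inner(float *out, const float *a, const float *b,
               const struct memref_desc *desc) {
  for (long i = 0; i < desc->dims[3]; i++)
    out[i] = a[i] * b[i];
}

/* Hoisting the dim into a local makes the trip count loop-invariant. */
void mul_inner_hoisted(float *out, const float *a, const float *b,
                       const struct memref_desc *desc) {
  const long n = desc->dims[3];
  for (long i = 0; i < n; i++)
    out[i] = a[i] * b[i];
}
```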