onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0

Performance issue for yolo model with unknown size #753

Open chentong319 opened 3 years ago

chentong319 commented 3 years ago

The argument tensor of the yolo model has shape tensor<?x?x?x3xf32> (called the dynamic version here). If we change the argument's shape to that of the input tensor used for inference, tensor<1x416x416x3xf32> (called the static version), performance is much better (more than 100x). I did some investigation into this issue. First, I added instrumentation for the most frequent ops in the yolo model, Conv and Mul. Here are the observations:

  1. Instrumentation overhead is small enough for this investigation.
  2. Execution time for each op shows no significant variation from run to run.
  3. For the first 6 instrumented ops (in the sequence Conv, Mul, Conv, Mul, Conv, Mul), the slowdowns are 0.99, 1.08, 0.75, 7.85, 6.19, 6.26. In this test, the model is trimmed down to the third Mul op.
  4. For longer sequences of ops, the performance of the first 6 remains the same; the tensor sizes gradually decrease and the slowdown increases to 100x.

The first difference between the dynamic and static versions is memory usage. In the static version, the virtual memory size is around 400 MB for different versions of the model (I tried removing some operations). For the dynamic model, memory usage increases with the number of ops: with only 1/3 of the operations in the original model, the memory size is 22 GB. I am not clear about the impact on performance: cold cache misses, unnecessary cache write-backs, and even extra page faults. Can we run some tool to get the details?
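To get a first read on this (a sketch assuming a Linux host with perf installed; the driver binary names are placeholders), we could compare hardware counters between the two builds:

    # Hypothetical invocation: compare cache and paging behavior of the
    # static and dynamic builds; the binary names are placeholders.
    perf stat -e cache-references,cache-misses,page-faults,minor-faults ./run_static
    perf stat -e cache-references,cache-misses,page-faults,minor-faults ./run_dynamic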

For the operations, I focused on Mul for now, which is quite simple. In the EmitLLVMIR output, the major difference is the extra get-dim-size operations inside the loop body. If the compiler does not have alias information, these operations may not be hoisted out of the loop, introducing extra load operations. As a next step, we need to check the .bc file to verify what the final code is. How do we generate the .bc file?
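To illustrate the aliasing concern above in C terms (a hypothetical sketch, not the actual lowering): without alias information, the compiler must assume a store through the output pointer could modify the memory holding the dynamic sizes, so the size loads stay inside the loop.

    #include <stddef.h>

    /* Without restrict, out might alias dims, so the bound dims[0] may be
       reloaded from memory on every iteration. */
    void mul_dyn(float *out, const float *a, const float *b, const size_t *dims) {
        for (size_t i = 0; i < dims[0]; i++)
            out[i] = a[i] * b[i];
    }

    /* With restrict, out cannot alias dims; the size load is hoisted. */
    void mul_noalias(float *restrict out, const float *a, const float *b,
                     const size_t *restrict dims) {
        size_t n = dims[0];    /* loaded once, outside the loop */
        for (size_t i = 0; i < n; i++)
            out[i] = a[i] * b[i];
    }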

AlexandreEichenberger commented 3 years ago

@chentong319 Thanks for this very detailed investigation.

Just to be clear, the slowdowns above compare the current implementation with dynamic vs. static shapes, right?

It would also be interesting to learn why the later ops get increasingly slower. Are they increasingly larger?

From having looked at assembly, even the simplest loop:

    for (int i = 0; i < 64; i++) a[i] = b[i] + c[i];

if unrolled by the backend, shows literally no scheduling of the reads (b and c) past any write (a), because memory disambiguation is quite broken once we are out of MLIR. For this particular example, the only way to move reads past writes was to have MLIR unroll the loop.

With static sizes, we agree that memory bundling can reuse the memory used for temp tensors quite well. Even if this is hard for arbitrary dynamic sizes, placing malloc/free just before/after every op should technically have nearly the same footprint as the static version (at least the same order of magnitude), and I don't believe the overhead of malloc/free should be too high when the computations (such as matmul/conv) are nontrivial, as those computations are at least O(N^3) compared to the O(1) cost of malloc/free.

Right now, our policy is to place malloc/free at the beginning/end of the function. That could be parametrized to before/after each op in the presence of dynamic tensors; or malloc/free could be moved toward the begin/end of the function vs. the begin/end of the operation depending on whether tensors are static or dynamic.
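As a rough sketch of the two placement policies (hypothetical pseudocode, not the actual buffer pass; op bodies and size computations are placeholders):

    #include <stdlib.h>

    void conv_op(float *out);
    void mul_op(const float *in, float *out);

    /* Policy A (current): all buffers allocated at function entry, freed at
       exit. Bundling/reuse works well when all sizes are static. */
    void graph_bundled(size_t conv_bytes, size_t mul_bytes) {
        float *t0 = malloc(conv_bytes);
        float *t1 = malloc(mul_bytes);
        conv_op(t0);
        mul_op(t0, t1);
        free(t0);
        free(t1);
    }

    /* Policy B (proposed for dynamic shapes): free each buffer as soon as it
       is dead, keeping the live footprint close to the static case. */
    void graph_per_op(size_t conv_bytes, size_t mul_bytes) {
        float *t0 = malloc(conv_bytes);
        conv_op(t0);
        float *t1 = malloc(mul_bytes);
        mul_op(t0, t1);
        free(t0);   /* t0 is dead as soon as mul consumes it */
        free(t1);   /* the final result would be freed by the caller */
    }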

tungld commented 3 years ago

Hi Tong,

For the operations, I focused on Mul for now, which is quite simple.

For unknown dims, Mul needs more operations to check broadcasting.
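For illustration, a 1-D sketch in C of the extra per-element work (hypothetical, not the actual generated code):

    #include <stddef.h>

    /* When a_dim and b_dim are unknown at compile time, each element needs
       selects to decide whether an operand is broadcast (dim == 1). */
    void mul_broadcast(float *out, const float *a, const float *b,
                       size_t a_dim, size_t b_dim) {
        size_t out_dim = a_dim > b_dim ? a_dim : b_dim;
        for (size_t i = 0; i < out_dim; i++) {
            size_t ia = (a_dim == 1) ? 0 : i;  /* folds away if a_dim is static */
            size_t ib = (b_dim == 1) ? 0 : i;  /* folds away if b_dim is static */
            out[i] = a[ia] * b[ib];
        }
    }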

As a next step, we need to check the .bc file to verify what the final code is. How do we generate the .bc file?

There are two ways:

  1. Invoke onnx-mlir test.onnx --preserveBitcode, and you will get bitcode files with and without optimizations.
  2. Generate it starting from the output of EmitLLVMIR:
    $ mlir-translate test.mlir -mlir-to-llvmir -o test.ll
    $ llvm-as test.ll -o test.bc

But you cannot read test.bc directly, since it is not a human-readable file. To get human-readable output from the bitcode, use llvm-dis test.bc -o test.dis

mlir-translate, llvm-as and llvm-dis are tools in LLVM (in the same folder as FileCheck).

chentong319 commented 3 years ago

@tungld Thanks. I used the --preserveBitcode flag and llvm-dis to get the .ll code. My previous concern about the load of the dim size in the dynamic version is not an issue: LLVM moved that load out of the loop. The broadcast check is there for the dynamic version, but the load-and-multiply part is the same in both the dynamic and static versions. For this model, if MLIR provided support for symbolic dim sizes, the broadcast check could be eliminated, since the shapes of the two inputs come from the same tensor.

chentong319 commented 3 years ago

One concern for dynamic tensors is that their space cannot be reused. If we bundle them, the total space might be huge. When a memref is generated, memref.alloc is inserted. However, the dealloc is not added unless the output is not used. In general, the dealloc of a memref should be added at the end of its live range, which cannot be done locally with one op. Currently, we only have one free at the end of the program for the bundled space. We need to enhance the memory management code for dynamic shapes.

AlexandreEichenberger commented 3 years ago

For unknown dims, Mul needs more operations to check broadcasting.

@tungld Interesting: so even if we know the ranks, we still need to test whether a dim is != 1, as that would still force broadcasting, isn't it? This speaks in favor of an N>1 dynamic number: for example, a model would typically know that the "sizes" of an image are runtime values but would not invoke "broadcasting" of pixels across those runtime image sizes, if you see what I mean...

tungld commented 3 years ago

@tungld Interesting: so even if we know the ranks, we still need to test whether a dim is != 1, as that would still force broadcasting, isn't it?

Right.

This speaks in favor of an N>1 dynamic number: for example, a model would typically know that the "sizes" of an image are runtime values but would not invoke "broadcasting" of pixels across those runtime image sizes, if you see what I mean...

I see. We can make the handling of broadcasting a bit smarter here.

In general, the dealloc of a memref should be added at the end of its live range, which cannot be done locally with one op.

@chentong319 if you want to quickly test this to see any performance change, you can use an MLIR pass, buffer-deallocation. It automatically inserts dealloc just at the end of the live range.

AlexandreEichenberger commented 3 years ago

@tungld thanks for the suggestion; let's try existing code first before modifying our own.

@chentong319 if you want to quickly test this to see any performance change, you can use an MLIR pass, buffer-deallocation. It automatically inserts dealloc just at the end of the live range.

chentong319 commented 3 years ago

@tungld Excellent suggestion. I did the experiment in the following steps:

# v6.mlir is a dynamic version with 25 Conv
Debug/bin/onnx-mlir --EmitONNXIR v6.mlir
cp v6.onnx.mlir v6-dealloc-1.mlir
# Skip our own buffer-related transformations
Debug/bin/onnx-mlir-opt --convert-onnx-to-krnl v6-dealloc-1.mlir > v6-dealloc-2.mlir
Debug/bin/onnx-mlir-opt -canonicalize --convert-krnl-to-affine v6-dealloc-2.mlir > v6-dealloc-3.mlir

# delete our own dealloc before return
vi v6-dealloc-3.mlir
~/llvm-project/build/bin/mlir-opt -allow-unregistered-dialect --buffer-deallocation v6-dealloc-3.mlir > v6-dealloc-4.mlir
Debug/bin/onnx-mlir --preserveBitcode v6-dealloc-4.mlir

The result is that there is no significant change in overall execution time; the slowdown comes from other factors. I will check the impact if code is generated with no broadcast assumption.

AlexandreEichenberger commented 3 years ago

@chentong319 great, looking forward to hearing what we get now. This pass can be run in all cases after bundling terminates, in case there are un-bundled malloc/free pairs.

chentong319 commented 3 years ago

Some corrections on the result using the MLIR buffer-deallocation pass. Even though there is no significant change in overall execution time, there is a notable change in execution time for the onnx.Mul op. Below is a comparison of execution times for the first 22 Mul operations. The baseline is the dynamic version compiled with onnx-mlir.

    onnx.Mul   use dealloc   static
       1       1.118986        1.40364468
       2       1.14244059      7.32108377
       3       0.81698702      5.14030296
       4       0.91209709      6.43561899
       5       1.27468225      6.96805211
       6       1.34997306      9.07578612
       7       1.74610699     11.1936703
       8       1.36749926     42.7274698
       9       1.49653903     28.9776616
      10       1.66238222     39.7137453
      11       1.45076816     39.4212524
      12       1.48723424     40.8446663
      13       1.88158898     42.286802
      14       1.38741428     37.8878981
      15       1.51278562     40.3782091
      16       1.53740106     44.6491948
      17       1.45878552     43.0196265
      18       1.51503751    174.397863
      19       1.55616374    181.753731
      20       1.57064024    172.387887
      21       1.50640728    180.047742
      22       1.72477148    128.940577
tungld commented 3 years ago

Inspired by this issue, I created PR #758 to allow passing static shapes to a dynamic model to avoid the performance overhead. It may be used to test static/dynamic models.

chentong319 commented 3 years ago

I further investigated the impact of broadcasting and found that the code for possible broadcasting is NOT the major source of slowdown: execution time for onnx.Mul is reduced by only about 15% on average. With dynamic shapes, whether broadcasting occurs cannot be determined at compile time, so onnx-mlir generates code with selection operations. However, no broadcasting occurs in this model. We can analyze whether the shapes of two tensors are the same through data-flow; for this investigation, I simply changed the code by hand.

To me, the differences between the .ll file for dynamic shapes without broadcasting and the one for static shapes are in the loop bounds only. The static shape is 1x32x208x208 while the dynamic shape is ?x32x?x?. The loop over the dimension of size 1 is removed, and with a constant upper bound the bound check is moved to the end of the loop body. It seems to me that we have to look into what further optimization happens from the bitcode to the binary.

The .ll file for the static shape is given below:

  br label %.preheader27, !dbg !176

.preheader27:                                     ; preds = %.preheader28, %465
  %446 = phi i64 [ 0, %.preheader28 ], [ %466, %465 ]
  %447 = mul nuw nsw i64 %446, 43264
  br label %.preheader, !dbg !177

.preheader:                                       ; preds = %.preheader27, %462
  %448 = phi i64 [ 0, %.preheader27 ], [ %463, %462 ]
  %449 = mul nuw nsw i64 %448, 208
  %450 = add nuw nsw i64 %447, %449
  br label %451, !dbg !178

451:                                              ; preds = %.preheader, %451
  %452 = phi i64 [ 0, %.preheader ], [ %460, %451 ]
  %453 = add nuw nsw i64 %450, %452, !dbg !179
  %454 = getelementptr float, float* %53, i64 %453, !dbg !179
  %455 = load float, float* %454, align 4, !dbg !179
  %456 = getelementptr float, float* %51, i64 %453, !dbg !180
  %457 = load float, float* %456, align 4, !dbg !180
  %458 = fmul float %455, %457, !dbg !181
  %459 = getelementptr float, float* %50, i64 %453, !dbg !182
  store float %458, float* %459, align 4, !dbg !182
  %460 = add nuw nsw i64 %452, 1, !dbg !178
  %461 = icmp ult i64 %452, 207, !dbg !178
  br i1 %461, label %451, label %462, !dbg !178

462:                                              ; preds = %451
  %463 = add nuw nsw i64 %448, 1, !dbg !177
  %464 = icmp ult i64 %448, 207, !dbg !177
  br i1 %464, label %.preheader, label %465, !dbg !177

465:                                              ; preds = %462
  %466 = add nuw nsw i64 %446, 1, !dbg !176
  %467 = icmp ult i64 %446, 31, !dbg !176
  br i1 %467, label %.preheader27, label %468, !dbg !176

468: 
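For reference, the constants above match the static 1x32x208x208 shape: 43264 = 208 x 208 is the per-channel plane stride, 208 is the row stride, and the trip-count checks against 207 and 31 correspond to the 208 and 32 extents (the batch loop of size 1 is gone).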

The .ll file for the dynamic shape without broadcasting is given below:

  %548 = tail call i8* @malloc(i64 %57), !dbg !188
  %549 = bitcast i8* %548 to float*, !dbg !188
  br i1 %20, label %.preheader30, label %._crit_edge73, !dbg !189

.preheader30:                                     ; preds = %._crit_edge83, %.us-lcssa.us
  %550 = phi i64 [ %573, %.us-lcssa.us ], [ 0, %._crit_edge83 ]
  %551 = mul i64 %550, %54
  br i1 %60, label %.preheader29.us, label %.us-lcssa.us, !dbg !190

.preheader29.us:                                  ; preds = %.preheader30, %._crit_edge67.us
  %552 = phi i64 [ %555, %._crit_edge67.us ], [ 0, %.preheader30 ]
  %553 = mul i64 %552, %12
  %554 = add i64 %553, %551
  br i1 %19, label %.preheader.us.us, label %._crit_edge67.us, !dbg !191

._crit_edge67.us:                                 ; preds = %._crit_edge.us.us, %.preheader29.us
  %555 = add nuw nsw i64 %552, 1, !dbg !190
  %556 = icmp ult i64 %552, 31, !dbg !190
  br i1 %556, label %.preheader29.us, label %.us-lcssa.us, !dbg !190

.preheader.us.us:                                 ; preds = %.preheader29.us, %._crit_edge.us.us
  %557 = phi i64 [ %560, %._crit_edge.us.us ], [ 0, %.preheader29.us ]
  %558 = mul i64 %557, %5
  %559 = add i64 %554, %558
  br label %562, !dbg !192

._crit_edge.us.us:                                ; preds = %562
  %560 = add nuw nsw i64 %557, 1, !dbg !191
  %561 = icmp slt i64 %560, %4, !dbg !191
  br i1 %561, label %.preheader.us.us, label %._crit_edge67.us, !dbg !191

562:                                              ; preds = %562, %.preheader.us.us
  %563 = phi i64 [ 0, %.preheader.us.us ], [ %571, %562 ]
  %564 = add i64 %559, %563, !dbg !193
  %565 = getelementptr float, float* %430, i64 %564, !dbg !193
  %566 = load float, float* %565, align 4, !dbg !193
  %567 = getelementptr float, float* %546, i64 %564, !dbg !194
  %568 = load float, float* %567, align 4, !dbg !194
  %569 = fmul float %566, %568, !dbg !195
  %570 = getelementptr float, float* %549, i64 %564, !dbg !196
  store float %569, float* %570, align 4, !dbg !196
  %571 = add nuw nsw i64 %563, 1, !dbg !192
  %572 = icmp slt i64 %571, %5, !dbg !192
  br i1 %572, label %562, label %._crit_edge.us.us, !dbg !192

.us-lcssa.us:                                     ; preds = %._crit_edge67.us, %.preheader30
  %573 = add nuw nsw i64 %550, 1, !dbg !189
  %574 = icmp slt i64 %573, %3, !dbg !189
  br i1 %574, label %.preheader30, label %._crit_edge73, !dbg !189

._crit_edge73:                                    ; preds = %.us-lcssa.us, %._crit_edge83
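Reading the IR above: in the dynamic version the loop bounds are runtime values (%3, %4, %5), each loop level is guarded by a trip-count check (%19, %20, %60) in case a dimension is zero, and the result buffer comes from a fresh malloc rather than a statically bundled allocation.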
chentong319 commented 3 years ago

I checked the binary with objdump. As in the .ll files, the innermost loop body is the same while the loop structure is slightly different; I cannot figure out why the performance differs by 4x. I used this link for the assembly reference. The binary for the third onnx.Mul is listed here. The offset of main_graph is 6eb0. I annotated some labels in the objdump output to show the loop structure. For the static tensor:

    7b6e:   c0 e5 ff ff f2 15   brasl   %r14,5f98 <InstrumentEntryPoint@plt>
    7b74:   a7 19 00 00         lghi    %r1,0
    7b78:   a7 29 00 00         lghi    %r2,0
    7b7c:   e3 d0 f0 a0 00 04   lg  %r13,160(%r15)
    7b82:   e3 c0 f1 18 00 04   lg  %r12,280(%r15)
    7b88:   e3 b0 f1 48 00 04   lg  %r11,328(%r15)
    7b8e:   b9 04 00 31         lgr %r3,%r1
    7b92:   a7 49 00 00         lghi    %r4,0
    7b96:   a7 59 ff ff         lghi    %r5,-1    //<main_graph+0xce6>
    7b9a:   b9 04 00 e3         lgr %r14,%r3

   <main_graph+0xcee>:  // innermost loop
    7b9e:   78 0e b0 00         le  %f0,0(%r14,%r11) 
    7ba2:   ed 0e c0 00 00 17   meeb    %f0,0(%r14,%r12)
    7ba8:   70 0e d0 00         ste %f0,0(%r14,%r13)
    7bac:   41 50 50 01         la  %r5,1(%r5)
    7bb0:   41 e0 e0 04         la  %r14,4(%r14)
    7bb4:   ec 54 ff f5 cf 7d   clgijl  %r5,207,7b9e <main_graph+0xcee>

    7bba:   c2 4e 00 00 00 cf   clgfi   %r4,207
    7bc0:   41 40 40 01         la  %r4,1(%r4)
    7bc4:   41 30 33 40         la  %r3,832(%r3)
    7bc8:   a7 44 ff e7         jl  7b96 <main_graph+0xce6>
    7bcc:   c2 2e 00 00 00 1f   clgfi   %r2,31
    7bd2:   41 20 20 01         la  %r2,1(%r2)
    7bd6:   e3 10 14 00 2a 71   lay %r1,173056(%r1)
    7bdc:   a7 44 ff d9         jl  7b8e <main_graph+0xcde>
    7be0:   b9 04 00 2b         lgr %r2,%r11
    7be4:   c0 e5 ff ff f8 3a   brasl   %r14,6c58 <free@plt>
    7bea:   b9 04 00 2c         lgr %r2,%r12
    7bee:   c0 e5 ff ff f8 35   brasl   %r14,6c58 <free@plt>
    7bf4:   a7 29 00 02         lghi    %r2,2
    7bf8:   a7 39 00 00         lghi    %r3,0
    7bfc:   c0 e5 ff ff f1 ce   brasl   %r14,5f98 <InstrumentEntryPoint@plt>

For dynamic tensors:

    86f4:   c0 e5 ff ff ec 52   brasl   %r14,5f98 <InstrumentEntryPoint@plt>
    86fa:   b9 04 00 2d         lgr %r2,%r13
    86fe:   c0 e5 ff ff f0 3d   brasl   %r14,6778 <malloc@plt>
    8704:   e3 20 f1 58 00 24   stg %r2,344(%r15)
    870a:   e5 58 f0 c8 00 00   cghsi   200(%r15),0
    8710:   a7 c4 00 90         jle 8830 <main_graph+0x1980>
    8714:   e3 10 f0 d8 00 04   lg  %r1,216(%r15)
    871a:   eb 01 00 07 00 0d   sllg    %r0,%r1,7
    8720:   e3 00 f1 60 00 24   stg %r0,352(%r15)
    8726:   eb 11 00 02 00 0d   sllg    %r1,%r1,2
    872c:   e3 20 f1 80 00 04   lg  %r2,384(%r15)
    8732:   eb 22 00 02 00 0d   sllg    %r2,%r2,2
    8738:   a7 59 00 00         lghi    %r5,0
    873c:   e3 80 f1 70 00 24   stg %r8,368(%r15)
    8742:   e3 00 f1 58 00 04   lg  %r0,344(%r15)
    8748:   a7 f4 00 24         j   8790 <main_graph+0x18e0>
    874c:   e3 50 f1 68 00 04   lg  %r5,360(%r15)
    8752:   41 50 50 01         la  %r5,1(%r5)
    8756:   e3 40 f1 60 00 04   lg  %r4,352(%r15)
    875c:   b9 08 00 04         agr %r0,%r4
    8760:   e3 30 f1 78 00 04   lg  %r3,376(%r15)
    8766:   b9 08 00 34         agr %r3,%r4
    876a:   e3 30 f1 78 00 24   stg %r3,376(%r15)
    8770:   e3 30 f1 70 00 04   lg  %r3,368(%r15)
    8776:   b9 08 00 34         agr %r3,%r4
    877a:   e3 30 f1 70 00 24   stg %r3,368(%r15)
    8780:   e3 50 f0 c8 00 20   cg  %r5,200(%r15)
    8786:   e3 70 f1 40 00 04   lg  %r7,320(%r15)
    878c:   a7 a4 00 52         jhe 8830 <main_graph+0x1980>
    8790:   e3 50 f1 68 00 24   stg %r5,360(%r15)
    8796:   ec 7c ff db 00 7c   cgijnh  %r7,0,874c <main_graph+0x189c>
    879c:   a7 99 00 00         lghi    %r9,0
    87a0:   e3 e0 f1 70 00 04   lg  %r14,368(%r15)
    87a6:   e3 d0 f1 78 00 04   lg  %r13,376(%r15)
    87ac:   b9 04 00 50         lgr %r5,%r0
    87b0:   a7 f4 00 0f         j   87ce <main_graph+0x191e>
    87b4:   b9 08 00 51         agr %r5,%r1
    87b8:   b9 08 00 d1         agr %r13,%r1
    87bc:   b9 08 00 e1         agr %r14,%r1
    87c0:   c2 9e 00 00 00 1f   clgfi   %r9,31
    87c6:   41 90 90 01         la  %r9,1(%r9)
    87ca:   a7 a4 ff c1         jhe 874c <main_graph+0x189c>
    87ce:   e5 58 f1 80 00 00   cghsi   384(%r15),0
    87d4:   a7 c4 ff f0         jle 87b4 <main_graph+0x1904>
    87d8:   a7 79 00 00         lghi    %r7,0
    87dc:   b9 04 00 6e         lgr %r6,%r14
    87e0:   b9 04 00 bd         lgr %r11,%r13
    87e4:   b9 04 00 a5         lgr %r10,%r5
    87e8:   a7 c9 00 00         lghi    %r12,0
    87ec:   a7 49 00 00         lghi    %r4,0
    87f0:   e3 30 f1 80 00 04   lg  %r3,384(%r15)

   <main_graph+0x1946>: // innermost loop
    87f6:   78 0c 60 00         le  %f0,0(%r12,%r6)
    87fa:   ed 0c b0 00 00 17   meeb    %f0,0(%r12,%r11)
    8800:   70 0c a0 00         ste %f0,0(%r12,%r10)
    8804:   41 40 40 01         la  %r4,1(%r4)
    8808:   41 c0 c0 04         la  %r12,4(%r12)
    880c:   ec 43 ff f5 40 64   cgrjl   %r4,%r3,87f6 <main_graph+0x1946>

    8812:   41 70 70 01         la  %r7,1(%r7)
    8816:   b9 08 00 a2         agr %r10,%r2
    881a:   b9 08 00 b2         agr %r11,%r2
    881e:   b9 08 00 62         agr %r6,%r2
    8822:   e3 70 f1 40 00 20   cg  %r7,320(%r15)
    8828:   a7 44 ff e0         jl  87e8 <main_graph+0x1938>
    882c:   a7 f4 ff c4         j   87b4 <main_graph+0x1904>
    8830:   e3 20 f1 38 00 04   lg  %r2,312(%r15)
    8836:   c0 e5 ff ff f2 11   brasl   %r14,6c58 <free@plt>
    883c:   b9 04 00 28         lgr %r2,%r8
    8840:   c0 e5 ff ff f2 0c   brasl   %r14,6c58 <free@plt>
    8846:   a7 29 00 02         lghi    %r2,2
    884a:   a7 39 00 00         lghi    %r3,0
    884e:   c0 e5 ff ff eb a5   brasl   %r14,5f98 <InstrumentEntryPoint@plt>
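Two differences stand out in these dumps (my reading of the code above): the static innermost loop compares against an immediate bound (clgijl %r5,207), while the dynamic one compares two registers (cgrjl %r4,%r3) with the bound kept live across iterations, and the dynamic outer loops repeatedly spill and reload loop state to the stack frame (the lg/stg pairs against offsets 344(%r15) through 384(%r15)), presumably due to register pressure from the runtime strides.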
tungld commented 3 years ago

Tong, I checked with ResNet50 and saw the same issue for dynamic dimensions, but there is at least some good news.

So I suppose that there was something wrong with the lowering for Conv, in particular in handling dynamic dimensions for the image height and width. Perhaps it wrongly computed loop bounds (too-big values would cause an infinite run, or invalid values would cause a segfault).

Our end-to-end test, make check-onnx-backend-dynamic, only checks a dynamic dimension for the batch size, not the height or width, so it did not catch this Yolo situation.

Given that assumption, I edited the numerical test for Conv to check a dynamic image height, and it seems we have problems there. See PR #764. I believe we should fix the issue with Conv before debugging Yolo further.

chentong319 commented 3 years ago

I think I found at least one issue: the shape inference for Conv is wrong when the shape is not static. Specifically, the code that inserts the allocation for Conv is wrong: when the shape is not constant, the input shape is always copied. We need to use IndexExpr to calculate the output shape. @tungld, could you fix this issue, since you wrote the shape inference for Conv? @AlexandreEichenberger, could you help Tung with IndexExpr if needed?
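For reference, the spatial output size of Conv follows the standard ONNX formula; with a dynamic input size it must be emitted as runtime arithmetic rather than copied from the input. A sketch of the spec formula (not the actual IndexExpr code):

    /* ONNX Conv output size for one spatial dimension; all operands are
       positive, so C's truncating division matches the spec's floor. */
    long conv_out_dim(long in, long kernel, long pad_begin, long pad_end,
                      long stride, long dilation) {
        return (in + pad_begin + pad_end - dilation * (kernel - 1) - 1) / stride + 1;
    }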

tungld commented 3 years ago

@chentong319 @AlexandreEichenberger I used IndexExpr to infer the shape for Conv. PR #764 is ready for review now.

I tested with Yolo. Inference with dynamic dims took 79 seconds on my local machine. Memory consumption was about 1.7 GB. We still need better buffer management for dynamic tensors in the future.

AlexandreEichenberger commented 3 years ago

@tungld so you verified the output in the dyn case?

tungld commented 3 years ago

@tungld so you verified the output in the dyn case?

I only measured the inference time and memory consumption in the dynamic case; I did not verify the output result since I didn't have a reference output for the model.

chentong319 commented 3 years ago

@tungld Where should I put the expected output? Is your script good enough to be put into the git repository? I ran yolo with the current master branch and got the same execution time as the static version.

tungld commented 3 years ago

@chentong319 how about the same folder you put the yolo model?

Is your script good enough to be put into the git repository?

It's a small Python script. I will polish it a bit and then put it into the git repo.

tungld commented 3 years ago

Compiling with --mtriple=s390x-ibm-linux --mcpu=z14 helps reduce the inference time from 80s to 54s.
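For example (the model filename is a placeholder):

    onnx-mlir --mtriple=s390x-ibm-linux --mcpu=z14 yolo.onnx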

chentong319 commented 3 years ago

@tungld Cool! Do we know what kind of optimization is enabled by these flags? Simdization?

AlexandreEichenberger commented 3 years ago

It's worse than just missing SIMD and unrolling: without these flags, the resulting performance is akin to no optimizations at all. See the GEMM example below.

[GEMM performance chart]

AlexandreEichenberger commented 3 years ago

@chentong319 maybe just post the latest times with static vs. dynamic, and if they are satisfactory, let's close this one. Making conv go faster is being worked on as a separate issue.