jesse-lemurian opened 1 year ago
I'll attach the full profile I took screenshots of (test.zip); it can be unzipped and then viewed at https://profiler.firefox.com/
Are you interested in contributing optimizations to make CONV faster? It needs SIMD and tiling. I started looking at a promising approach that uses an NCHWxC data layout, where a block of the channels C is moved to the innermost dimension of the image so that SIMD code can be generated easily. The data transform operations have been implemented.
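Roughly, the reordering looks like the following minimal sketch (illustrative only, not the actual onnx-mlir transform; the function name and the block size `CB` are made up for this example):

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: reorder NCHW -> NCHWxC ("blocked" channels).
// The C dimension is split into C/CB outer blocks; the CB-sized block
// becomes the innermost dimension so consecutive elements map to SIMD lanes.
// Assumes C is a multiple of CB for simplicity.
std::vector<float> nchwToNCHWxC(const std::vector<float> &src,
                                size_t N, size_t C, size_t H, size_t W,
                                size_t CB) {
  std::vector<float> dst(N * C * H * W);
  size_t CO = C / CB; // number of channel blocks
  for (size_t n = 0; n < N; ++n)
    for (size_t co = 0; co < CO; ++co)
      for (size_t h = 0; h < H; ++h)
        for (size_t w = 0; w < W; ++w)
          for (size_t cb = 0; cb < CB; ++cb) {
            size_t c = co * CB + cb;
            // NCHW source index.
            size_t srcIdx = ((n * C + c) * H + h) * W + w;
            // NCHWxC destination index: [N][C/CB][H][W][CB].
            size_t dstIdx = (((n * CO + co) * H + h) * W + w) * CB + cb;
            dst[dstIdx] = src[srcIdx];
          }
  return dst;
}
```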
Yes, I am interested in contributing optimizations to make CONV faster.
Could you link me to some relevant code that illustrates what an NCHWxC data layout looks like? I'm having trouble visualizing what that means in practice. Links to the 'data transform ops' would also be appreciated.
I'm assuming by 'tiling' you mean processing the inputs to the op in cache-size-friendly blocks? I've done this in another project. It's fiddly, but I could probably make it happen.
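For what it's worth, the kind of blocking I have in mind looks roughly like this (a sketch under my own assumptions, not onnx-mlir code; the tile size `TB` is just a placeholder):

```cpp
#include <algorithm>
#include <cstddef>

// Rough sketch of loop tiling: accumulate C += A * B with A (MxK), B (KxN),
// C (MxN), all row-major and C zero-initialized by the caller. Each set of
// inner loops touches only a TBxTB block of each operand, so the working set
// stays cache resident. TB is an illustrative tile size, not a tuned value.
constexpr size_t TB = 64;

void tiledMatmul(const float *A, const float *B, float *C,
                 size_t M, size_t K, size_t N) {
  for (size_t i0 = 0; i0 < M; i0 += TB)
    for (size_t k0 = 0; k0 < K; k0 += TB)
      for (size_t j0 = 0; j0 < N; j0 += TB)
        for (size_t i = i0; i < std::min(i0 + TB, M); ++i)
          for (size_t k = k0; k < std::min(k0 + TB, K); ++k)
            for (size_t j = j0; j < std::min(j0 + TB, N); ++j)
              C[i * N + j] += A[i * K + k] * B[k * N + j];
}
```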
Do you have any notion of how long implementing either SIMD or tiling might take a first-time contributor? I'm fairly familiar with tiling (assuming we're talking about the same thing), SIMD, and linear algebra, but only passingly familiar with LLVM and MLIR. My background is game engines, if that provides any more color.
In any case thanks for the help :)
Optimizing CNN Model Inference on CPUs: https://www.usenix.org/conference/atc19/presentation/liu-yizhi (PDF: https://www.usenix.org/system/files/atc19-liu-yizhi.pdf)
I have a similar problem with LLM model inference; it is far too slow even though I use the -parallel option.
I wonder if there is a way to add debug info to the lib*.so file, so that I can use perf to find out which operator is the bottleneck?
The Conv is still not optimized; it is a reference implementation, except for the 1x1 Conv, which is implemented via matmul.
Parallelism helps a bit, but the loops are still neither blocked (tiled) nor SIMD-vectorized.
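For readers wondering how a 1x1 Conv turns into a matmul, the equivalence is roughly the following (an illustrative sketch assuming stride 1 and no padding, not the onnx-mlir implementation):

```cpp
#include <cstddef>

// Sketch of why a 1x1 convolution (stride 1, no padding) is just a matmul:
// for each image n, output[oc][h*w] = sum_ic weight[oc][ic] * input[ic][h*w],
// i.e. an (OC x IC) * (IC x H*W) matrix product over the flattened plane.
void conv1x1AsMatmul(const float *input,  // [N][IC][H][W]
                     const float *weight, // [OC][IC]
                     float *output,       // [N][OC][H][W]
                     size_t N, size_t IC, size_t OC, size_t H, size_t W) {
  size_t HW = H * W;
  for (size_t n = 0; n < N; ++n)
    for (size_t oc = 0; oc < OC; ++oc)
      for (size_t p = 0; p < HW; ++p) { // p indexes the flattened H*W plane
        float acc = 0.0f;
        for (size_t ic = 0; ic < IC; ++ic)
          acc += weight[oc * IC + ic] * input[(n * IC + ic) * HW + p];
        output[(n * OC + oc) * HW + p] = acc;
      }
}
```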
Hi onnx-mlir folks! I'm interested in learning more about what might make ResNet inference run faster. I'm currently doing some numerical analysis, and the runtime is starting to become a problem for me.
I've read through the instrumentation guidelines, #753, and #1101
Currently, I'm getting ~2 minutes per inference with ResNet50 on a 2.0 GHz EPYC chip. I'm under the impression that a reasonable inference time to expect on this chip might be a fair bit faster:
https://www.researchgate.net/figure/Inference-time-and-acceleration-for-ResNet-50-on-different-platforms_tbl2_343626375
I've compiled all onnx-mlir dependencies, the PyRuntime, and the onnx model with -O3. After some preliminary profiling, I have been able to discern that the majority of the program runtime is spent in libm, which is probably expected.

For a bit of context, the experiments I'm running are 'hijacking' MatMul and Gemm ops via an Accelerator we wrote, which get redirected at runtime to our own implementations. Confusingly, the routine we're calling is named conv2d, but it's doing a matrix multiply; I inherited this codebase and some of the names are wonky. You can see the conv2d call in the profile above takes 16% of the program runtime.

Anyhow, at a more granular level (of a few hundred milliseconds), this is what the profile looks like:

It clearly shows that whatever is happening inside libm dominates the runtime (in those [unknown] blocks), and our little conv2d routine is minor by comparison.

So, I suppose I have a few questions:

How might I find out what libm is doing? I could always recompile it with debug symbols, but I wanted to ask if there are any tools in onnx-mlir for doing this type of performance analysis, or if anyone has any helpful suggestions before I go there.

Additionally, in #1101 @AlexandreEichenberger mentioned that "we currently don't optimize conv ops". Would someone be able to give me a bit more detail on what that means specifically?
Finally, if my assessment is correct (i.e. there truly are ~1.5 orders of magnitude of runtime on the table), do you think it would be a reasonable ask for a first-time contributor to onnx-mlir (such as myself) to make a sizable dent in that?
What I have not yet tried:
Actually instrumenting the model. While this might be useful in diagnosing the operations it's spending all of the time in, I don't think it tells me much about how to solve the problem.
Experimenting with the suggestions in #753. Given that the runtime I'm seeing is roughly in-line with the runtimes reported in that issue, I have a hunch I may be hitting a similar, or the same, issue.
If these seem like worthwhile activities, or anyone has pointers/ideas about other data points that might be useful, I'm all ears.