onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0

Performance Debugging #2166

Open jesse-lemurian opened 1 year ago

jesse-lemurian commented 1 year ago

Hi onnx-mlir folks! I'm interested in learning more about what might make ResNet run inference faster. I'm currently doing some numerical analysis and the runtime is starting to become a problem for me.

I've read through the instrumentation guidelines, #753, and #1101

Currently, I'm getting ~2 minutes per inference with ResNet50 on a 2.0 GHz EPYC chip. I'm under the impression that a reasonable inference time on this chip should be a fair bit faster:

https://www.researchgate.net/figure/Inference-time-and-acceleration-for-ResNet-50-on-different-platforms_tbl2_343626375

I've compiled all onnx-mlir dependencies, the PyRuntime, and the ONNX model with -O3.
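For reference, this is roughly the kind of end-to-end wall-clock measurement I mean (a simplified sketch, not my exact harness; the PyRuntime class name varies between onnx-mlir versions, e.g. ExecutionSession vs. OMExecutionSession, and the model path and input shape here are illustrative):

```python
import time
import numpy as np
from PyRuntime import OMExecutionSession  # older builds expose ExecutionSession instead

# Shared library produced by compiling the ONNX model with onnx-mlir; the name is illustrative.
session = OMExecutionSession("resnet50.so")

# Dummy NCHW input for ResNet50; adjust shape/dtype to the model's actual signature.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

start = time.perf_counter()
outputs = session.run([x])
print(f"inference took {time.perf_counter() - start:.1f} s")
```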

After some preliminary profiling, I have been able to discern that the majority of the program runtime is spent in libm, which is probably expected.

[screenshot: aggregate profile]

For a bit of context, the experiments I'm running are 'hijacking' MatMul and Gemm ops via an Accelerator we wrote, which get indirected at runtime to our own implementations. Confusingly, the routine we're calling is named conv2d, but it's doing a matrix multiply. I inherited this codebase and some of the names are wonky. You can see the conv2d call in the profile above takes 16% of the program runtime.

Anyhow, at a more granular level (of a few hundred milliseconds), this is what the profile looks like:

[screenshot: interleaved profile]

It clearly shows that whatever is happening inside libm dominates the runtime (in those [unknown] blocks), and our little conv2d routine is minor by comparison.

So, I suppose I have a few questions:

How might I find out what libm is doing? I could always recompile it with debug symbols, but I wanted to ask if there are any tools in onnx-mlir for doing this type of performance analysis, or if anyone has any helpful suggestions before I go there.

Additionally, in #1101 @AlexandreEichenberger mentioned that "we currently don't optimize conv ops". Would someone be able to give me a bit more detail on what that means specifically?

Finally, if my assessment is correct (i.e. there truly are ~1.5 orders of magnitude of runtime on the table), do you think it would be a reasonable ask for a first-time contributor to onnx-mlir (such as myself) to make a sizable dent in that?

What I have not yet tried:

If these seem like worthwhile activities, or anyone has pointers/ideas about other data points that might be useful, I'm all ears.

jesse-lemurian commented 1 year ago

I'll attach the full profile I took the screenshots of; it can be unzipped and then viewed at https://profiler.firefox.com/: test.zip

AlexandreEichenberger commented 1 year ago

Are you interested in contributing optimizations to make CONV faster? It needs SIMD and tiling. I started looking at a promising approach that uses an NCHWxC data layout, where a block of x channels from C is moved to the innermost dimension of the image so that SIMD code can be generated easily. The data transform operations have already been implemented.
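As a rough illustration of the layout (a numpy sketch only, not the actual onnx-mlir transform; here x is the channel block size, which would be chosen to match the SIMD width):

```python
import numpy as np

def to_nchwxc(t, x):
    """Reshape an NCHW tensor into N(C/x)HWx layout, so x channels are contiguous innermost.
    Assumes C is a multiple of the block size x."""
    n, c, h, w = t.shape
    assert c % x == 0
    # N, C, H, W -> N, C/x, x, H, W -> N, C/x, H, W, x
    return t.reshape(n, c // x, x, h, w).transpose(0, 1, 3, 4, 2)

t = np.arange(2 * 8 * 4 * 4, dtype=np.float32).reshape(2, 8, 4, 4)
print(to_nchwxc(t, 4).shape)  # (2, 2, 4, 4, 4): the innermost dim of size 4 maps onto SIMD lanes
```

With x channels contiguous in memory, the innermost loop of the convolution can load and accumulate x values at a time with vector instructions.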

jesse-lemurian commented 1 year ago

Yes, I am interested in contributing optimizations to make CONV faster.

Could you link me to some relevant code that illustrates what an NCHWxC data layout looks like? I'm having trouble visualizing what that means in practice. Links to the 'data transform ops' would also be appreciated.

I'm assuming by 'tiling' you mean processing the inputs to the op in cache-size-friendly blocks? I've done this in another project. It's fiddly, but I could probably make it happen.
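To make sure we're talking about the same thing, here is a plain numpy sketch of the kind of blocking I have in mind (the tile size T is arbitrary here and would need tuning to the cache sizes):

```python
import numpy as np

def tiled_matmul(A, B, T=64):
    """C = A @ B computed in T x T blocks, so the working set of each block stays cache-resident."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, T):
        for j0 in range(0, N, T):
            for k0 in range(0, K, T):
                # Partial product of one tile; slicing clamps automatically at the matrix edges.
                C[i0:i0+T, j0:j0+T] += A[i0:i0+T, k0:k0+T] @ B[k0:k0+T, j0:j0+T]
    return C

A = np.random.rand(300, 200).astype(np.float32)
B = np.random.rand(200, 500).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, rtol=1e-4, atol=1e-4)
```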

Do you have any notion on how long implementing either SIMD or tiling might take a first-time contributor? I'm fairly familiar with tiling (assuming we're talking about the same thing), SIMD and linear algebra, but only passingly familiar with LLVM and MLIR. My background is game engines, if that provides any more color.

In any case thanks for the help :)

AlexandreEichenberger commented 1 year ago

Optimizing CNN Model Inference on CPUs (USENIX ATC '19): https://www.usenix.org/conference/atc19/presentation/liu-yizhi (PDF: https://www.usenix.org/system/files/atc19-liu-yizhi.pdf)

hunterzju commented 1 year ago

I have a similar problem with LLM inference: it is far too slow, even though I use the -parallel option. Is there a way to add debug info to the lib*.so file so that I can use perf to see which operator is the bottleneck?

AlexandreEichenberger commented 1 year ago

Conv is still not optimized; it is a reference implementation, except for the 1x1 Conv, which is implemented via matmul.

Parallelization helps a bit, but the Conv loops are still neither blocked (tiled) nor SIMDized.
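For anyone wondering why the 1x1 case is special: with a 1x1 kernel, stride 1, and no padding, each output pixel is just a dot product over the input channels, so the whole op collapses into a matmul. A small numpy sketch (illustrative only, not the onnx-mlir lowering itself):

```python
import numpy as np

N, C, H, W, M = 1, 8, 14, 14, 16
x = np.random.rand(N, C, H, W).astype(np.float32)   # input feature map
w = np.random.rand(M, C, 1, 1).astype(np.float32)   # 1x1 conv weights

# Reference result: every output pixel is w[m, :, 0, 0] . x[n, :, h, w].
ref = np.einsum("mc,nchw->nmhw", w[:, :, 0, 0], x)

# The same computation expressed as an (M x C) @ (C x H*W) matmul per image.
out = (w.reshape(M, C) @ x.reshape(N, C, H * W)).reshape(N, M, H, W)
assert np.allclose(out, ref, rtol=1e-4, atol=1e-4)
```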