sonos / tract

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

LirMatMulUnary slow? #1035

Closed stellanhaglund closed 1 year ago

stellanhaglund commented 1 year ago

Hi! I'm trying to run a custom PyTorch vision transformer model (an audio spectrogram transformer, more specifically). I was able to get everything working, but the problem is that inference on CPU takes about 100 ms in PyTorch, while in tract I get about 450 ms.

I ran it through the tract CLI profiler and got this:

```
 * LirMatMulUnary        50 nodes: 332.091 ms/i 76.4%
 * Mul                  112 nodes:  20.345 ms/i  4.7%
 * MatMul                24 nodes:  20.088 ms/i  4.6%
 * Softmax               12 nodes:  18.506 ms/i  4.3%
 * Add                   63 nodes:   9.814 ms/i  2.3%
 * MatMatMulPack         49 nodes:   7.722 ms/i  1.8%
```

From that it looks like it could be LirMatMulUnary that's slowing it down. Is there anything I can do about this?

Right now I'm running it on an M2, but I'm hoping to be able to run it on mobile devices as well.

stellanhaglund commented 1 year ago

Maybe an option could be to replace the parts of the model that make it use this operation. Do you have any idea which torch operation it is?

kali commented 1 year ago

LirMatMulUnary is tract's matrix multiplier. All affine operations (convolutions, matmul, ...) are ultimately lowered to this specific operator. It is expected that a neural network spends most of its time doing matrix products; I would actually say the proportion feels low here, it is often in the ~90% neighborhood. Looking for a replacement is not the right path.
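As a rough sanity check on such profiles, matmul cost is conventionally counted as 2·m·k·n floating-point operations, which you can compare against the GFlop/s your hardware sustains. A minimal sketch (the 197×768 dimensions are illustrative ViT-style sizes, not taken from this model):

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) @ (k x n) product:
    one multiply and one add per inner-product term."""
    return 2 * m * k * n

# Illustrative ViT-style projection: 197 tokens, 768 -> 768 features.
print(matmul_flops(197, 768, 768) / 1e9, "GFlops for this one matmul")
```

Dividing a node's estimated FLOPs by its measured ms/i gives the velocity figure the profiler reports.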

Can you share a bit more? Are there specific instances of MatMul that are under-performing? `--cost --profile` gives you a cost and velocity indication (in GFlops/s).
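For reference, an invocation along these lines surfaces those per-node cost and speed figures (the model path and input shape below are placeholders; check `tract --help` for the exact flags in your version):

```
tract model.onnx -i 1,3,224,224,f32 -O dump --cost --profile
```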

tract is relatively efficient on M1 and M2. But tract uses a single CPU core to run the inference, while pytorch and other frameworks may make use of multiple cores (plus a gpu sometimes).

You could also try the main branch. A lot of effort has recently gone into making the optimizer more robust across network architectures. It's not fully baked yet, but it may be interesting.
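If you want to try main without checking the repository out yourself, installing the CLI straight from git should work (assuming a Rust toolchain is available; the package name `tract` is the CLI crate):

```
cargo install --git https://github.com/sonos/tract tract
```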

stellanhaglund commented 1 year ago

Okay, I see! I will try that out. I guess I have to go with a smaller model for this to be feasible. Will try this one: https://github.com/chinhsuanwu/mobilevit-pytorch

kali commented 1 year ago

Closing here, reopen or create a new issue if needed.