Open HectorSVC opened 5 days ago
Regarding #1, I will note in the readme:
This benchmark is designed to resemble some real world models we depend on
Regarding #2, Whisper (and most other models) doesn't run the same matrix multiplication over and over again. Instead it runs a series of different (large) multiplications in a row. This tends to push weights out of cache, so I'd argue that cold-cache performance for a single layer's operations is, if anything, more important than warm-cache performance.
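The cache effect described above can be sketched with a small NumPy benchmark. This is illustrative only: the matrix sizes, iteration counts, and the number of distinct weight matrices are assumptions, not values taken from the actual benchmark script.

```python
import time
import numpy as np

def bench(mats, x, iters=20):
    # Time repeated matmuls; `mats` is cycled so each call either reuses
    # one weight matrix (warm cache) or touches a different one (cold cache).
    start = time.perf_counter()
    for i in range(iters):
        x @ mats[i % len(mats)]
    return time.perf_counter() - start

n = 512
rng = np.random.default_rng(0)
x = rng.standard_normal((n, n)).astype(np.float32)

# One weight reused every iteration -> stays resident in cache.
warm_mats = [rng.standard_normal((n, n)).astype(np.float32)]
# Many distinct weights, as in a transformer layer stack -> evicted between uses.
cold_mats = [rng.standard_normal((n, n)).astype(np.float32) for _ in range(16)]

warm = bench(warm_mats, x)
cold = bench(cold_mats, x)
print(f"warm-cache: {warm:.4f}s  cold-cache: {cold:.4f}s")
```

On most machines the cold-cache loop is measurably slower once the combined weight footprint exceeds the last-level cache, which is the scenario the comment argues a single-layer microbenchmark should model.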
Do your real-world models have the same IO sizes? It doesn't make sense to just extract part of a model and test it separately; it makes more sense to test a full model instead.
Also, the benchmark script compares a QDQ model on the NPU against an fp32 model on the CPU, which is not an apples-to-apples comparison.
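One way to see why that comparison conflates two variables (quantization error and backend) is to simulate the quantize-dequantize round trip in fp32 on the same backend, so the error introduced by QDQ alone can be measured. A minimal NumPy sketch, where the per-tensor scale heuristic, tensor shapes, and int8 range are illustrative assumptions rather than details from the benchmark:

```python
import numpy as np

def qdq(x, scale, zero_point=0):
    # Simulate an int8 quantize -> dequantize round trip in fp32.
    q = np.clip(np.round(x / scale) + zero_point, -128, 127)
    return ((q - zero_point) * scale).astype(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
w = rng.standard_normal((64, 64)).astype(np.float32)

# Per-tensor scales derived from the data range (a common heuristic).
sa = float(np.max(np.abs(a))) / 127
sw = float(np.max(np.abs(w))) / 127

ref = a @ w                    # fp32 reference on this backend
sim = qdq(a, sa) @ qdq(w, sw)  # same backend, QDQ-simulated inputs

err = float(np.max(np.abs(ref - sim)))
print(f"max abs error from quantization alone: {err:.4f}")
```

Comparing QDQ-on-NPU against fp32-on-CPU folds this quantization error and the hardware difference into one number; running the same precision on both backends (or the same backend at both precisions) separates them.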
There's a useless DQ node in matmul_model_quant_io.onnx.
I also have some questions: