intel / graph-compiler

MLIR-based toolkit targeting Intel heterogeneous hardware
Apache License 2.0

bf16 matmul's corresponding `tensor.pack` not properly optimized #320

Open yifeizh2 opened 2 months ago

yifeizh2 commented 2 months ago

Currently, the following two single-layer MLPs have worse performance than GC v1:

| dtype | batch size | hidden list | GC v1 | 8c55a0544 (remove brgemm read lock) | GC v1 / 8c55a0544 |
| -- | -- | -- | -- | -- | -- |
| bf16 | 128 | 1024x1024 | 0.0286 | 0.0828 | 34.52% |
| bf16 | 128 | 1024x512 | 0.0204 | 0.0670 | 30.45% |

We performed a detailed breakdown as follows:


| 128x1024x1024 | GC v1 | 8c55a0544 |
| -- | -- | -- |
| matmul only | 0.01766 | 0.01989 |
| tiled pack (or reorder) | 0.02634 | 0.04632 |
| total | 0.04418 | 0.077969 |

and


| 128x1024x512 | GC v1 | 8c55a0544 |
| -- | -- | -- |
| matmul only | 0.01587 | 0.01591 |
| tiled pack (or reorder) | 0.01278 | 0.0398 |
| total | 0.02881 | 0.06917 |

Are there any further optimization opportunities for the VNNI pack?
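
For reference, a minimal sketch of the kind of `tensor.pack` under discussion, assuming the 1024x1024 bf16 weight from the first layer above. The SSA names and the single inner tile of 2 on the reduction (K) dimension are illustrative only; the real pack in GC also blocks the outer dimensions, but this shows the VNNI pairing that the reorder has to produce:

```mlir
// Tile K by 2 so that pairs of bf16 values along the reduction dimension
// become contiguous, as required by the VNNI layout.
// Illustrative shapes/names: %weight is the 1024x1024 bf16 constant weight.
%dest = tensor.empty() : tensor<512x1024x2xbf16>
%vnni = tensor.pack %weight inner_dims_pos = [0] inner_tiles = [2]
    into %dest : tensor<1024x1024xbf16> -> tensor<512x1024x2xbf16>
```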

BRUCE11111 commented 2 months ago

VNNI reorder is on my to-do list. However, the current priority is to merge the physical register pass and the corresponding vector-based op fusion under static shape into master as soon as possible (within two weeks), then support dynamic shape for the sake of another issue, and then optimize the instruction-level performance of specific ops like the VNNI reorder. I can switch priorities if there is a more urgent need.
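
For context, at the instruction level the bf16 VNNI reorder boils down to interleaving two consecutive K rows element by element. A hedged sketch in the vector dialect (the row names and the 8-element vectors are illustrative, chosen only to keep the shuffle mask short):

```mlir
// Interleave two consecutive K rows so that each bf16 pair feeding one
// VNNI lane becomes contiguous: result = [a0, b0, a1, b1, ..., a7, b7].
%pair = vector.shuffle %row_k, %row_k1
    [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
    : vector<8xbf16>, vector<8xbf16>
```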

ZhennanQin commented 2 months ago

I guess those VNNI reorders can be folded away if we have constant weight cache support? @niuxiaog Can you try to enable the weight cache for both bench-gc and the OV integration?
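
To illustrate the idea: if the weight is a compile-time constant (or is cached across runs), the pack can in principle be materialized once and the runtime reorder disappears. A conceptual sketch, reusing the illustrative tiling from above; the `dense_resource` blob names are hypothetical:

```mlir
// Before weight caching: the pack of a constant weight runs on every inference.
%w = arith.constant dense_resource<mlp_fc1_weight> : tensor<1024x1024xbf16>
%d = tensor.empty() : tensor<512x1024x2xbf16>
%p = tensor.pack %w inner_dims_pos = [0] inner_tiles = [2]
    into %d : tensor<1024x1024xbf16> -> tensor<512x1024x2xbf16>

// After weight caching (conceptually): the packed weight is a cached
// constant, so no tensor.pack is executed at runtime.
%p_cached = arith.constant dense_resource<mlp_fc1_weight_packed>
    : tensor<512x1024x2xbf16>
```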

niuxiaog commented 2 months ago

I'm working on enabling it with OV and may finish this week. For bench-gc, maybe next week.

lmontigny commented 2 months ago

Waiting for dynamic shape support.