intel / graph-compiler

MLIR-based toolkit targeting Intel heterogeneous hardware
Apache License 2.0

bf16 matmul's corresponding `tensor.pack` not properly optimized #320

Open yifeizh2 opened 2 months ago

yifeizh2 commented 2 months ago

Currently, the following two single-layer MLPs have worse performance than GC v1:

| dtype | batch size | hidden list | GC v1 | 8c55a0544 (remove brgemm read lock) | GC v1 / 8c55a0544 |
| -- | -- | -- | -- | -- | -- |
| bf16 | 128 | 1024x1024 | 0.0286 | 0.0828 | 34.52% |
| bf16 | 128 | 1024x512 | 0.0204 | 0.0670 | 30.45% |

We performed a detailed breakdown as follows:


| 128x1024x1024 | GC v1 | 8c55a0544 |
| -- | -- | -- |
| matmul only | 0.01766 | 0.01989 |
| tiled pack (or reorder) | 0.02634 | 0.04632 |
| total | 0.04418 | 0.077969 |

and


| 128x1024x512 | GC v1 | 8c55a0544 |
| -- | -- | -- |
| matmul only | 0.01587 | 0.01591 |
| tiled pack (or reorder) | 0.01278 | 0.0398 |
| total | 0.02881 | 0.06917 |

Are there any further optimization opportunities for the VNNI pack?
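
For reference, a minimal sketch of the kind of `tensor.pack` under discussion, assuming the 1024x1024 bf16 weight from the first layer above. The SSA names and the single inner tile of 2 on the reduction (K) dimension are illustrative only; the real pack in GC also blocks the outer dimensions, but this shows the VNNI pairing that the reorder has to produce:

```mlir
// Tile K by 2 so that pairs of bf16 values along the reduction dimension
// become contiguous, as required by the VNNI layout.
// Illustrative shapes/names: %weight is the 1024x1024 bf16 constant weight.
%dest = tensor.empty() : tensor<512x1024x2xbf16>
%vnni = tensor.pack %weight inner_dims_pos = [0] inner_tiles = [2]
    into %dest : tensor<1024x1024xbf16> -> tensor<512x1024x2xbf16>
```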

BRUCE11111 commented 2 months ago

VNNI reorder is on my to-do list. However, the current priority is to merge the physical register pass and the corresponding vector-based op fusion under static shape into master as soon as possible (within two weeks), then support dynamic shape for the sake of another issue, and then optimize the instruction-level performance of specific ops like the VNNI reorder. I can switch priorities if there is a more urgent need.
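
For context, at the instruction level the bf16 VNNI reorder boils down to interleaving two consecutive K rows element by element. A hedged sketch in the vector dialect (the row names and the 8-element vectors are illustrative, chosen only to keep the shuffle mask short):

```mlir
// Interleave two consecutive K rows so that each bf16 pair feeding one
// VNNI lane becomes contiguous: result = [a0, b0, a1, b1, ..., a7, b7].
%pair = vector.shuffle %row_k, %row_k1
    [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
    : vector<8xbf16>, vector<8xbf16>
```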

ZhennanQin commented 2 months ago

I guess those VNNI reorders can be folded away if we have constant weight cache support? @niuxiaog Can you try to enable the weight cache for both bench-gc and the OV integration?
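
To illustrate the idea: if the weight is a compile-time constant (or is cached across runs), the pack can in principle be materialized once and the runtime reorder disappears. A conceptual sketch, reusing the illustrative tiling from above; the `dense_resource` blob names are hypothetical:

```mlir
// Before weight caching: the pack of a constant weight runs on every inference.
%w = arith.constant dense_resource<mlp_fc1_weight> : tensor<1024x1024xbf16>
%d = tensor.empty() : tensor<512x1024x2xbf16>
%p = tensor.pack %w inner_dims_pos = [0] inner_tiles = [2]
    into %d : tensor<1024x1024xbf16> -> tensor<512x1024x2xbf16>

// After weight caching (conceptually): the packed weight is a cached
// constant, so no tensor.pack is executed at runtime.
%p_cached = arith.constant dense_resource<mlp_fc1_weight_packed>
    : tensor<512x1024x2xbf16>
```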

niuxiaog commented 2 months ago

I'm working on enabling it with OV and may finish this week. For bench-gc, maybe next week.

lmontigny commented 2 months ago

Waiting for dynamic shape support.