intel / graph-compiler


const weight packing support #146

Open ZhennanQin opened 2 days ago

ZhennanQin commented 2 days ago

During model inference, the model weights are frozen and do not change between iterations. CPUs prefer a special weight layout to accelerate execution, so we need to prepack the model weights before model execution (a rough sketch of the idea follows at the end of this comment). This issue covers the items below:

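To make this concrete, here is a minimal, hypothetical sketch of a one-time prepack that is cached and reused across inference iterations. The blocked layout (16x16 tiles), the `pack_blocked`/`get_packed` names, and the address-based cache key are illustrative assumptions, not the actual graph-compiler design.

```cpp
// Illustrative sketch only: a one-time weight prepack cached across inference
// iterations. The blocked layout and the cache keying are assumptions.
#include <cstdint>
#include <unordered_map>
#include <vector>

using bf16 = uint16_t; // raw bf16 bits, storage only

// Pack a row-major [K x N] weight into [N/16][K/16][16][16] blocks so that
// each 16x16 tile is contiguous for the microkernel. K and N are assumed to
// be multiples of 16 for brevity.
std::vector<bf16> pack_blocked(const bf16 *w, int64_t K, int64_t N) {
  std::vector<bf16> packed(K * N);
  for (int64_t nb = 0; nb < N / 16; ++nb)
    for (int64_t kb = 0; kb < K / 16; ++kb)
      for (int64_t k = 0; k < 16; ++k)
        for (int64_t n = 0; n < 16; ++n)
          packed[((nb * (K / 16) + kb) * 16 + k) * 16 + n] =
              w[(kb * 16 + k) * N + nb * 16 + n];
  return packed;
}

// Since frozen weights never change between iterations, the packed copy can
// be computed once and looked up by the weight's address on every execution.
// (A real implementation would need thread safety and eviction; omitted here.)
const std::vector<bf16> &get_packed(const bf16 *w, int64_t K, int64_t N) {
  static std::unordered_map<const bf16 *, std::vector<bf16>> cache;
  auto it = cache.find(w);
  if (it == cache.end())
    it = cache.emplace(w, pack_blocked(w, K, N)).first;
  return it->second;
}
```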
ciyongch commented 2 days ago

It's beneficial to load a high-precision weight, pack it into low precision, and cache it, but that's not applicable to the INT4 weights in LLMs (for example, the W4A16 scenario). We might need to clarify what the original weight's datatype is?

ZhennanQin commented 2 days ago

> It's beneficial to load a high-precision weight, pack it into low precision, and cache it, but that's not applicable to the INT4 weights in LLMs (for example, the W4A16 scenario). We might need to clarify what the original weight's datatype is?

The first datatype we want to support is BF16. And I don't agree that weight packing is inapplicable to INT4: for the W4A16 scenario, we can still use a similar blocked format like NK8k16n2k for weight packing to improve cache locality when converting INT4 to BF16 before BRGEMM. Is there any special reason that makes it not applicable?
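For illustration, here is a rough sketch of what a dequantize-and-pack into an NK8k16n2k-style layout could look like, reading the tag in the oneDNN fashion as outer [N/16][K/16] blocks with an inner [8][16][2] tile (16 k values, paired along the last dimension for the BF16 dot-product kernels). The tag interpretation, the int4 nibble packing, and the per-column scale are all assumptions made for this sketch, not the project's actual scheme.

```cpp
// Rough sketch: convert an int4 weight into a BF16 NK8k16n2k-style blocked
// layout before BRGEMM. Layout interpretation and dequant scheme are assumed.
#include <cstdint>
#include <cstring>
#include <vector>

using bf16 = uint16_t;

static bf16 to_bf16(float f) { // truncating float->bf16, good enough here
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<bf16>(bits >> 16);
}

// Assumed int4 storage: two values per byte in row-major [K x N] order,
// offset-8 encoding, with a per-column dequantization scale.
static float load_int4(const uint8_t *w4, int64_t k, int64_t n, int64_t N) {
  uint8_t byte = w4[(k * N + n) / 2];
  int v = (k * N + n) % 2 ? (byte >> 4) : (byte & 0xF);
  return static_cast<float>(v - 8);
}

std::vector<bf16> pack_NK8k16n2k(const uint8_t *w4, const float *scale,
                                 int64_t K, int64_t N) {
  std::vector<bf16> packed(K * N); // K, N assumed multiples of 16
  for (int64_t nb = 0; nb < N / 16; ++nb)
    for (int64_t kb = 0; kb < K / 16; ++kb)
      for (int64_t ko = 0; ko < 8; ++ko)       // 8 * 2 = 16 k per block
        for (int64_t ni = 0; ni < 16; ++ni)    // 16 n per block
          for (int64_t ki = 0; ki < 2; ++ki) { // VNNI-style k pair
            int64_t k = kb * 16 + ko * 2 + ki;
            int64_t n = nb * 16 + ni;
            int64_t dst =
                (((nb * (K / 16) + kb) * 8 + ko) * 16 + ni) * 2 + ki;
            packed[dst] = to_bf16(load_int4(w4, k, n, N) * scale[n]);
          }
  return packed;
}
```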

ciyongch commented 2 days ago

My bad, I was referring to the datatype-conversion step (from INT4 up to a higher precision such as FP16 or INT8) not being beneficial within the overall tensor transformation pipeline. For the re-layout itself, it's always beneficial from a cache-locality perspective.