intel / graph-compiler

MLIR-based toolkit targeting Intel heterogeneous hardware
Apache License 2.0

const weight packing support #146

Open ZhennanQin opened 4 months ago

ZhennanQin commented 4 months ago

During model inference, model weights are frozen and won't change between iterations. The CPU prefers a special weight layout to accelerate execution, so we need to prepack the model weights before model execution. This issue covers the items below:

ciyongch commented 4 months ago

It's beneficial for loading a high-precision weight, packing it into low precision, and caching it, but it's not applicable to the INT4 weights in LLMs (for example, the W4A16 scenario). We might need to clarify the original weight's datatype?

ZhennanQin commented 4 months ago

> It's beneficial for loading a high-precision weight, packing it into low precision, and caching it, but it's not applicable to the INT4 weights in LLMs (for example, the W4A16 scenario). We might need to clarify the original weight's datatype?

The first datatype that we want to support is BF16. And I don't think weight packing is inapplicable to INT4: for the W4A16 scenario, we can still use a similar blocked format like NK8k16n2k for weight packing to improve cache locality when converting INT4 to BF16 before BRGEMM. Is there any special reason that makes it not applicable?
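For illustration, here is a minimal sketch of re-laying out a row-major K x N BF16 weight into a blocked format along the lines of NK8k16n2k, where the innermost pair of K elements stays contiguous so a VNNI-style BRGEMM micro-kernel can consume it directly. The block sizes, index order, and the `pack_weight_nk8k16n2k` helper are assumptions for the sketch, not the actual GC pass:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: re-lay out a row-major [K x N] BF16 weight (stored as
// uint16_t) into an NK8k16n2k-style blocked buffer. Outer blocks iterate over
// N (16 lanes) and K (16 = 8 * 2); the innermost pair of K elements is kept
// contiguous so a VNNI-style BRGEMM kernel reads it in one shot.
// Block sizes are illustrative; a real pass would pick them per ISA.
std::vector<uint16_t> pack_weight_nk8k16n2k(const std::vector<uint16_t> &src,
                                            int64_t K, int64_t N) {
  assert(K % 16 == 0 && N % 16 == 0 && "sketch assumes padded shapes");
  std::vector<uint16_t> dst(static_cast<size_t>(K) * N);
  const int64_t Kb = K / 16, Nb = N / 16;
  for (int64_t nb = 0; nb < Nb; ++nb)
    for (int64_t kb = 0; kb < Kb; ++kb)
      for (int64_t ko = 0; ko < 8; ++ko)        // 8 pairs along K
        for (int64_t ni = 0; ni < 16; ++ni)     // 16 N lanes
          for (int64_t ki = 0; ki < 2; ++ki) {  // the contiguous K pair
            int64_t k = kb * 16 + ko * 2 + ki;
            int64_t n = nb * 16 + ni;
            int64_t blocked = ((((nb * Kb + kb) * 8 + ko) * 16 + ni) * 2 + ki);
            dst[blocked] = src[k * N + n];      // src is plain row-major K x N
          }
  return dst;
}
```

For the W4A16 case, the same loop nest could run after (or fused with) the INT4-to-BF16 up-conversion, so the dequantized block already sits in the layout BRGEMM consumes.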

ciyongch commented 4 months ago

My bad, I was referring to the datatype conversion step (from INT4 up to FP16 or INT8) not being beneficial within the entire tensor transformation pipeline. The re-layout itself is always beneficial from a cache-locality perspective.

niuxiaog commented 4 months ago

In OpenVINO, the IR of a model consists of its topology and constant values, such as weights, held in memory.

For each Graph, there is a GraphContext attr. A GraphContext holds a WeightsSharing, which is basically a std::unordered_map<std::string, MemoryInfo::Ptr> that stores the memory of cached tensors.

Take FullyConnectedOp (FC for short) with a DNNL primitive as an example. Each FC has a DnnlFCExecutor, which has an attr of type ExecutorContext. The ExecutorContext holds an unordered_map<string, MemoryPtr> to store the memory of its private cached weights.
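A rough sketch of the shape of those two caches is below; the class and type names are illustrative stand-ins, not the actual OpenVINO CPU-plugin declarations:

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for MemoryInfo::Ptr / MemoryPtr.
struct Memory { std::vector<uint8_t> bytes; };
using MemoryPtr = std::shared_ptr<Memory>;

// Graph-level cache in the spirit of WeightsSharing: keyed by a string id,
// the first consumer materializes the tensor, later consumers reuse it.
class WeightsSharingSketch {
public:
  MemoryPtr findOrCreate(const std::string &key,
                         const std::function<MemoryPtr()> &create) {
    auto it = cache_.find(key);
    if (it == cache_.end())
      it = cache_.emplace(key, create()).first;
    return it->second;
  }

private:
  std::unordered_map<std::string, MemoryPtr> cache_;
};

// Executor-level private cache in the spirit of ExecutorContext: each
// DnnlFCExecutor-like object keeps the weights it has already repacked.
struct ExecutorContextSketch {
  std::unordered_map<std::string, MemoryPtr> privateWeightCache;
};
```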

In the compile stage, the operations (for example, type-casting ops) that follow a ConstantOp (weights, biases, or others) are executed and the results are cached in the unordered_map of the GraphContext.
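For example, a weight stored in FP32 with a cast-to-BF16 op following it could be folded once at compile time, roughly like this; the `foldConstantCast` helper, the truncating conversion, and the cache key scheme are assumptions made for the sketch:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

// Truncating FP32 -> BF16 conversion (rounding ignored for brevity).
static uint16_t fp32ToBf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

// Hypothetical compile-stage folding: the cast that follows a weight
// ConstantOp is executed once and the BF16 result is cached under the
// constant's name, so execution never sees the FP32 original.
void foldConstantCast(
    const std::string &name, const std::vector<float> &weight,
    std::unordered_map<std::string, std::vector<uint16_t>> &graphCache) {
  std::vector<uint16_t> bf16(weight.size());
  for (size_t i = 0; i < weight.size(); ++i)
    bf16[i] = fp32ToBf16(weight[i]);
  graphCache.emplace(name, std::move(bf16)); // reused by all later executions
}
```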

When the FC has dynamic-shape input, which is the case for llama2, there is nothing to do with the weights in the compile stage. In fact, there is no explicit ReorderOp in the graph after the ConstantOp that holds the weight of a FullyConnectedOp. In the first execution, all the input shapes are defined and the DnnlFCExecutor is constructed. During the construction, the weight is packed into the blocked format and stored in the unordered_map of the ExecutorContext. In later executions, the packed weight can be used directly.
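In sketch form, the dynamic-shape path could look like the following; the class and function names are invented for illustration (the real plugin goes through dnnl memory descriptors and reorder primitives instead):

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

struct PackedWeight { std::vector<uint16_t> blocked; };
using PackedWeightPtr = std::shared_ptr<PackedWeight>;

// Hypothetical FC executor: nothing touches the weight at compile time
// because the shapes are unknown; the blocked copy is produced on the
// first execute() and reused by every later iteration.
class FcExecutorSketch {
public:
  explicit FcExecutorSketch(std::string weightId) : weightId_(std::move(weightId)) {}

  void execute(const std::vector<uint16_t> &plainWeight, int64_t K, int64_t N) {
    if (!packed_) { // first execution: shapes are now defined, pack and cache
      auto it = privateCache_.find(weightId_);
      if (it == privateCache_.end()) {
        auto pw = std::make_shared<PackedWeight>();
        pw->blocked = packToBlocked(plainWeight, K, N); // e.g. NK8k16n2k
        it = privateCache_.emplace(weightId_, pw).first;
      }
      packed_ = it->second;
    }
    runBrgemm(*packed_); // later executions skip the packing entirely
  }

private:
  static std::vector<uint16_t> packToBlocked(const std::vector<uint16_t> &w,
                                             int64_t, int64_t) {
    return w; // placeholder for the actual re-layout
  }
  static void runBrgemm(const PackedWeight &) {}

  std::string weightId_;
  PackedWeightPtr packed_;
  std::unordered_map<std::string, PackedWeightPtr> privateCache_;
};
```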

When the FC has a static shape, the above packing and caching process is done in the compile stage. All the executions directly use the cached weight.

I think we can still use most of the design of the constant tensor cache for oneDNN Graph. The model will be split into two parts, fold and compute. Since the actual values of constant tensors are available in OpenVINO, the fold part can be compiled and executed in the compile stage, and only the compute part is executed in the execution stage.
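A minimal sketch of how such a fold/compute split might look from the integration side; the `fold`/`compute` entry points and the cache type are hypothetical, not existing GC APIs:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Tensor = std::vector<uint8_t>;
using FoldedCache = std::unordered_map<std::string, Tensor>;

// Hypothetical split of one compiled subgraph into two callables:
//  - fold    : consumes the original constants, produces packed/folded ones;
//  - compute : consumes activations plus the folded constants.
struct CompiledSubgraph {
  std::function<FoldedCache(const FoldedCache &originalConstants)> fold;
  std::function<Tensor(const Tensor &activation, const FoldedCache &folded)> compute;
};

// Compile stage: constant values are known, so the fold part runs exactly once.
FoldedCache prepare(const CompiledSubgraph &sg, const FoldedCache &constants) {
  return sg.fold(constants);
}

// Execution stage: only the compute part runs, reusing the folded cache.
Tensor run(const CompiledSubgraph &sg, const Tensor &activation,
           const FoldedCache &folded) {
  return sg.compute(activation, folded);
}
```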

A big difference between OpenVINO's design and this design is that compilation and execution are operation-wise in OpenVINO, while with our GC as the backend they will be graph-wise. This may cause some difficulties for integration.

ZhennanQin commented 4 months ago

> A big difference between OpenVINO's design and this design is that compilation and execution are operation-wise in OpenVINO, while with our GC as the backend they will be graph-wise. This may cause some difficulties for integration.

Based on the current MLIR integration here: https://github.com/openvinotoolkit/openvino/commit/4b524cac4ceff85301c9d3b0c6a36755b0da5783#diff-b40ca25e9ca41e663971ae2274f78b5a444a0f3ba5d014d2323e20f199b690b0, an MLIR subgraph will be represented as a single op in the OV graph, which sounds like a perfect match.