Open ZhennanQin opened 4 months ago
It's beneficial for loading a high-precision weight, packing it into low precision and caching it, but it's not applicable to the INT4 weights in LLMs (for example, the W4A16 scenario). We might need to clarify the original weight's datatype?
The first datatype that we want to support is BF16. And I don't think weight packing is inapplicable to INT4: for the W4A16 scenario, we can still use a similar blocked format like NK8k16n2k for weight packing to improve cache locality when converting INT4 to BF16 before BRGEMM. Is there any special reason that makes it not applicable?
My bad, I was referring to the fact that it's not beneficial for the datatype-conversion step (from INT4 to a higher precision such as FP16 or INT8) in the entire tensor transformation pipeline. For re-layout, it's always beneficial from a cache-locality perspective.
In OpenVINO, the IR of a model consists of its topology and constant values, like weights, in memory. For each `Graph`, there is a `GraphContext` attr. A `GraphContext` holds a `WeightsSharing`, which is basically a `std::unordered_map<std::string, MemoryInfo::Ptr>` that stores the memory of cached tensors.
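As a rough illustration of that shared cache, here is a minimal sketch assuming simplified stand-in types; `Memory`, `MemoryPtr`, and `WeightsSharingSketch` are not the actual OpenVINO CPU plugin classes, which carry extra bookkeeping (reference counting, eviction, etc.):

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the plugin's MemoryInfo / MemoryPtr.
struct Memory {
    std::vector<uint8_t> data;
};
using MemoryPtr = std::shared_ptr<Memory>;

// Simplified string -> memory cache shared by the graphs that use it.
class WeightsSharingSketch {
public:
    // Return the cached tensor for `key`, or build it once and cache it.
    template <class Builder>
    MemoryPtr getOrCreate(const std::string& key, Builder build) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(key);
        if (it != cache_.end())
            return it->second;       // already cached: reuse the memory
        MemoryPtr mem = build();     // e.g. a folded or repacked constant
        cache_.emplace(key, mem);
        return mem;
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::string, MemoryPtr> cache_;
};
```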
Take `FullyConnectedOp` (`FC` for short) with a DNNL primitive as an example. Each `FC` has a `DnnlFCExecutor`, which has an attr of type `ExecutorContext`. The `ExecutorContext` holds an `unordered_map<string, MemoryPtr>` to store the memory of its private cached weights.
In the compile stage, the operations (for example, type-casting ops) that follow a `ConstantOp` (weights, bias or others) are executed and the results are cached in the `unordered_map` of the `GraphContext`.
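For example, here is a minimal, self-contained sketch of folding a cast to BF16 that follows a constant weight and caching the result; the cache type and the `foldCastToBf16` helper are illustrative only, not the actual plugin API:

```cpp
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-ins for the cached memory and the graph-level cache map.
using MemoryPtr = std::shared_ptr<std::vector<uint8_t>>;
using ConstCache = std::unordered_map<std::string, MemoryPtr>;

// Truncating FP32 -> BF16 conversion (round-to-zero, for brevity).
static uint16_t fp32ToBf16Bits(float v) {
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Execute the cast that follows a constant weight once, at compile stage,
// and store the result in the cache under the constant's name.
MemoryPtr foldCastToBf16(ConstCache& cache,
                         const std::string& constName,
                         const std::vector<float>& fp32Weight) {
    auto it = cache.find(constName);
    if (it != cache.end())
        return it->second;  // already folded and cached
    auto mem = std::make_shared<std::vector<uint8_t>>(fp32Weight.size() * sizeof(uint16_t));
    auto* dst = reinterpret_cast<uint16_t*>(mem->data());
    for (size_t i = 0; i < fp32Weight.size(); ++i)
        dst[i] = fp32ToBf16Bits(fp32Weight[i]);  // executed once, not per inference
    cache.emplace(constName, mem);
    return mem;
}
```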
When the `FC` has a dynamic-shape input, which is the case for llama2, there is nothing to do with the weights in the compile stage. Actually, there is no explicit `ReorderOp` in the graph after the `ConstantOp` that holds the weight of a `FullyConnectedOp`. In the first execution, all the input shapes are defined and the `DnnlFCExecutor` is constructed. During the construction, the weight is packed into the blocked format and stored in the `unordered_map` of the `ExecutorContext`. In later executions, the packed weight can be used directly.
When the `FC` has static shape, the above packing and caching process is done in the compile stage. All the executions directly use the cached weight.
I think we can still use most of the design of the constant tensor cache for oneDNN Graph. The model will be split into two parts, `fold` and `compute`. Since the actual values of constant tensors are available in OpenVINO, the `fold` part can be compiled and executed in the compile stage, and only the `compute` part is executed in the execution stage.
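To illustrate the split, here is a sketch under simplified assumptions (`Node`/`Graph` structures and the foldability rule are stand-ins, not the actual GC partitioning logic): ops whose inputs all trace back to constants go into the `fold` part and can be executed once at compile stage; everything else stays in the `compute` part.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

struct Node {
    std::string name;
    bool isConstant = false;          // e.g. a ConstantOp holding weights/bias
    std::vector<const Node*> inputs;  // producer nodes
};

struct Partition {
    std::vector<const Node*> fold;    // executed once, at compile stage
    std::vector<const Node*> compute; // executed every inference iteration
};

// `nodes` is assumed to be in topological order.
Partition splitFoldCompute(const std::vector<Node>& nodes) {
    Partition p;
    std::unordered_set<const Node*> foldable;
    for (const Node& n : nodes) {
        bool allInputsFoldable = true;
        for (const Node* in : n.inputs)
            if (!foldable.count(in)) { allInputsFoldable = false; break; }
        // A node is foldable if it is a constant, or if every producer is foldable.
        if (n.isConstant || (allInputsFoldable && !n.inputs.empty())) {
            foldable.insert(&n);
            p.fold.push_back(&n);
        } else {
            p.compute.push_back(&n);  // e.g. Parameters and ops fed by activations
        }
    }
    return p;
}
```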
A big difference between OpenVINO's design and this design is that compile and execution are operation-wise in OpenVINO, while with our GC as the backend they will be graph-wise. This may cause some difficulties for integration.
Based on the current MLIR integration here: https://github.com/openvinotoolkit/openvino/commit/4b524cac4ceff85301c9d3b0c6a36755b0da5783#diff-b40ca25e9ca41e663971ae2274f78b5a444a0f3ba5d014d2323e20f199b690b0, an MLIR subgraph will be represented as a single op in the OV graph, which sounds like a perfect match.
During model inference, the model weights are frozen and won't change between iterations. The CPU prefers a special weight layout to accelerate execution, so we need to prepack the model weights before model execution. This issue covers the items below: