intel / graph-compiler

MLIR-based toolkit targeting Intel heterogeneous hardware
Apache License 2.0

Support dynamic shape when lowering operation to vector #300

Open BRUCE11111 opened 1 month ago

BRUCE11111 commented 1 month ago

e.g.:

    %19 = affine.apply affine_map<(d0) -> ((d0 floordiv 64) * 64)>(%arg3)
    %20 = affine.min affine_map<(d0) -> (((d0 + 31) floordiv 64) * 64 - (d0 floordiv 64) * 64 + 64, (d0 floordiv 64) * -64 + 224)>(%arg3)
    %extracted_slice = tensor.extract_slice %arg0[%19, 0] [%20, 13] [1, 1] : tensor<224x13xbf16> to tensor<?x13xbf16>
    %extracted_slice_2 = tensor.extract_slice %1[%15, 0, 0, 0] [%17, 1, 64, 14] [1, 1, 1, 1] : tensor<4x1x64x14xbf16> to tensor<?x1x64x14xbf16>
    %pack = tensor.pack %extracted_slice padding_value(%cst : bf16) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [64, 14] into %extracted_slice_2 : tensor<?x13xbf16> -> tensor<?x1x64x14xbf16>
    %21 = tensor.empty(%18) : tensor<?x14xbf16>
    %unpack = tensor.unpack %pack outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [64, 14] into %21 : tensor<?x1x64x14xbf16> -> tensor<?x14xbf16>
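
For reference, here is a minimal sketch (an assumption, not the compiler's actual lowering) of how such a dynamically sized slice could be read into a fixed-shape vector by masking the dynamic dimension; the 64x13 vector shape, the bound on the dynamic row count, and the function name are chosen only for illustration:

    // Hypothetical sketch: read a tensor<?x13xbf16> slice into a fixed
    // vector<64x13xbf16>, assuming the dynamic row count %d is at most 64.
    func.func @masked_read(%src: tensor<?x13xbf16>, %d: index) -> vector<64x13xbf16> {
      %c0 = arith.constant 0 : index
      %c13 = arith.constant 13 : index
      %pad = arith.constant 0.000000e+00 : bf16
      // The mask covers only the first %d rows; masked-off rows are not read
      // and take the padding value instead.
      %mask = vector.create_mask %d, %c13 : vector<64x13xi1>
      %v = vector.transfer_read %src[%c0, %c0], %pad, %mask {in_bounds = [true, true]}
          : tensor<?x13xbf16>, vector<64x13xbf16>
      return %v : vector<64x13xbf16>
    }
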
yifeizh2 commented 1 month ago

Another follow-up on this case.

Dynamic shapes also occur when the loop step does not evenly divide the loop bound. For the llama MLP matmul shape:

numactl -C 0-55 -m 0 python3 ./tools/main.py --driver=mlp --batch_size=32 --hidden_size_list=11008x4096 --has_bias=4096 --act_type=relu --dtype=bf16 --warm_up=100 --repeat=500

The IR generated after deepTileMatmul contains dynamic shapes (?) due to the indivisible loop step. The shapes involved in the vector operations are still static; however, the CPUPhysicalRegister pass currently skips this case.
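
As an illustration only (not the deepTileMatmul output), here is a minimal sketch of how an indivisible step introduces a dynamic extent; the step of 96 and the function name are assumptions, the step chosen because it does not divide 11008 and leaves a 64-element tail tile:

    // Hypothetical sketch: 11008 is not a multiple of the assumed step 96,
    // so the last iteration gets a smaller tile; the slice extent becomes an
    // affine.min and the slice type turns into tensor<?xbf16>.
    func.func @tail_tile(%t: tensor<11008xbf16>) {
      %c0 = arith.constant 0 : index
      %c96 = arith.constant 96 : index
      %c11008 = arith.constant 11008 : index
      scf.for %iv = %c0 to %c11008 step %c96 {
        %sz = affine.min affine_map<(d0) -> (96, -d0 + 11008)>(%iv)
        %slice = tensor.extract_slice %t[%iv] [%sz] [1]
            : tensor<11008xbf16> to tensor<?xbf16>
      }
      return
    }
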

The detailed IR log is attached: llama_mlp_32x11008x4096.txt