iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[CPU] Do not generate `vector.gather` for contiguous loads #14446

Open dcaballe opened 1 year ago

dcaballe commented 1 year ago

The tensor.extract below has a contiguous memory access pattern along the innermost dimension of the extracted tensor. We should improve the vectorization analysis so that it generates a contiguous vector load instead of a vector.gather:

// -----// IR Dump After LLVMCPUTensorPad (iree-llvmcpu-tensor-pad) //----- //
func.func @forward_dispatch_19_batch_matmul_1024x384x384x64_f32() {
  %c64 = arith.constant 64 : index
  %c128 = arith.constant 128 : index
  %c8 = arith.constant 8 : index
  %c32 = arith.constant 32 : index
  %c384 = arith.constant 384 : index
  %c1024 = arith.constant 1024 : index
  %c100761600 = arith.constant 100761600 : index
  %c201424896 = arith.constant 201424896 : index
  %c0 = arith.constant 0 : index
  %c503414784 = arith.constant 503414784 : index
  %cst = arith.constant 0.000000e+00 : f32
  %c16 = arith.constant 16 : index
  %cst_0 = arith.constant 1.000000e+00 : f32
  %cst_1 = arith.constant -3.40282347E+38 : f32
  %cst_2 = arith.constant 8.000000e+00 : f32
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c100761600) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1024x384x64xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c201424896) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1024x64x384xf32>>
  %2 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<64x1x1x384xf32>>
  %3 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c503414784) : !flow.dispatch.tensor<writeonly:tensor<1024x384x384xf32>>
  %4 = flow.dispatch.tensor.load %2, offsets = [0, 0, 0, 0], sizes = [64, 1, 1, 384], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<64x1x1x384xf32>> -> tensor<64x1x1x384xf32>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %workgroup_id_z = hal.interface.workgroup.id[2] : index
  %workgroup_count_z = hal.interface.workgroup.count[2] : index
  scf.for %arg0 = %workgroup_id_z to %c1024 step %workgroup_count_z {
    %5 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_y]
    %6 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_y]
    scf.for %arg1 = %5 to %c384 step %6 {
      %7 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_x]
      %8 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_x]
      scf.for %arg2 = %7 to %c384 step %8 {
        %9 = flow.dispatch.tensor.load %3, offsets = [%arg0, %arg1, %arg2], sizes = [1, 128, 128], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<1024x384x384xf32>> -> tensor<1x128x128xf32>
        %10 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1, 0], sizes = [1, 128, 64], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1024x384x64xf32>> -> tensor<1x128x64xf32>
        %11 = flow.dispatch.tensor.load %1, offsets = [%arg0, 0, %arg2], sizes = [1, 64, 128], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1024x64x384xf32>> -> tensor<1x64x128xf32>
        %12 = scf.for %arg3 = %c0 to %c128 step %c8 iter_args(%arg4 = %9) -> (tensor<1x128x128xf32>) {
          %13 = scf.for %arg5 = %c0 to %c128 step %c32 iter_args(%arg6 = %arg4) -> (tensor<1x128x128xf32>) {
            %extracted_slice = tensor.extract_slice %10[0, %arg3, 0] [1, 8, 64] [1, 1, 1] : tensor<1x128x64xf32> to tensor<1x8x64xf32>
            %extracted_slice_3 = tensor.extract_slice %11[0, 0, %arg5] [1, 64, 32] [1, 1, 1] : tensor<1x64x128xf32> to tensor<1x64x32xf32>
            %extracted_slice_4 = tensor.extract_slice %arg6[0, %arg3, %arg5] [1, 8, 32] [1, 1, 1] : tensor<1x128x128xf32> to tensor<1x8x32xf32>
            %14 = linalg.fill {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[1, 128, 128], [1, 8, 32], [0, 0, 0], [0, 0, 0]]>} ins(%cst : f32) outs(%extracted_slice_4 : tensor<1x8x32xf32>) -> tensor<1x8x32xf32>
            %15 = scf.for %arg7 = %c0 to %c64 step %c16 iter_args(%arg8 = %14) -> (tensor<1x8x32xf32>) {
              %extracted_slice_5 = tensor.extract_slice %extracted_slice[0, 0, %arg7] [1, 8, 16] [1, 1, 1] : tensor<1x8x64xf32> to tensor<1x8x16xf32>
              %extracted_slice_6 = tensor.extract_slice %extracted_slice_3[0, %arg7, 0] [1, 16, 32] [1, 1, 1] : tensor<1x64x32xf32> to tensor<1x16x32xf32>
              %padded = tensor.pad %extracted_slice_5 nofold low[0, 0, 0] high[0, 0, 0] {
              ^bb0(%arg9: index, %arg10: index, %arg11: index):
                tensor.yield %cst : f32
              } : tensor<1x8x16xf32> to tensor<1x8x16xf32>
              %padded_7 = tensor.pad %extracted_slice_6 nofold low[0, 0, 0] high[0, 0, 0] {
              ^bb0(%arg9: index, %arg10: index, %arg11: index):
                tensor.yield %cst : f32
              } : tensor<1x16x32xf32> to tensor<1x16x32xf32>
              %17 = linalg.batch_matmul {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[1, 128, 128, 0], [1, 8, 32, 0], [0, 0, 0, 16], [0, 0, 0, 0]]>} ins(%padded, %padded_7 : tensor<1x8x16xf32>, tensor<1x16x32xf32>) outs(%arg8 : tensor<1x8x32xf32>) -> tensor<1x8x32xf32>
              scf.yield %17 : tensor<1x8x32xf32>
            }
            %16 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} outs(%15 : tensor<1x8x32xf32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[1, 128, 128], [1, 8, 32], [0, 0, 0], [0, 0, 0]]>} {
            ^bb0(%out: f32):
              %17 = linalg.index 0 : index
              %18 = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%arg0, %17)
              %19 = arith.divui %18, %c16 : index
              %20 = linalg.index 2 : index
              %21 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg2, %20, %arg5)
              %extracted = tensor.extract %4[%19, %c0, %c0, %21] : tensor<64x1x1x384xf32>
              %22 = arith.subf %cst_0, %extracted : f32
              %23 = arith.mulf %22, %cst_1 : f32
              %24 = arith.divf %out, %cst_2 : f32
              %25 = arith.addf %24, %23 : f32
              linalg.yield %25 : f32
            } -> tensor<1x8x32xf32>
            %inserted_slice = tensor.insert_slice %16 into %arg6[0, %arg3, %arg5] [1, 8, 32] [1, 1, 1] : tensor<1x8x32xf32> into tensor<1x128x128xf32>
            scf.yield %inserted_slice : tensor<1x128x128xf32>
          }
          scf.yield %13 : tensor<1x128x128xf32>
        }
        flow.dispatch.tensor.store %12, %3, offsets = [%arg0, %arg1, %arg2], sizes = [1, 128, 128], strides = [1, 1, 1] : tensor<1x128x128xf32> -> !flow.dispatch.tensor<writeonly:tensor<1024x384x384xf32>>
      }
    }
  }
  return
}

// -----// IR Dump After GenericVectorization (iree-codegen-generic-vectorization) //----- //
func.func @forward_dispatch_19_batch_matmul_1024x384x384x64_f32() {
  %cst = arith.constant dense<384> : vector<1x8x32xindex>
  %cst_0 = arith.constant dense<8.000000e+00> : vector<1x8x32xf32>
  %cst_1 = arith.constant dense<-3.40282347E+38> : vector<1x8x32xf32>
  %cst_2 = arith.constant dense<1.000000e+00> : vector<1x8x32xf32>
  %cst_3 = arith.constant dense<true> : vector<1x8x32xi1>
  %cst_4 = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]> : vector<32xindex>
  %cst_5 = arith.constant dense<16> : vector<1x8x32xindex>
  %cst_6 = arith.constant dense<0.000000e+00> : vector<1x8x32xf32>
  %c1 = arith.constant 1 : index
  %c64 = arith.constant 64 : index
  %c128 = arith.constant 128 : index
  %c8 = arith.constant 8 : index
  %c32 = arith.constant 32 : index
  %c384 = arith.constant 384 : index
  %c1024 = arith.constant 1024 : index
  %c100761600 = arith.constant 100761600 : index
  %c201424896 = arith.constant 201424896 : index
  %c0 = arith.constant 0 : index
  %c503414784 = arith.constant 503414784 : index
  %cst_7 = arith.constant 0.000000e+00 : f32
  %c16 = arith.constant 16 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c100761600) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1024x384x64xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c201424896) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1024x64x384xf32>>
  %2 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<64x1x1x384xf32>>
  %3 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c503414784) : !flow.dispatch.tensor<writeonly:tensor<1024x384x384xf32>>
  %4 = flow.dispatch.tensor.load %2, offsets = [0, 0, 0, 0], sizes = [64, 1, 1, 384], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<64x1x1x384xf32>> -> tensor<64x1x1x384xf32>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %workgroup_id_z = hal.interface.workgroup.id[2] : index
  %workgroup_count_z = hal.interface.workgroup.count[2] : index
  scf.for %arg0 = %workgroup_id_z to %c1024 step %workgroup_count_z {
    %5 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_y]
    %6 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_y]
    scf.for %arg1 = %5 to %c384 step %6 {
      %7 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_x]
      %8 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_x]
      scf.for %arg2 = %7 to %c384 step %8 {
        %9 = flow.dispatch.tensor.load %3, offsets = [%arg0, %arg1, %arg2], sizes = [1, 128, 128], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<1024x384x384xf32>> -> tensor<1x128x128xf32>
        %10 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1, 0], sizes = [1, 128, 64], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1024x384x64xf32>> -> tensor<1x128x64xf32>
        %11 = flow.dispatch.tensor.load %1, offsets = [%arg0, 0, %arg2], sizes = [1, 64, 128], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1024x64x384xf32>> -> tensor<1x64x128xf32>
        %12 = scf.for %arg3 = %c0 to %c128 step %c8 iter_args(%arg4 = %9) -> (tensor<1x128x128xf32>) {
          %13 = scf.for %arg5 = %c0 to %c128 step %c32 iter_args(%arg6 = %arg4) -> (tensor<1x128x128xf32>) {
            %extracted_slice = tensor.extract_slice %10[0, %arg3, 0] [1, 8, 64] [1, 1, 1] : tensor<1x128x64xf32> to tensor<1x8x64xf32>
            %extracted_slice_8 = tensor.extract_slice %11[0, 0, %arg5] [1, 64, 32] [1, 1, 1] : tensor<1x64x128xf32> to tensor<1x64x32xf32>
            %extracted_slice_9 = tensor.extract_slice %arg6[0, %arg3, %arg5] [1, 8, 32] [1, 1, 1] : tensor<1x128x128xf32> to tensor<1x8x32xf32>
            %14 = vector.transfer_write %cst_6, %extracted_slice_9[%c0, %c0, %c0] {in_bounds = [true, true, true]} : vector<1x8x32xf32>, tensor<1x8x32xf32>
            %15 = scf.for %arg7 = %c0 to %c64 step %c16 iter_args(%arg8 = %14) -> (tensor<1x8x32xf32>) {
              %extracted_slice_10 = tensor.extract_slice %extracted_slice[0, 0, %arg7] [1, 8, 16] [1, 1, 1] : tensor<1x8x64xf32> to tensor<1x8x16xf32>
              %extracted_slice_11 = tensor.extract_slice %extracted_slice_8[0, %arg7, 0] [1, 16, 32] [1, 1, 1] : tensor<1x64x32xf32> to tensor<1x16x32xf32>
              %32 = vector.create_mask %c1, %c8, %c16 : vector<1x8x16xi1>
              %33 = vector.transfer_read %extracted_slice_10[%c0, %c0, %c0], %cst_7, %32 {in_bounds = [true, true, true]} : tensor<1x8x16xf32>, vector<1x8x16xf32>
              %34 = vector.create_mask %c1, %c16, %c32 : vector<1x16x32xi1>
              %35 = vector.transfer_read %extracted_slice_11[%c0, %c0, %c0], %cst_7, %34 {in_bounds = [true, true, true]} : tensor<1x16x32xf32>, vector<1x16x32xf32>
              %36 = vector.transfer_read %arg8[%c0, %c0, %c0], %cst_7 {in_bounds = [true, true, true]} : tensor<1x8x32xf32>, vector<1x8x32xf32>
              %37 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"], kind = #vector.kind<add>} %33, %35, %36 : vector<1x8x16xf32>, vector<1x16x32xf32> into vector<1x8x32xf32>
              %38 = vector.transfer_write %37, %arg8[%c0, %c0, %c0] {in_bounds = [true, true, true]} : vector<1x8x32xf32>, tensor<1x8x32xf32>
              scf.yield %38 : tensor<1x8x32xf32>
            }
            %16 = vector.transfer_read %15[%c0, %c0, %c0], %cst_7 {in_bounds = [true, true, true]} : tensor<1x8x32xf32>, vector<1x8x32xf32>
            %17 = vector.broadcast %arg0 : index to vector<1x8x32xindex>
            %18 = arith.divui %17, %cst_5 : vector<1x8x32xindex>
            %19 = vector.broadcast %arg2 : index to vector<32xindex>
            %20 = arith.addi %19, %cst_4 : vector<32xindex>
            %21 = vector.broadcast %arg5 : index to vector<32xindex>
            %22 = arith.addi %20, %21 : vector<32xindex>
            %23 = arith.muli %18, %cst : vector<1x8x32xindex>
            %24 = vector.broadcast %22 : vector<32xindex> to vector<1x8x32xindex>
            %25 = arith.addi %24, %23 : vector<1x8x32xindex>
            %26 = vector.gather %4[%c0, %c0, %c0, %c0] [%25], %cst_3, %cst_6 : tensor<64x1x1x384xf32>, vector<1x8x32xindex>, vector<1x8x32xi1>, vector<1x8x32xf32> into vector<1x8x32xf32>
            %27 = arith.subf %cst_2, %26 : vector<1x8x32xf32>
            %28 = arith.mulf %27, %cst_1 : vector<1x8x32xf32>
            %29 = arith.divf %16, %cst_0 : vector<1x8x32xf32>
            %30 = arith.addf %29, %28 : vector<1x8x32xf32>
            %31 = vector.transfer_write %30, %arg6[%c0, %arg3, %arg5] {in_bounds = [true, true, true]} : vector<1x8x32xf32>, tensor<1x128x128xf32>
            scf.yield %31 : tensor<1x128x128xf32>
          }
          scf.yield %13 : tensor<1x128x128xf32>
        }
        flow.dispatch.tensor.store %12, %3, offsets = [%arg0, %arg1, %arg2], sizes = [1, 128, 128], strides = [1, 1, 1] : tensor<1x128x128xf32> -> !flow.dispatch.tensor<writeonly:tensor<1024x384x384xf32>>
      }
    }
  }
  return
}
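For illustration, here is a hand-written sketch of the kind of lowering we would like instead of the vector.gather above (not actual compiler output; the SSA names %row_idx, %col_base, %row and %bcast are made up, while %arg0, %arg2, %arg5, %4, %c0, %c16 and %cst_7 refer to the vectorized IR). Within the 1x8x32 tile the gather indices only vary along the innermost dimension, so a single 1-D contiguous read plus a broadcast would be enough:

// Hypothetical replacement for the vector.gather (%26) above. The row index
// (%arg0 / 16) and the column base (%arg2 + %arg5) are uniform across the tile.
%row_idx  = arith.divui %arg0, %c16 : index
%col_base = arith.addi %arg2, %arg5 : index
%row   = vector.transfer_read %4[%row_idx, %c0, %c0, %col_base], %cst_7 {in_bounds = [true]} : tensor<64x1x1x384xf32>, vector<32xf32>
%bcast = vector.broadcast %row : vector<32xf32> to vector<1x8x32xf32>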

Repro:

hal.executable public @forward_dispatch_19 {
  hal.executable.variant public @system_elf_x86_64, target = <"llvm-cpu", "system-elf-x86_64", {cpu = "cascadelake", cpu_features = "+cmov,+mmx,+popcnt,+sse,+sse2,+sse3,+ssse3,+sse4.1,+sse4.2,+avx,+avx2,+fma,+avx512f,+bmi,+bmi2,+aes,+pclmul,+avx512vl,+avx512bw,+avx512dq,+avx512cd,+avx512vnni,+adx,+clflushopt,+clwb,+cx16,+cx8,+crc32,+f16c,+fsgsbase,+fxsr,+invpcid,+lzcnt,+movbe,+pku,+prfchw,+rdrnd,+rdseed,+sahf,+x87,+xsave,+xsavec,+xsaveopt,+xsaves", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", native_vector_size = 64 : index, target_triple = "x86_64-unknown-linux-elf", ukernels = true}> {
    hal.executable.export public @forward_dispatch_19_batch_matmul_1024x384x384x64_f32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>) {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @forward_dispatch_19_batch_matmul_1024x384x384x64_f32() {
        %c100761600 = arith.constant 100761600 : index
        %c201424896 = arith.constant 201424896 : index
        %c0 = arith.constant 0 : index
        %c503414784 = arith.constant 503414784 : index
        %cst = arith.constant 0.000000e+00 : f32
        %c16 = arith.constant 16 : index
        %cst_0 = arith.constant 1.000000e+00 : f32
        %cst_1 = arith.constant -3.40282347E+38 : f32
        %cst_2 = arith.constant 8.000000e+00 : f32
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c100761600) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1024x384x64xf32>>
        %1 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c201424896) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1024x64x384xf32>>
        %2 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<64x1x1x384xf32>>
        %3 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c503414784) : !flow.dispatch.tensor<writeonly:tensor<1024x384x384xf32>>
        %4 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [1024, 384, 64], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1024x384x64xf32>> -> tensor<1024x384x64xf32>
        %5 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0], sizes = [1024, 64, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1024x64x384xf32>> -> tensor<1024x64x384xf32>
        %6 = flow.dispatch.tensor.load %2, offsets = [0, 0, 0, 0], sizes = [64, 1, 1, 384], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<64x1x1x384xf32>> -> tensor<64x1x1x384xf32>
        %7 = tensor.empty() : tensor<1024x384x384xf32>
        %8 = linalg.fill ins(%cst : f32) outs(%7 : tensor<1024x384x384xf32>) -> tensor<1024x384x384xf32>
        %9 = linalg.batch_matmul ins(%4, %5 : tensor<1024x384x64xf32>, tensor<1024x64x384xf32>) outs(%8 : tensor<1024x384x384xf32>) -> tensor<1024x384x384xf32>
        %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%9 : tensor<1024x384x384xf32>) outs(%7 : tensor<1024x384x384xf32>) {
        ^bb0(%in: f32, %out: f32):
          %11 = linalg.index 0 : index
          %12 = arith.divui %11, %c16 : index
          %13 = linalg.index 2 : index
          %extracted = tensor.extract %6[%12, %c0, %c0, %13] : tensor<64x1x1x384xf32>
          %14 = arith.subf %cst_0, %extracted : f32
          %15 = arith.mulf %14, %cst_1 : f32
          %16 = arith.divf %in, %cst_2 : f32
          %17 = arith.addf %16, %15 : f32
          linalg.yield %17 : f32
        } -> tensor<1024x384x384xf32>
        flow.dispatch.tensor.store %10, %3, offsets = [0, 0, 0], sizes = [1024, 384, 384], strides = [1, 1, 1] : tensor<1024x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<1024x384x384xf32>>
        return
      }
    }
  }
}
iree-compile -output-format=vm-bytecode -iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-triple=x86_64-unknown-linux-elf -iree-llvmcpu-target-cpu=cascadelake  module_forward_dispatch_19.mlir

FYI: @banach-space

banach-space commented 1 year ago

Thanks for this repro!

This is not a 1-D contiguous load, and such loads are currently classified as gather loads by getTensorExtractMemoryAccessPattern. This shouldn't be too difficult to support.
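For reference, a minimal hand-written example (function name and shapes made up for illustration) of roughly the kind of pattern that is already handled as a contiguous load - the leading index is loop-invariant and only the trailing index varies, in step with the trailing iteration dimension:

func.func @contiguous_1d_example(%src: tensor<80x16xf32>, %init: tensor<1x16xf32>) -> tensor<1x16xf32> {
  %c7 = arith.constant 7 : index
  %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]
  } outs(%init : tensor<1x16xf32>) {
  ^bb0(%out: f32):
    // Leading index is loop-invariant; the trailing index follows the
    // trailing iteration dimension with step 1, i.e. a contiguous row read.
    %j = linalg.index 1 : index
    %e = tensor.extract %src[%c7, %j] : tensor<80x16xf32>
    linalg.yield %e : f32
  } -> tensor<1x16xf32>
  return %0 : tensor<1x16xf32>
}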

-Andrzej

meshtag commented 10 months ago

Is there something that needs to be done here?

banach-space commented 10 months ago

Is there something that needs to be done here?

Apologies, I dropped the ball there :( Thanks for the reminder - I should have some spare cycles in the next week or two to properly triage this.

meshtag commented 10 months ago

@banach-space, I was having a look at the memory access pattern code you attached above. Shouldn't this be vector<1x2x4xi32> or vector<2x1x4xi32>, or am I missing something crucial?

meshtag commented 9 months ago

@banach-space, if you haven't started working on this already, I'd like to help by creating a patch for it during my spare time.

I am thinking of starting with generating vector.transfer_read for the following minimal example

func.func @example_check(%arg0: tensor<80x16xf32>, %extracted_slice : tensor<4x4xf32>) -> tensor<4x4xf32> {
  %1 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]
  } outs(%extracted_slice : tensor<4x4xf32>) {
  ^bb0(%out: f32):
    %2 = linalg.index 0 : index
    %3 = linalg.index 1 : index
    %extracted = tensor.extract %arg0[%2, %3] : tensor<80x16xf32>
    linalg.yield %extracted : f32
  } -> tensor<4x4xf32>
  return %1 : tensor<4x4xf32>
}

which currently gets lowered to

func.func @example_check(%arg0: tensor<80x16xf32>, %arg1: tensor<4x4xf32>) -> tensor<4x4xf32> {
    %cst = arith.constant dense<[0, 1, 2, 3]> : vector<4xindex>
    %cst_0 = arith.constant dense<true> : vector<4x4xi1>
    %cst_1 = arith.constant dense<0.000000e+00> : vector<4x4xf32>
    %c0 = arith.constant 0 : index
    %cst_2 = arith.constant dense<16> : vector<4x4xindex>
    %0 = vector.broadcast %cst : vector<4xindex> to vector<4x4xindex>
    %1 = arith.muli %0, %cst_2 : vector<4x4xindex>
    %2 = vector.transpose %1, [1, 0] : vector<4x4xindex> to vector<4x4xindex>
    %3 = vector.broadcast %cst : vector<4xindex> to vector<4x4xindex>
    %4 = arith.addi %3, %2 : vector<4x4xindex>
    %5 = vector.gather %arg0[%c0, %c0] [%4], %cst_0, %cst_1 : tensor<80x16xf32>, vector<4x4xindex>, vector<4x4xi1>, vector<4x4xf32> into vector<4x4xf32>
    %6 = vector.transfer_write %5, %arg1[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf32>, tensor<4x4xf32>
    return %6 : tensor<4x4xf32>
}
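The lowering I'm aiming for would look roughly like this (a hand-written sketch, not actual compiler output): each row of the 4x4 tile is strided by 16 in %arg0 but contiguous within the row, so an n-D vector.transfer_read can replace the gather.

func.func @example_check(%arg0: tensor<80x16xf32>, %arg1: tensor<4x4xf32>) -> tensor<4x4xf32> {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  // Reads a 4x4 slab starting at [0, 0]: rows strided by 16, elements
  // contiguous within each row.
  %0 = vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<80x16xf32>, vector<4x4xf32>
  %1 = vector.transfer_write %0, %arg1[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf32>, tensor<4x4xf32>
  return %1 : tensor<4x4xf32>
}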

Edit: I have created a first draft pull request for this and would love to get some feedback. Thanks!

banach-space commented 9 months ago

Hi @meshtag, thanks for looking into this, and apologies for not responding earlier - I was a bit overwhelmed with other activities.

@banach-space, I was having a look at the memory access pattern code you attached above. Shouldn't this be vector<1x2x4xi32> or vector<2x1x4xi32>, or am I missing something crucial?

Note that that method analyses tensor.extract Ops - at that level there are no vectors, only tensors; hence tensor<...> rather than vector<...> in:

//    * an n-D vector, like `tensor<1x2x4xi32>` or `tensor<2x1x4xi32>`

Naming is hard :/ Here's a very small update that will hopefully disambiguate things - https://github.com/llvm/llvm-project/pull/76797.

As for the original issue, this is the root cause leading to gather loads:

          %11 = linalg.index 0 : index
          %12 = arith.divui %11, %c16 : index
          %13 = linalg.index 2 : index
          %extracted = tensor.extract %6[%12, %c0, %c0, %13] : tensor<64x1x1x384xf32>

Specifically, %12 = arith.divui %11, %c16 : index. That index stays constant for 16 iterations of the leading dim and then increases by 1. That's not really a contiguous access :( It's not clear to me that the vectoriser would ever be able to do anything clever here.
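To make that concrete (values computed by hand):

// %11 (linalg.index 0): 0  1  2 ... 15 | 16 17 ... 31 | 32 ...
// %12 = %11 / 16      : 0  0  0 ...  0 |  1  1 ...  1 |  2 ...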

Instead of trying to fix this in the vectoriser, we should tile the following operation before vectorisation:

%10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%9 : tensor<1024x384x384xf32>) outs(%7 : tensor<1024x384x384xf32>) {
        ^bb0(%in: f32, %out: f32):
          %11 = linalg.index 0 : index
          %12 = arith.divui %11, %c16 : index
          %13 = linalg.index 2 : index
          %extracted = tensor.extract %6[%12, %c0, %c0, %13] : tensor<64x1x1x384xf32>
          %14 = arith.subf %cst_0, %extracted : f32
          %15 = arith.mulf %14, %cst_1 : f32
          %16 = arith.divf %in, %cst_2 : f32
          %17 = arith.addf %16, %15 : f32
          linalg.yield %17 : f32
        } -> tensor<1024x384x384xf32>
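Roughly, after tiling (a hand-written sketch with hypothetical names: %iv stands for the induction variable of the loop created by tiling d0, %in_slice / %out_slice for the corresponding slices of %9 and %7, and %6, %c0, %c16, %cst_0, %cst_1, %cst_2 are as in the repro above), the divui-based index could be hoisted out of the body as a scalar, leaving a tensor.extract that only varies along d2:

%row = arith.divui %iv, %c16 : index  // uniform within the tile
%tile = linalg.generic {
    indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
                     affine_map<(d0, d1, d2) -> (d0, d1, d2)>],
    iterator_types = ["parallel", "parallel", "parallel"]}
    ins(%in_slice : tensor<1x384x384xf32>) outs(%out_slice : tensor<1x384x384xf32>) {
^bb0(%in: f32, %out: f32):
  %col = linalg.index 2 : index
  %extracted = tensor.extract %6[%row, %c0, %c0, %col] : tensor<64x1x1x384xf32>
  %sub = arith.subf %cst_0, %extracted : f32
  %mul = arith.mulf %sub, %cst_1 : f32
  %div = arith.divf %in, %cst_2 : f32
  %add = arith.addf %div, %mul : f32
  linalg.yield %add : f32
} -> tensor<1x384x384xf32>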

@dcaballe WDYT?

@banach-space, if you haven't started working on this already, I'd like to help by creating a patch for it during my spare time.

I am thinking of starting with generating vector.transfer_read for the following minimal example

You are proposing to extend the vectoriser so that it can generate contiguous loads for reading n-D vectors - 2D in your example. TBH, it's not obvious to me whether that would help for this specific issue. But I might be missing something. Do you have any other example where this would be helpful?

@dcaballe If you agree that the access pattern in your example is indeed a gather load, then I suggest closing this and to continue the discussion on contiguous loads of n-D vectors elsewhere (perhaps https://github.com/llvm/llvm-project/pull/76436).

meshtag commented 9 months ago

Naming is hard :/ Here's a very small update that will hopefully disambiguate things - https://github.com/llvm/llvm-project/pull/76797.

Thanks!

You are proposing to extend the vectoriser so that it can generate contiguous loads for reading n-D vectors - 2D in your example. TBH, it's not obvious to me whether that would help for this specific issue. But I might be missing something. Do you have any other example where this would be helpful?

I did not intend to solve this issue completely with that PR (and I agree that the change would not really benefit this case at the moment). To me, it seemed like the next simplest case, continuing from what exists already, where we can try to get vector.transfer_read instead of vector.gather. There is a scenario in one of my downstream projects where we would benefit from that change directly. It's still a work in progress though.