intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

Question on the vector add example #2099

Open pbchekin opened 3 weeks ago

pbchekin commented 3 weeks ago

Received this:

A question after analyzing the generated LLVM IR and comparing it against the equivalent SYCL vector add. In a typical SYCL program, we load the input data as vectors of size 4 and perform the addition on those vectors. In the Triton case, we load vectors from memory but then extract each element, add the elements as scalars, and insert the results back. Attached are the shader dumps from a Triton vector add run, which include the input to the IGC compiler before unification and the IGC-optimized LLVM IR.

OCL_asm67166d3621db5283_beforeUnification.zip OCL_asm67166d3621db5283_optimized.zip

etiotto commented 1 week ago

IGC has a pass that scalarizes the vector addition. Before that pass the LLVM IR is:

  %42 = fadd <4 x float> %bc3, %bc9, !dbg !373
  %43 = fadd <4 x float> %bc3, %bc9, !dbg !373
  %44 = fadd <4 x float> %bc3, %bc9, !dbg !373
  %45 = shufflevector <4 x float> %43, <4 x float> %44, <4 x i32> <i32 0, i32 5, i32 undef, i32 undef>, !dbg !375
  %46 = fadd <4 x float> %bc3, %bc9, !dbg !373
  %47 = shufflevector <4 x float> %45, <4 x float> %46, <4 x i32> <i32 0, i32 1, i32 6, i32 undef>, !dbg !375
  %48 = shufflevector <4 x float> %47, <4 x float> %42, <4 x i32> <i32 0, i32 1, i32 2, i32 7>, !dbg !375
  %49 = sext i32 %9 to i64, !dbg !374
  %50 = getelementptr float, float addrspace(1)* %2, i64 %49, !dbg !374
  %51 = bitcast float addrspace(1)* %50 to <4 x float> addrspace(1)*, !dbg !375
  store <4 x float> %48, <4 x float> addrspace(1)* %51, align 16, !dbg !375

and after that pass the vector add is scalarized:


59:                                               ; preds = %52, %51
  %bc1226 = phi float [ %55, %52 ], [ 0.000000e+00, %51 ], !dbg !371
  %bc1227 = phi float [ %56, %52 ], [ 0.000000e+00, %51 ], !dbg !371
  %bc1228 = phi float [ %57, %52 ], [ 0.000000e+00, %51 ], !dbg !371
  %bc1229 = phi float [ %58, %52 ], [ 0.000000e+00, %51 ], !dbg !371
  %60 = fadd float %bc618, %bc1226, !dbg !372
  %61 = fadd float %bc619, %bc1227, !dbg !372
  %62 = fadd float %bc620, %bc1228, !dbg !372
  %63 = fadd float %bc621, %bc1229, !dbg !372
  %64 = fadd float %bc618, %bc1226, !dbg !372
  %65 = fadd float %bc619, %bc1227, !dbg !372
  %66 = fadd float %bc620, %bc1228, !dbg !372
  %67 = fadd float %bc621, %bc1229, !dbg !372
  %68 = fadd float %bc618, %bc1226, !dbg !372
  %69 = fadd float %bc619, %bc1227, !dbg !372
  %70 = fadd float %bc620, %bc1228, !dbg !372
  %71 = fadd float %bc621, %bc1229, !dbg !372
  %72 = fadd float %bc618, %bc1226, !dbg !372
  %73 = fadd float %bc619, %bc1227, !dbg !372
  %74 = fadd float %bc620, %bc1228, !dbg !372
  %75 = fadd float %bc621, %bc1229, !dbg !372
  %76 = getelementptr float, float addrspace(1)* %2, i64 %21, !dbg !373
  br i1 %19, label %77, label %101, !dbg !374

So this is a transformation performed by IGC; Triton generates the vector code. It is unclear at this point why the SYCL program is not scalarized. @pbchekin, who is the contact, and can we get the SYCL reproducer along with the compilation command?
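To make the two IR shapes easier to compare, here is a small Python sketch that emulates the pre-scalarization sequence (`fadd <4 x float>` followed by the `shufflevector` chain) and checks that it produces the same four lanes as four scalar adds. The helper names `fadd_vec` and `shufflevector` and the sample input values are made up for illustration; only the shuffle masks are taken from the IR above.

```python
UNDEF = None  # stands in for an LLVM `undef` lane


def fadd_vec(a, b):
    """Lane-wise add, modelling `fadd <4 x float> %a, %b`."""
    return [x + y for x, y in zip(a, b)]


def shufflevector(a, b, mask):
    """Select lanes from the concatenation of a and b.

    Mask index i < len(a) picks a[i]; otherwise it picks b[i - len(a)].
    An UNDEF mask entry yields an UNDEF result lane.
    """
    lanes = a + b
    return [UNDEF if m is UNDEF else lanes[m] for m in mask]


# Hypothetical input vectors (the real values come from the loads).
bc3 = [1.0, 2.0, 3.0, 4.0]
bc9 = [10.0, 20.0, 30.0, 40.0]

# Mirrors %42 .. %48 from the IR before the IGC scalarization pass.
v42 = fadd_vec(bc3, bc9)
v43 = fadd_vec(bc3, bc9)
v44 = fadd_vec(bc3, bc9)
v45 = shufflevector(v43, v44, [0, 5, UNDEF, UNDEF])
v46 = fadd_vec(bc3, bc9)
v47 = shufflevector(v45, v46, [0, 1, 6, UNDEF])
v48 = shufflevector(v47, v42, [0, 1, 2, 7])

# Scalarized form: one scalar fadd per lane.
scalarized = [bc3[i] + bc9[i] for i in range(4)]

print(v48 == scalarized)
```

Each shuffle takes exactly one new lane from a fresh vector add, so the value stored in `%48` is just the lane-wise sum. The scalarized form computes the same values with one scalar `fadd` per lane.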

whitneywhtsang commented 1 week ago

@etiotto Can you give open-linux-driver-ci-dev_igc-17737 a try? It contains a recent change which makes that IGC pass more restrictive.

alexbaden commented 1 week ago

Are we confusing vector types and vectorization? SYCL has a vec4 type which is syntactic sugar for unpacking a struct. https://developer.codeplay.com/products/computecpp/ce/2.11.0/api-reference/vec__types__defines_8h.html

etiotto commented 1 week ago

> Are we confusing vector types and vectorization? SYCL has a vec4 type which is syntactic sugar for unpacking a struct. https://developer.codeplay.com/products/computecpp/ce/2.11.0/api-reference/vec__types__defines_8h.html

I don't have the SYCL program; however, from the original question I am guessing that the LLVM IR generated by SYCL contains vector adds and that, for some reason, IGC doesn't scalarize them. When we get the SYCL program we can check the LLVM IR it generates.
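Once both dumps are available, a quick way to compare them is to count vector versus scalar `fadd` instructions in the textual IR. The Python sketch below assumes the dumps are plain LLVM IR text. The function name `count_fadds` and the two sample strings are illustrative and not part of any IGC or Triton tooling.

```python
import re

# `fadd`, optional fast-math flags, then a vector type such as <4 x float>.
VECTOR_FADD = re.compile(r"=\s*fadd\s+(?:[a-z]+\s+)*<\d+ x (?:half|float|double)>")
# `fadd`, optional fast-math flags, then a scalar floating-point type.
SCALAR_FADD = re.compile(r"=\s*fadd\s+(?:[a-z]+\s+)*(?:half|float|double)\b")


def count_fadds(ir_text):
    """Return how many vector and scalar fadd instructions appear in ir_text."""
    return {
        "vector": len(VECTOR_FADD.findall(ir_text)),
        "scalar": len(SCALAR_FADD.findall(ir_text)),
    }


# Short excerpts in the style of the two dumps quoted in this thread.
before = (
    "  %42 = fadd <4 x float> %bc3, %bc9, !dbg !373\n"
    "  %49 = sext i32 %9 to i64, !dbg !374\n"
)
after = (
    "  %60 = fadd float %bc618, %bc1226, !dbg !372\n"
    "  %61 = fadd float %bc619, %bc1227, !dbg !372\n"
)

print(count_fadds(before))
print(count_fadds(after))
```

Running the same count on the SYCL dump before and after the IGC pass would show whether its vector adds survive.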

etiotto commented 1 week ago

@pbchekin, do you have the contact info for the person who asked the original question?