Optimize codegen for so-called "unbroadcast" layout conversions

victor-eds commented 1 week ago

The -tritonintelgpu-optimize-elementwise-parallelism pass introduces "unbroadcast" layout conversions. These should be total no-ops, as they simply involve dropping values that are held redundantly by multiple threads, e.g., going from:

#blocked = #triton_gpu.blocked<{sizePerThread = [16, 1], threadsPerWarp = [1, 16], warpsPerCTA = [1, 1], order = [0, 1]}>
#slice = #triton_gpu.slice<{dim = 1, parent = #blocked}>

to:

#blocked1 = #triton_gpu.blocked<{sizePerThread = [1], threadsPerWarp = [16], warpsPerCTA = [1], order = [0]}>

Note that for a tensor<16xf32>, this would mean going from:

[  T0:0|  T1:0|  T2:0|  T3:0|  T4:0|  T5:0|  T6:0|  T7:0|  T8:0|  T9:0| T10:0| T11:0| T12:0| T13:0| T14:0| T15:0,
   T0:1|  T1:1|  T2:1|  T3:1|  T4:1|  T5:1|  T6:1|  T7:1|  T8:1|  T9:1| T10:1| T11:1| T12:1| T13:1| T14:1| T15:1,
   T0:2|  T1:2|  T2:2|  T3:2|  T4:2|  T5:2|  T6:2|  T7:2|  T8:2|  T9:2| T10:2| T11:2| T12:2| T13:2| T14:2| T15:2,
   T0:3|  T1:3|  T2:3|  T3:3|  T4:3|  T5:3|  T6:3|  T7:3|  T8:3|  T9:3| T10:3| T11:3| T12:3| T13:3| T14:3| T15:3,
   T0:4|  T1:4|  T2:4|  T3:4|  T4:4|  T5:4|  T6:4|  T7:4|  T8:4|  T9:4| T10:4| T11:4| T12:4| T13:4| T14:4| T15:4,
   T0:5|  T1:5|  T2:5|  T3:5|  T4:5|  T5:5|  T6:5|  T7:5|  T8:5|  T9:5| T10:5| T11:5| T12:5| T13:5| T14:5| T15:5,
   T0:6|  T1:6|  T2:6|  T3:6|  T4:6|  T5:6|  T6:6|  T7:6|  T8:6|  T9:6| T10:6| T11:6| T12:6| T13:6| T14:6| T15:6,
   T0:7|  T1:7|  T2:7|  T3:7|  T4:7|  T5:7|  T6:7|  T7:7|  T8:7|  T9:7| T10:7| T11:7| T12:7| T13:7| T14:7| T15:7,
   T0:8|  T1:8|  T2:8|  T3:8|  T4:8|  T5:8|  T6:8|  T7:8|  T8:8|  T9:8| T10:8| T11:8| T12:8| T13:8| T14:8| T15:8,
   T0:9|  T1:9|  T2:9|  T3:9|  T4:9|  T5:9|  T6:9|  T7:9|  T8:9|  T9:9| T10:9| T11:9| T12:9| T13:9| T14:9| T15:9,
  T0:10| T1:10| T2:10| T3:10| T4:10| T5:10| T6:10| T7:10| T8:10| T9:10|T10:10|T11:10|T12:10|T13:10|T14:10|T15:10,
  T0:11| T1:11| T2:11| T3:11| T4:11| T5:11| T6:11| T7:11| T8:11| T9:11|T10:11|T11:11|T12:11|T13:11|T14:11|T15:11,
  T0:12| T1:12| T2:12| T3:12| T4:12| T5:12| T6:12| T7:12| T8:12| T9:12|T10:12|T11:12|T12:12|T13:12|T14:12|T15:12,
  T0:13| T1:13| T2:13| T3:13| T4:13| T5:13| T6:13| T7:13| T8:13| T9:13|T10:13|T11:13|T12:13|T13:13|T14:13|T15:13,
  T0:14| T1:14| T2:14| T3:14| T4:14| T5:14| T6:14| T7:14| T8:14| T9:14|T10:14|T11:14|T12:14|T13:14|T14:14|T15:14,
  T0:15| T1:15| T2:15| T3:15| T4:15| T5:15| T6:15| T7:15| T8:15| T9:15|T10:15|T11:15|T12:15|T13:15|T14:15|T15:15]

to:

[ T0:0,  T1:0,  T2:0,  T3:0,  T4:0,  T5:0,  T6:0,  T7:0,  T8:0,  T9:0, T10:0, T11:0, T12:0, T13:0, T14:0, T15:0]
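To make the "drop redundant copies" reading of the two dumps concrete, here is a minimal Python sketch. It is a simplified model, not Triton or backend code: the names `source_layout`, `dest_layout`, and `unbroadcast_plan` are hypothetical, and each layout is represented only as a map from tensor element to the set of (thread, register) pairs holding it, mirroring the `Tt:r` notation above. Under those assumptions, it checks that every value a thread needs after the conversion is already held by that same thread before it, so no cross-thread data movement is required.

```python
NUM_THREADS = 16  # threadsPerWarp along the sliced dimension
NUM_ELEMS = 16    # tensor<16xf32>


def source_layout():
    """#slice layout: element e is held by every thread, in register e."""
    return {e: {(t, e) for t in range(NUM_THREADS)} for e in range(NUM_ELEMS)}


def dest_layout():
    """#blocked1 layout: element e is held only by thread e, in register 0."""
    return {e: {(e, 0)} for e in range(NUM_ELEMS)}


def unbroadcast_plan(src, dst):
    """Map each destination (thread, register) to a source register in the
    same thread. Raises if the value would have to come from another thread."""
    plan = {}
    for elem, holders in dst.items():
        for thread, reg in holders:
            local_regs = [r for (t, r) in src[elem] if t == thread]
            if not local_regs:
                raise ValueError(f"element {elem} is not local to thread {thread}")
            plan[(thread, reg)] = min(local_regs)
    return plan


plan = unbroadcast_plan(source_layout(), dest_layout())
for (thread, reg), src_reg in sorted(plan.items()):
    print(f"T{thread}:{reg} <- T{thread}:{src_reg}")
```

In this model, thread t keeps the value it already has in source register t and discards the other 15, which is the sense in which the conversion only drops values.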

victor-eds commented 1 week ago

WIP. Should be ready soonish.

victor-eds commented 4 days ago

Code is ready to push. Evaluating whether this is needed after all. Will either push a PR or close this issue as won't-fix this week.

victor-eds commented 3 days ago

Not needed. The elementwise (EW) optimization pass isn't needed, as we can modify earlier passes so that optimal layouts are propagated.

intel / intel-xpu-backend-for-triton

Optimize codegen for so-called "unbroadcast" layout conversions #2674