Closed: RKSimon closed this pull request 5 months ago
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-llvm-transforms
Author: Simon Pilgrim (RKSimon)
I didn't really get why it makes sense to multiply the uop fetch rate by the mispredict penalty.
Also, while znver4 might in theory support 9 uops per cycle, isn't it limited by the 6-wide renamer in practice?
> isn't it limited by the 6-wide renamer in practice?
I am not sure why you mention the renamer. It has a role to play in throughput, but the limit is not strictly the renamer's capacity, right? The renamer's effective width will vary case by case, but the theoretical limit is 9 uops. Also, renaming can be split across the integer and FP (vector) domains. So I don't think we should restrict this value to the renamer's capability.
@RKSimon I think we should add a tuning flag controlling whether a subtarget is willing to use LoopMicroOpBufferSize for unrolling decisions. I agree that the metric you are proposing serves the purpose, but the term LoopMicroOpBufferSize itself is misleading and not representative.
We shouldn't need a TLI/TTI control for this: either removing the LoopMicroOpBufferSize entry (see znver1/2) or explicitly setting it to 0 has a similar effect. But I'm not certain whether we want to do this for znver3/4; I don't have access to hardware to test it.
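For context, this mirrors (approximately) how the generic `BasicTTIImplBase::getUnrollingPreferences` consumes the scheduler model entry. The standalone sketch below illustrates that behaviour; it is not LLVM's actual code, and the struct/function names here are illustrative only:

```cpp
#include <cstdio>

// Standalone sketch (not LLVM's actual code) of why dropping the
// LoopMicroOpBufferSize entry and setting it to 0 behave alike: a
// zero/missing value leaves partial and runtime unrolling disabled.
struct UnrollingPreferences {
  bool Partial = false;
  bool Runtime = false;
  unsigned PartialThreshold = 0;
};

void applyLoopMicroOpBufferSize(unsigned LoopMicroOpBufferSize,
                                UnrollingPreferences &UP) {
  if (LoopMicroOpBufferSize == 0)
    return; // No entry (znver1/2): partial unrolling stays off.
  UP.Partial = UP.Runtime = true;
  UP.PartialThreshold = LoopMicroOpBufferSize;
}

int main() {
  UnrollingPreferences Zn2, Zn3;
  applyLoopMicroOpBufferSize(0, Zn2);   // znver1/2: entry removed
  applyLoopMicroOpBufferSize(512, Zn3); // znver3/4: current value
  std::printf("znver2: partial=%d threshold=%u\n", Zn2.Partial,
              Zn2.PartialThreshold);
  std::printf("znver3: partial=%d threshold=%u\n", Zn3.Partial,
              Zn3.PartialThreshold);
  return 0;
}
```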
Would people prefer we just drop the LoopMicroOpBufferSize entry from the znver3/4 models (same as znver1/2)? This prevents most loop unrolling, and we then rely on the CPU op cache's higher decode rate to get higher performance (but we end up testing every loop).
I have at least one counterexample that gains from the LoopMicroOpBufferSize setting we have for znver3/4. Let us go with your derivation of the LoopMicroOpBufferSize metric based on the misprediction penalty.
matrix_vector_mul.zip
With -unroll-runtime, we can see the attached example gaining. So let us go with the metric @RKSimon suggested.
Thanks @ganeshgit - are you happy to accept this patch as it is then?
@RKSimon Sure, thanks a lot! LGTM!
The znver3/4 scheduler models previously associated LoopMicroOpBufferSize with the maximum size of their op caches; when this led to quadratic complexity issues, the value was reduced to 512 uops, based mainly on compilation time and not on its effectiveness on runtime performance.
From a runtime performance POV, a large LoopMicroOpBufferSize leads to a higher number of loop unrolls, meaning the CPU has to rely on the frontend decode rate (4 ins/cy max) for much longer to fill the op cache before looping begins and we can make use of the faster op cache rate (8/9 ops/cy).
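To make that tradeoff concrete, here is a toy calculation using only the numbers above (4 ins/cy decode vs 8 ops/cy from the op cache on znver3), assuming one op per instruction for simplicity; an illustration, not a cycle-accurate model:

```cpp
#include <cstdio>

// Toy model of the cold-start cost described above: the first pass
// over an unrolled loop body is limited by the decoders (4 ins/cy);
// only once the body is op-cache resident do later passes run at the
// op cache rate (8 ops/cy on znver3). Assumes 1 uop per instruction.
int main() {
  const unsigned DecodeRate = 4, OpCacheRate = 8;
  const unsigned Bodies[] = {64, 512}; // small vs heavily unrolled body
  for (unsigned BodyOps : Bodies) {
    unsigned FirstPassCycles = BodyOps / DecodeRate;  // fills op cache
    unsigned WarmPassCycles = BodyOps / OpCacheRate;  // later passes
    std::printf("body=%u ops: first pass ~%ucy, warm pass ~%ucy\n",
                BodyOps, FirstPassCycles, WarmPassCycles);
  }
  return 0;
}
```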
This patch proposes we instead cap LoopMicroOpBufferSize based on the maximum rate from the op cache (znver3 = 8 ops/cy, znver4 = 9 ops/cy) multiplied by the branch misprediction penalty from the op cache (~12cy), as an estimate of the useful number of ops we can unroll a loop by before mispredictions are likely to cause stalls. This isn't a perfect metric, but it tries to be closer to the spirit of how the compiler uses LoopMicroOpBufferSize, rather than to the size of a similarly named buffer in the CPU.
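For concreteness, a minimal sketch of the proposed arithmetic; the rates and the ~12cy penalty are the figures quoted above, and the computed caps are simply their product, not necessarily the exact values committed:

```cpp
#include <cstdio>

// Minimal sketch of the proposed metric: cap LoopMicroOpBufferSize at
// (op cache dispatch rate) * (op cache mispredict penalty), i.e. the
// rough number of ops the op cache can deliver before a likely branch
// mispredict stalls the pipeline. Inputs are the figures from the PR
// text; the resulting caps are illustrative.
static unsigned loopMicroOpBufferCap(unsigned OpCacheOpsPerCycle,
                                     unsigned MispredictPenaltyCycles) {
  return OpCacheOpsPerCycle * MispredictPenaltyCycles;
}

int main() {
  std::printf("znver3 cap: %u uops\n", loopMicroOpBufferCap(8, 12)); // 96
  std::printf("znver4 cap: %u uops\n", loopMicroOpBufferCap(9, 12)); // 108
  return 0;
}
```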