llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.38k stars 12.15k forks source link

Don't hoist NOT's out of loop that will be folded into ANDN/ORN anyway (except sometimes on Arm) #108840

Open Validark opened 2 months ago

Validark commented 2 months ago

Godbolt link

This code:

export fn foo(a: u64, b: u64) u64 {
    var s = a | b;
    var ret: @TypeOf(s) = 0;

    while (true) {
        const iter = s;
        ret |= iter;
        s &= ~((iter +% (iter << 1)) | ((iter << 2) & ~a));
        if (s == 0) break;
    }

    return ret;
}

Results in this emit for Zen 4:

foo:
        or      rsi, rdi
        not     rdi
        xor     eax, eax
.LBB0_1:
        lea     rdx, [4*rsi]
        lea     rcx, [rsi + 2*rsi]
        or      rax, rsi
        and     rdx, rdi; we could have just used `andn`
        or      rdx, rcx
        andn    rsi, rdx, rsi
        jne     .LBB0_1
        ret

As you can see, we hoist not rdi out of the loop, even though we could have used andn. The same situation happens to the Sifive x280 (aggressive unrolling disabled via size-optimized build option):

foo:
        mv      a2, a0
        li      a0, 0
        or      a1, a1, a2
        not     a2, a2
.LBB0_1:
        slli    a3, a1, 2
        sh1add  a4, a1, a1
        and     a3, a3, a2; could have used `andn`
        or      a0, a0, a1
        or      a3, a3, a4
        andn    a1, a1, a3
        bnez    a1, .LBB0_1
        ret

However, on the Apple M3, it actually does make sense to hoist mvn out of the loop in this case, because we can do and x11, x8, x9, lsl #2 but we can't do bic x11, x8, x9, lsl #2 (I assume).

Apple M3 emit:

foo:
        mov     x8, x0
        mov     x0, #0
        orr     x9, x1, x8
        mvn     x8, x8
.LBB0_1:
        orr     x0, x9, x0
        add     x10, x9, x9, lsl #1
        and     x11, x8, x9, lsl #2
        orr     x10, x11, x10
        bics    x9, x9, x10
        b.ne    .LBB0_1
        ret
llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Niles Salter (Validark)

[Godbolt link](https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXABx8BBAKoBnTAAUAHpwAMvAFYTStJg1AAvPMFJL6yAngGVG6AMKpaAVxYM9DgDJ4GmADl3ACNMYgkuUgAHVAVCWwZnNw89GLibAV9/IJZQ8K5Iy0xrBKECJmICJPdPQswrDIYyioIswJCwiItyyuqUwp7Wv3bczoKASgtUV2Jkdg5MVRjKgGoqBjXUVAgmEBXXADYAFlIV4L3Do/H945WAUgB2ACE7jQBBFc%2BVgDcKlYV7gBmAAiKyY9wejjOd0BL3eXx%2Bf2ImAIewAAgAVACeUUwAHkqBAFNcYaCNDC4W9Xh8vgB3BB0TArCAEYiuTAk57UhEItAMBQEFaEMJA0EKCncnmfZGCx6OUlCghhCXwqX/e4AJgOCoAflBhcRNS8NQBWZkGoHywFQrjjTlQ/VKw0wq1QjUkrUrHVMO0qmlSvBUZkA0kKjTXYLIpgAaz9CMewOpkulKJmGxlfoTHEmtE4Jt4ng4WlIqE4AC0zP9prMmXcNYCeKRUUXs5NoyATRoAHQATg0Gg1GoeB1NkgeXBNkVzHCOvBYEn7pELxdLHF4ChAGibmmzpDgsBgiBQqBYUUZZAoEDQp/PIGMrNcDGjfDoTo3EHOLdIwT8FSxnEbH9mGILE8WCbRimbRtrzYQQ8QYWh/y/LBglcYBHDEWgN24XgsBYQxgHEZC8GREpvkwbDi0WYpXCVADeD8JVp2LWg8EjP9nCwHcm2IPB5xwyYqAMYAFAANTwTBaTxXFC0bfhBBEMR2C4A4ZEERQVHUL9dA1fQCLvMx9DYjdIEmVAokabCAFo8RWAAlepMCYJQADFnMFKyemAFEwQqZAEGOKzWIYVxVBWKyWGQKJXFJJgoiiegAH0WEBXhUHI4heKwEyIEmIoSjsCAHD6Twh1IHxhhyPIQAeaJYniAQSpAMq0gahg2iqzpavyxpml6Fwama7rHIKppBg6jpwm6wYmrKgUWgm0Ypry6s5gkHM8wLbjVxWUxgBWbsuCOLsNCsrBvi7Cc6yedAjmCZBjg0QFJGZXBCBITUG3GXhmy0O1SAQJysHCXLSHbI4exOnseweB56weI4uA1SQTX0ThZ1Iech17Y4JwOB4e0kLgCaR3TlzSzh103bcW0mfcj2vM96AvShGdve82SfF9aDfShP2LIC/3o79fxAsCIOsYWYMYAh4MQ7iULQjDaCw4W8IIoji3wUibHIyjeGo5BaPmRtGPqbjWPYkDOPmYtWT4%2BjBOEsSJKkmThfk4RRHEFS1PkJQ1G43Qp30vajOCHKzIshJrNshz6GczA3IFcKvJ81RJAORLAuC0Lwsi6LSWMBgMox9Kwiyij4DykbGnsBgnAG/pyobxbqsiVrGiajv6saNuxgsWvShmpvakHhph4WyrJr0eb%2BuSMe56GbIZ9tKYZjWtfp3zJdts4XbK0O47TvOy6TWu277se57XvwIhnXrW0fp3f7AaYYHKA2mc5xAQFu2HSQkgNAPDHAcHsGojgHFRuTEslMLDU1%2BruemEAkBs2ZuQVmJ4madA5o%2BZ8NAeZhHfPzXggsQLCzIaBcCkEpYnlgrLBCSEtaYFQuhTC2FGzqyMJrXCJFIJ4D1txQ2xthZm2YrwS2xAOIYFtj9Xi/EeBOyYCJcSklpKMA9rIRSPtVKew0oHbSv89JGAMuYS2kcSzRwELHeyjlE5CHwirVOFRvKCl4goZA3wc5%2BDzhFKKMUQRxBoORRKqhkYrgypXCxPUEj10bgvCQukKoryWhIE4ncEjd10hkzI09UlHXHvwgQfUqij1nkPYp408nt26C0butTKj93yCtDeykv47xgTtPaB1LrHzOpgC6V0NQ3Tug9I4T0XoQDevfT6T8aZ/S/hjeciMuxageE9E0gIIGSB7COMme81zwK3Ig/64NIYaGhrDeGiNkao2nKlXeX5VzP1pl/DUP8uCLk6XAk5kwMpxDsEcIAA) This code: ```zig export fn foo(a: u64, b: u64) u64 { var s = a | b; var ret: @TypeOf(s) = 0; while (true) { const iter = s; ret |= iter; s &= ~((iter +% (iter << 1)) | ((iter << 2) & ~a)); if (s == 0) break; } return ret; } ``` Results in this emit for Zen 4: ```asm foo: or rsi, rdi not rdi xor eax, eax .LBB0_1: lea rdx, [4*rsi] lea rcx, [rsi + 2*rsi] or rax, rsi and rdx, rdi; we could have just used `andn` or rdx, rcx andn rsi, rdx, rsi jne .LBB0_1 ret ``` As you can see, we hoist `not rdi` out of the loop, even though we could have used `andn`. The same situation happens to the Sifive x280 (aggressive unrolling disabled via size-optimized build option): ```asm foo: mv a2, a0 li a0, 0 or a1, a1, a2 not a2, a2 .LBB0_1: slli a3, a1, 2 sh1add a4, a1, a1 and a3, a3, a2; could have used `andn` or a0, a0, a1 or a3, a3, a4 andn a1, a1, a3 bnez a1, .LBB0_1 ret ``` However, on the Apple M3, it actually does make sense to hoist `mvn` out of the loop in this case, because we can do `and x11, x8, x9, lsl #2` but we can't do `bic x11, x8, x9, lsl #2` (I assume). Apple M3 emit: ```asm foo: mov x8, x0 mov x0, #0 orr x9, x1, x8 mvn x8, x8 .LBB0_1: orr x0, x9, x0 add x10, x9, x9, lsl #1 and x11, x8, x9, lsl #2 orr x10, x11, x10 bics x9, x9, x10 b.ne .LBB0_1 ret ```
llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-risc-v

Author: Niles Salter (Validark)

[Godbolt link](https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXABx8BBAKoBnTAAUAHpwAMvAFYTStJg1AAvPMFJL6yAngGVG6AMKpaAVxYM9DgDJ4GmADl3ACNMYgkuUgAHVAVCWwZnNw89GLibAV9/IJZQ8K5Iy0xrBKECJmICJPdPQswrDIYyioIswJCwiItyyuqUwp7Wv3bczoKASgtUV2Jkdg5MVRjKgGoqBjXUVAgmEBXXADYAFlIV4L3Do/H945WAUgB2ACE7jQBBFc%2BVgDcKlYV7gBmAAiKyY9wejjOd0BL3eXx%2Bf2ImAIewAAgAVACeUUwAHkqBAFNcYaCNDC4W9Xh8vgB3BB0TArCAEYiuTAk57UhEItAMBQEFaEMJA0EKCncnmfZGCx6OUlCghhCXwqX/e4AJgOCoAflBhcRNS8NQBWZkGoHywFQrjjTlQ/VKw0wq1QjUkrUrHVMO0qmlSvBUZkA0kKjTXYLIpgAaz9CMewOpkulKJmGxlfoTHEmtE4Jt4ng4WlIqE4AC0zP9prMmXcNYCeKRUUXs5NoyATRoAHQATg0Gg1GoeB1NkgeXBNkVzHCOvBYEn7pELxdLHF4ChAGibmmzpDgsBgiBQqBYUUZZAoEDQp/PIGMrNcDGjfDoTo3EHOLdIwT8FSxnEbH9mGILE8WCbRimbRtrzYQQ8QYWh/y/LBglcYBHDEWgN24XgsBYQxgHEZC8GREpvkwbDi0WYpXCVADeD8JVp2LWg8EjP9nCwHcm2IPB5xwyYqAMYAFAANTwTBaTxXFC0bfhBBEMR2C4A4ZEERQVHUL9dA1fQCLvMx9DYjdIEmVAokabCAFo8RWAAlepMCYJQADFnMFKyemAFEwQqZAEGOKzWIYVxVBWKyWGQKJXFJJgoiiegAH0WEBXhUHI4heKwEyIEmIoSjsCAHD6Twh1IHxhhyPIQAeaJYniAQSpAMq0gahg2iqzpavyxpml6Fwama7rHIKppBg6jpwm6wYmrKgUWgm0Ypry6s5gkHM8wLbjVxWUxgBWbsuCOLsNCsrBvi7Cc6yedAjmCZBjg0QFJGZXBCBITUG3GXhmy0O1SAQJysHCXLSHbI4exOnseweB56weI4uA1SQTX0ThZ1Iech17Y4JwOB4e0kLgCaR3TlzSzh103bcW0mfcj2vM96AvShGdve82SfF9aDfShP2LIC/3o79fxAsCIOsYWYMYAh4MQ7iULQjDaCw4W8IIoji3wUibHIyjeGo5BaPmRtGPqbjWPYkDOPmYtWT4%2BjBOEsSJKkmThfk4RRHEFS1PkJQ1G43Qp30vajOCHKzIshJrNshz6GczA3IFcKvJ81RJAORLAuC0Lwsi6LSWMBgMox9Kwiyij4DykbGnsBgnAG/pyobxbqsiVrGiajv6saNuxgsWvShmpvakHhph4WyrJr0eb%2BuSMe56GbIZ9tKYZjWtfp3zJdts4XbK0O47TvOy6TWu277se57XvwIhnXrW0fp3f7AaYYHKA2mc5xAQFu2HSQkgNAPDHAcHsGojgHFRuTEslMLDU1%2BruemEAkBs2ZuQVmJ4madA5o%2BZ8NAeZhHfPzXggsQLCzIaBcCkEpYnlgrLBCSEtaYFQuhTC2FGzqyMJrXCJFIJ4D1txQ2xthZm2YrwS2xAOIYFtj9Xi/EeBOyYCJcSklpKMA9rIRSPtVKew0oHbSv89JGAMuYS2kcSzRwELHeyjlE5CHwirVOFRvKCl4goZA3wc5%2BDzhFKKMUQRxBoORRKqhkYrgypXCxPUEj10bgvCQukKoryWhIE4ncEjd10hkzI09UlHXHvwgQfUqij1nkPYp408nt26C0butTKj93yCtDeykv47xgTtPaB1LrHzOpgC6V0NQ3Tug9I4T0XoQDevfT6T8aZ/S/hjeciMuxageE9E0gIIGSB7COMme81zwK3Ig/64NIYaGhrDeGiNkao2nKlXeX5VzP1pl/DUP8uCLk6XAk5kwMpxDsEcIAA) This code: ```zig export fn foo(a: u64, b: u64) u64 { var s = a | b; var ret: @TypeOf(s) = 0; while (true) { const iter = s; ret |= iter; s &= ~((iter +% (iter << 1)) | ((iter << 2) & ~a)); if (s == 0) break; } return ret; } ``` Results in this emit for Zen 4: ```asm foo: or rsi, rdi not rdi xor eax, eax .LBB0_1: lea rdx, [4*rsi] lea rcx, [rsi + 2*rsi] or rax, rsi and rdx, rdi; we could have just used `andn` or rdx, rcx andn rsi, rdx, rsi jne .LBB0_1 ret ``` As you can see, we hoist `not rdi` out of the loop, even though we could have used `andn`. The same situation happens to the Sifive x280 (aggressive unrolling disabled via size-optimized build option): ```asm foo: mv a2, a0 li a0, 0 or a1, a1, a2 not a2, a2 .LBB0_1: slli a3, a1, 2 sh1add a4, a1, a1 and a3, a3, a2; could have used `andn` or a0, a0, a1 or a3, a3, a4 andn a1, a1, a3 bnez a1, .LBB0_1 ret ``` However, on the Apple M3, it actually does make sense to hoist `mvn` out of the loop in this case, because we can do `and x11, x8, x9, lsl #2` but we can't do `bic x11, x8, x9, lsl #2` (I assume). Apple M3 emit: ```asm foo: mov x8, x0 mov x0, #0 orr x9, x1, x8 mvn x8, x8 .LBB0_1: orr x0, x9, x0 add x10, x9, x9, lsl #1 and x11, x8, x9, lsl #2 orr x10, x11, x10 bics x9, x9, x10 b.ne .LBB0_1 ret ```
llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-x86

Author: Niles Salter (Validark)

[Godbolt link](https://zig.godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXABx8BBAKoBnTAAUAHpwAMvAFYTStJg1AAvPMFJL6yAngGVG6AMKpaAVxYM9DgDJ4GmADl3ACNMYgkuUgAHVAVCWwZnNw89GLibAV9/IJZQ8K5Iy0xrBKECJmICJPdPQswrDIYyioIswJCwiItyyuqUwp7Wv3bczoKASgtUV2Jkdg5MVRjKgGoqBjXUVAgmEBXXADYAFlIV4L3Do/H945WAUgB2ACE7jQBBFc%2BVgDcKlYV7gBmAAiKyY9wejjOd0BL3eXx%2Bf2ImAIewAAgAVACeUUwAHkqBAFNcYaCNDC4W9Xh8vgB3BB0TArCAEYiuTAk57UhEItAMBQEFaEMJA0EKCncnmfZGCx6OUlCghhCXwqX/e4AJgOCoAflBhcRNS8NQBWZkGoHywFQrjjTlQ/VKw0wq1QjUkrUrHVMO0qmlSvBUZkA0kKjTXYLIpgAaz9CMewOpkulKJmGxlfoTHEmtE4Jt4ng4WlIqE4AC0zP9prMmXcNYCeKRUUXs5NoyATRoAHQATg0Gg1GoeB1NkgeXBNkVzHCOvBYEn7pELxdLHF4ChAGibmmzpDgsBgiBQqBYUUZZAoEDQp/PIGMrNcDGjfDoTo3EHOLdIwT8FSxnEbH9mGILE8WCbRimbRtrzYQQ8QYWh/y/LBglcYBHDEWgN24XgsBYQxgHEZC8GREpvkwbDi0WYpXCVADeD8JVp2LWg8EjP9nCwHcm2IPB5xwyYqAMYAFAANTwTBaTxXFC0bfhBBEMR2C4A4ZEERQVHUL9dA1fQCLvMx9DYjdIEmVAokabCAFo8RWAAlepMCYJQADFnMFKyemAFEwQqZAEGOKzWIYVxVBWKyWGQKJXFJJgoiiegAH0WEBXhUHI4heKwEyIEmIoSjsCAHD6Twh1IHxhhyPIQAeaJYniAQSpAMq0gahg2iqzpavyxpml6Fwama7rHIKppBg6jpwm6wYmrKgUWgm0Ypry6s5gkHM8wLbjVxWUxgBWbsuCOLsNCsrBvi7Cc6yedAjmCZBjg0QFJGZXBCBITUG3GXhmy0O1SAQJysHCXLSHbI4exOnseweB56weI4uA1SQTX0ThZ1Iech17Y4JwOB4e0kLgCaR3TlzSzh103bcW0mfcj2vM96AvShGdve82SfF9aDfShP2LIC/3o79fxAsCIOsYWYMYAh4MQ7iULQjDaCw4W8IIoji3wUibHIyjeGo5BaPmRtGPqbjWPYkDOPmYtWT4%2BjBOEsSJKkmThfk4RRHEFS1PkJQ1G43Qp30vajOCHKzIshJrNshz6GczA3IFcKvJ81RJAORLAuC0Lwsi6LSWMBgMox9Kwiyij4DykbGnsBgnAG/pyobxbqsiVrGiajv6saNuxgsWvShmpvakHhph4WyrJr0eb%2BuSMe56GbIZ9tKYZjWtfp3zJdts4XbK0O47TvOy6TWu277se57XvwIhnXrW0fp3f7AaYYHKA2mc5xAQFu2HSQkgNAPDHAcHsGojgHFRuTEslMLDU1%2BruemEAkBs2ZuQVmJ4madA5o%2BZ8NAeZhHfPzXggsQLCzIaBcCkEpYnlgrLBCSEtaYFQuhTC2FGzqyMJrXCJFIJ4D1txQ2xthZm2YrwS2xAOIYFtj9Xi/EeBOyYCJcSklpKMA9rIRSPtVKew0oHbSv89JGAMuYS2kcSzRwELHeyjlE5CHwirVOFRvKCl4goZA3wc5%2BDzhFKKMUQRxBoORRKqhkYrgypXCxPUEj10bgvCQukKoryWhIE4ncEjd10hkzI09UlHXHvwgQfUqij1nkPYp408nt26C0butTKj93yCtDeykv47xgTtPaB1LrHzOpgC6V0NQ3Tug9I4T0XoQDevfT6T8aZ/S/hjeciMuxageE9E0gIIGSB7COMme81zwK3Ig/64NIYaGhrDeGiNkao2nKlXeX5VzP1pl/DUP8uCLk6XAk5kwMpxDsEcIAA) This code: ```zig export fn foo(a: u64, b: u64) u64 { var s = a | b; var ret: @TypeOf(s) = 0; while (true) { const iter = s; ret |= iter; s &= ~((iter +% (iter << 1)) | ((iter << 2) & ~a)); if (s == 0) break; } return ret; } ``` Results in this emit for Zen 4: ```asm foo: or rsi, rdi not rdi xor eax, eax .LBB0_1: lea rdx, [4*rsi] lea rcx, [rsi + 2*rsi] or rax, rsi and rdx, rdi; we could have just used `andn` or rdx, rcx andn rsi, rdx, rsi jne .LBB0_1 ret ``` As you can see, we hoist `not rdi` out of the loop, even though we could have used `andn`. The same situation happens to the Sifive x280 (aggressive unrolling disabled via size-optimized build option): ```asm foo: mv a2, a0 li a0, 0 or a1, a1, a2 not a2, a2 .LBB0_1: slli a3, a1, 2 sh1add a4, a1, a1 and a3, a3, a2; could have used `andn` or a0, a0, a1 or a3, a3, a4 andn a1, a1, a3 bnez a1, .LBB0_1 ret ``` However, on the Apple M3, it actually does make sense to hoist `mvn` out of the loop in this case, because we can do `and x11, x8, x9, lsl #2` but we can't do `bic x11, x8, x9, lsl #2` (I assume). Apple M3 emit: ```asm foo: mov x8, x0 mov x0, #0 orr x9, x1, x8 mvn x8, x8 .LBB0_1: orr x0, x9, x0 add x10, x9, x9, lsl #1 and x11, x8, x9, lsl #2 orr x10, x11, x10 bics x9, x9, x10 b.ne .LBB0_1 ret ```