codegen issue for vectors

bartek-siudeja commented 2 years ago

It seems that ldc2 has a bit of trouble with IR generation for vectorized computations. I believe this could be connected to this issue in the intel-intrinsics library: https://github.com/AuburnSounds/intel-intrinsics/issues/86. Godbolt example: https://godbolt.org/z/aKh4rWY83

These two functions compile to the same single vector sqrt CPU instruction (on ARM and Intel targets), as expected:

auto sqrt(double2 a)
{
    a.ptr[0] = llvm_sqrt(a.array[0]);
    a.ptr[1] = llvm_sqrt(a.array[1]);
    return a;
}
auto sqrt2(double2 a)
{
    return __irEx_pure!(
        `declare <2 x double> @llvm.sqrt.v2f64(<2 x double> %Val)`,
        `%r = call <2 x double> @llvm.sqrt.v2f64(<2 x double> %0)
        ret <2 x double> %r`, "", double2)(a);
}

But when I use them in another function that handles partial vectors, the first one causes problems:

auto get_low(double4 d4)
{
    return __ir_pure!(
        `%r = shufflevector <4 x double> %0, <4 x double> undef,
                    <2 x i32> <i32 0, i32 1>
        ret <2 x double> %r`, double2)(d4);
}
auto sqrt(double4 d4)
{
    return sqrt(get_low(d4));
}
auto sqrt2(double4 d4)
{
    return sqrt2(get_low(d4));
}

Now we get:

pure nothrow @nogc __vector(double[2]) example.sqrt(__vector(double[4])):
        fsqrt   d1, d0
        mov     d0, v0.d[1]
        fsqrt   d0, d0
        mov     v1.d[1], v0.d[0]
        mov     v0.16b, v1.16b
        ret

pure nothrow @nogc @safe __vector(double[2]) example.sqrt2(__vector(double[4])):
        fsqrt   v0.2d, v0.2d
        ret

This issue seems quite problematic for aarch64 (or any target without 256-bit vectors), where double4 is really two double2 registers. Yet it does not seem possible to take a function that operates on double2 and apply it to each of the two halves of a double4 without the code being scalarized.

kinke commented 2 years ago

Note that llvm_sqrt supports vectors, so this should be as simple as:

import ldc.intrinsics : llvm_sqrt;
import core.simd;

auto sqrt2(double2 x) { return llvm_sqrt(x); }
auto sqrt4(double4 x) { return llvm_sqrt(x); }

https://run.dlang.io/is/A9OILJ

bartek-siudeja commented 2 years ago

Oh, I was using sqrt just as an example. I also need log/exp, and they exist in a nice SSE form. I was thinking about a generic template like:

auto promote(alias fun)(double4 d4)
{
    double2 low = get_low(d4);
    double2 high = get_high(d4);
    low = fun(low);
    high = fun(high);
    return combine(high, low);
}

There should be no reason for ldc to break up double2 vectors and switch to per-element operations. And this does not happen when inline IR is used.
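For reference, a minimal sketch of how the template above could be completed using the same inline-IR approach as get_low. The get_high and combine helpers here are hypothetical (the thread does not show them), and this combine takes the low half first, which is an assumption about its signature. sqrt2 just wraps llvm_sqrt on a double2. This needs LDC, since ldc.llvmasm and ldc.intrinsics are LDC-specific.

```d
import core.simd;
import ldc.intrinsics : llvm_sqrt;
import ldc.llvmasm : __ir_pure;

// Lower half (elements 0 and 1), same as get_low earlier in the thread.
double2 get_low(double4 d4)
{
    return __ir_pure!(
        `%r = shufflevector <4 x double> %0, <4 x double> undef,
                    <2 x i32> <i32 0, i32 1>
        ret <2 x double> %r`, double2)(d4);
}

// Hypothetical helper: upper half (elements 2 and 3).
double2 get_high(double4 d4)
{
    return __ir_pure!(
        `%r = shufflevector <4 x double> %0, <4 x double> undef,
                    <2 x i32> <i32 2, i32 3>
        ret <2 x double> %r`, double2)(d4);
}

// Hypothetical helper: concatenates two halves, low half first (assumed order).
double4 combine(double2 low, double2 high)
{
    return __ir_pure!(
        `%r = shufflevector <2 x double> %0, <2 x double> %1,
                    <4 x i32> <i32 0, i32 1, i32 2, i32 3>
        ret <4 x double> %r`, double4)(low, high);
}

// Applies a double2 function to each half of a double4.
double4 promote(alias fun)(double4 d4)
{
    return combine(fun(get_low(d4)), fun(get_high(d4)));
}

// Example double2 function to promote.
double2 sqrt2(double2 x) { return llvm_sqrt(x); }

unittest
{
    double4 v = [1.0, 4.0, 9.0, 16.0];
    double4 r = promote!sqrt2(v);
    assert(r.array == [1.0, 2.0, 3.0, 4.0]);
}
```

The unittest only checks the results, not the generated code (build with something like ldc2 -unittest -main). Whether the halves stay in vector registers still has to be confirmed by inspecting the assembly for the target in question.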

bartek-siudeja commented 2 years ago

All is good if there is a vectorized intrinsic for something, but if not, we can run into something like this issue: https://github.com/AuburnSounds/intel-intrinsics/issues/86 (the code in question: https://github.com/AuburnSounds/intel-intrinsics/blob/master/source/inteli/emmintrin.d#L1222). Somehow the optimizer forgets that there is a vector already and tries to dig inside the loop over vector entries.

kinke commented 2 years ago

I fail to see the problem in the generated IR for sqrt(double2). The scalar access goes through a reinterpret-as-array-then-index GEP which might trip up the optimizer, but that makes it clearly an LLVM issue IMO.

bartek-siudeja commented 2 years ago

Ah, maybe this is an LLVM issue. This is quite far over my head, I am afraid. The end result for double4 is weird. How should I proceed? What would be a good piece of LLVM IR to submit as an LLVM issue?

Clang does not have a similar issue, according to this: https://github.com/AuburnSounds/intel-intrinsics/issues/86#issuecomment-997116210

ldc-developers / ldc

codegen issue for vectors #3950