rust-lang / stdarch

Rust's standard library vendor-specific APIs and run-time feature detection
https://doc.rust-lang.org/stable/core/arch/
Apache License 2.0
600 stars 265 forks source link

Implement AVX-512 intrinsics #310

Open alexcrichton opened 6 years ago

alexcrichton commented 6 years ago

General instructions for this can be found at https://github.com/rust-lang-nursery/stdsimd/issues/40, but the list of AVX-512 intrinsics is quite large! This is intended to help track progress but you'll likely want to talk to us out of band to ensure that everything is coordinated.

Intrinsic lists: https://gist.github.com/alexcrichton/3281adb58af7f465cebee49759ae3164

gnzlbg commented 5 years ago

Usually at the beginning of each LLVM and clang test file there is a comment stating how to compile the test. These comments contain the features that have to be enabled for the test to compile, so you can infer which feature you need from there.

dignifiedquire commented 5 years ago

I made some progress, but am now hanging on another llvm select failure. I described the issues as best as I could here: https://gcc.godbolt.org/z/-EjC7Y and the code that I have been working on is here: https://github.com/dignifiedquire/stdsimd/blob/feat/avx512f/crates/core_arch/src/x86/avx512f/arith/min.rs

alexcrichton commented 5 years ago

Hm this may have to do with constant arguments perhaps? Without actually knowing much about the instruction, we have a lot of intrinsics that require constant arguments elsewhere in the codebase. LLVM may fail to generate an instruction if an argument to the intrinsic is not a constant. This is what the constify_* macros are doing throughout the codebase.

This may be a case where a constify_* macro is needed perhaps? Along with a #[rustc_args_required_const] attribute as well perhaps.

dignifiedquire commented 4 years ago

@alexcrichton looked into this again, and I think the issue boils down to that constify_* is not able to make two parameters constant, which these intrinsics seem to require. Trying to nest the macro invocations, seems to explode, and probably not very deseriable, given the level of expansion this seems to imply.

Daniel-B-Smith commented 4 years ago

Coming in to this cold, I would like to be able to use _mm512_cmpgt_epu64_mask and _mm512_cmpgt_epu128_mask. I'm more than happy to put in the work to add a group fo the comparison instrinsics, but I would need some guidance to get started. Where do things stand with trying to add types like _mmask8?

Amanieu commented 4 years ago

We already have a __mmask16 type in crates/core_arch/src/x86/mod.rs. You could add __mmask8 next to it.

Daniel-B-Smith commented 4 years ago

I'm running into an issue trying to link the floating point comparison intrinsics from LLVM. The broken commit is here: https://github.com/rust-lang/stdarch/pull/869/commits/16386aed76ad370d90b324f8174b86ceea2c0399. The error I'm getting is:

LLVM ERROR: Cannot select: intrinsic %llvm.x86.avx512.cmp.ps.512

I assumed the problem was arguments being insufficiently const, but using this macro didn't fix anything: https://github.com/rust-lang/stdarch/pull/869/commits/6b389b4a855af43309050c847ca6f74b247b8222. For reference, here are the LLVM definitions. The former is what I'm using directly and the latter is what is in the error message:

declare i16 @llvm.x86.avx512.mask.cmp.ps.512(<16 x float> , <16 x float> , i32, i16, i32)
declare <16 x i1> @llvm.x86.avx512.cmp.ps.512(<16 x float>, <16 x float>, i32, i32) 
Amanieu commented 4 years ago

Have you checked to see what clang does for this intrinsic?

Amanieu commented 4 years ago

I can't reproduce your error on that commit. I get this error instead:

---- verify_all_signatures stdout ----
missing run-time test named `test_mm512_cmplt_ps_mask` for `_mm512_cmplt_ps_mask`
missing run-time test named `test_mm512_mask_cmplt_ps_mask` for `_mm512_mask_cmplt_ps_mask`
missing run-time test named `test_mm512_cmpgt_ps_mask` for `_mm512_cmpgt_ps_mask`
thread 'verify_all_signatures' panicked at 'missing intel definition for _mm512_cmpgt_ps_mask', crates/stdarch-verify/tests/x86-intel.rs:336:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Daniel-B-Smith commented 4 years ago

The error does reproduce on CI: https://github.com/rust-lang/stdarch/pull/869/checks?check_run_id=770449165

Clang implements it with codegen: https://github.com/llvm/llvm-project/blob/master/clang/lib/CodeGen/CGBuiltin.cpp#L13487 AFAIU, our existing simd codegen only handles integers, and it certainly doesn't handle rounding modes. The existing AVX intrinsics were implemented by linking in the llvm intrinsics similar to the code I added. I can certainly try my hand at adding new simd codegen to rustc, but that will probably take a minute.

Daniel-B-Smith commented 4 years ago

After additional work, I realized that the problem was that I had the constification wrong. I had tried constifying both const args reaching out but had apparently gotten it wrong. The PR isn't finished, but the comparisons are linking properly now.

minybot commented 4 years ago

Hi, I try to implement _mm512_and_epi32 in crates/core_arch/src/x86/avx512f.rs

pub unsafe fn _mm512_and_epi32(a: m512i, b: m512i) -> __m512i { let r = vpandd(a.as_i32x16(), b.as_i32x16()); transmute(r) }

[link_name = "llvm.x86.avx512.mask.pand.d.512"]

fn vpandd(a: i32x16, b: i32x16) -> i32x16;

The test is unsafe fn test_mm512_and_epi32() { let a = _mm512_set_epi32(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1); let b = _mm512_set_epi32(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1);

    let r = _mm512_and_epi32(a, b);
    let e = _mm512_set_epi32(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1);

    assert_eq_m512i(r, e);
}

When I run cargo test, it shows "(signal: 11, SIGSEGV: invalid memory reference)" Any ideas?

I tried to compile _mm512_and_epi32 with clang, and it works.

Amanieu commented 4 years ago

I suggest using a debugger to look at the disassembly of the crashing code.

minybot commented 4 years ago

Hi, I try to implement _mm512_and_epi32 in crates/core_arch/src/x86/avx512f.rs

pub unsafe fn _mm512_and_epi32(a: m512i, b: m512i) -> __m512i { let r = vpandd(a.as_i32x16(), b.as_i32x16()); transmute(r) }

Update: After modifying it to transmute(simd_and(a.as_i32x16(), b.as_i32x16())) It works.

The rustc generate vpandd instruction.

minybot commented 4 years ago

I try to implement _mm512_cvt_roundps_ph (__m512 a, int sae). The document describes: 'Exceptions can be suppressed by passing _MM_FROUND_NO_EXC in the sae parameter.' As my understanding, sae should only be '_MM_FROUND_NO_EXC (0x08)' or '_MM_FROUND_CUR_DIRECTION (0x04)' However, Clang accepts the sae parameters from 0 to 255.

Should we follow the clang or only accept 4 and 8?

Amanieu commented 4 years ago

I checked both Clang and GCC and they both pass the full 8 bits on to the underlying instruction: https://www.felixcloutier.com/x86/vcvtps2ph

minybot commented 4 years ago

I checked both Clang and GCC and they both pass the full 8 bits on to the underlying instruction: https://www.felixcloutier.com/x86/vcvtps2ph

Ok. Thanks. The document I checked is https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=512_cvt_roundps_ph&expand=1354

minybot commented 3 years ago

I try to implement _mm512_mask_extractf32x4_ps (m128 src, mmask8 k, __m512 a, int imm8) The input mask is u8. However, it uses u4 only. FOR j := 0 to 3 i := j*32 IF k[j] dst[i+31:i] := tmp[i+31:i] ELSE dst[i+31:i] := src[i+31:i] FI ENDFOR

The simd_select_bitmask(mask, extract, src) shows mismatched lengths: mask length 8 != other vector length 4.

My question is I should implement a u4 or otherwise?

Amanieu commented 3 years ago

You can just mask to keep only the bottom 4 bits and use constify_imm4.

bjorn3 commented 3 years ago

simd_select_bitmask really requires the mask to be a non-existent 4bit integer type. This is a bug in rustc.

minybot commented 3 years ago

simd_select_bitmask really requires the mask to be a non-existent 4bit integer type. This is a bug in rustc.

Is any plan to support 4bit or 2 bit integer type in the future? Clang has i1x4 to support extract with 4 bit mask. "%3 = select <4 x i1> %extract, <4 x i32> %shuffle, <4 x i32> %1"

AVX512F uses a lot of 4bit(32x4) or 2bit(64x2) masks on _mm_mask_xxxxx instructions which inputs and outputs are 128 bit.

Amanieu commented 3 years ago

I had a look in the compiler and it seems that this is a bug in the implementation of simd_select_bitmask: it should accept u8 inputs when the number of lanes is less than 8. simd_bitmask already supports this by returning u8 when the number of lanes is less than 8.

@minybot @bjorn3 Would one of you be willing to make a PR to fix this in rustc? The relevant code is here: https://github.com/rust-lang/rust/blob/f3c923a13a458c35ee26b3513533fce8a15c9c05/compiler/rustc_codegen_llvm/src/intrinsic.rs#L1272

minybot commented 3 years ago

I had a look in the compiler and it seems that this is a bug in the implementation of simd_select_bitmask: it should accept u8 inputs when the number of lanes is less than 8. simd_bitmask already supports this by returning u8 when the number of lanes is less than 8.

I try to modify simd_select_bitmask to use 4bit mask if the output is f32x"4"

    if mask_len > output_len { mask_len = output_len; }
    ...
    let i1 = bx.type_i1();
    let i1xn = bx.type_vector(i1, mask_len);
    let m_i1s = bx.trunc(args[0].immediate(), i1xn);
    return Ok(bx.select(m_i1s, args[1].immediate(), args[2].immediate()));

However, it shows "error: failed to parse bitcode for LTO module: Bitwidth for integer type out of range (Producer: 'LLVM11.0.0-rust-dev' Reader: 'LLVM 11.0.0-rust-dev')"

So, bx.select is only accept 8bit or more?

bjorn3 commented 3 years ago

You can't truncate i8 to i1 x 4. You have to truncate to i4 and then bitcast to i1 x 4 I think.

minybot commented 3 years ago

I had a look in the compiler and it seems that this is a bug in the implementation of simd_select_bitmask: it should accept u8 inputs when the number of lanes is less than 8. simd_bitmask already supports this by returning u8 when the number of lanes is less than 8.

@minybot @bjorn3 Would one of you be willing to make a PR to fix this in rustc? The relevant code is here: https://github.com/rust-lang/rust/blob/f3c923a13a458c35ee26b3513533fce8a15c9c05/compiler/rustc_codegen_llvm/src/intrinsic.rs#L1272

There is another solution without touching simd_select_bitmask. Use cast. Take _mm512_mask_extractf32x4_ps (m128 src, mmask8 k, __m512 a, int imm8) as an example. a->(32x4); Cast to (32x16); Cast to (32x8); do bitmask; Cast to (32x4). There is no cast128_to_256 directly. only 128_to_512, 512_to_256. 512_to_128.

Amanieu commented 3 years ago

I just went ahead and fixed the issue in https://github.com/rust-lang/rust/pull/77504.

minybot commented 3 years ago

I just went ahead and fixed the issue in rust-lang/rust#77504.

I test it, and it works when the mask size is 4.

minybot commented 3 years ago

For Mask operation in avx512 such as _kadd_mask32, it adds two masks. According to https://travisdowns.github.io/blog/2019/12/05/kreg-facts.html, the Mask has its own hardware register. Is there anyway to make sure _kadd_mask32 will generate "kaddd" instruction?

Amanieu commented 3 years ago

No, but it's fine since we don't guarantee a particular instruction is used for an intrinsic: we leave it to LLVM to decide whether it is better to use a kadd instruction or a normal add instruction.

stopbystudent commented 3 years ago

While working on a private project, I needed masked loading, so I wanted to prepare a PR with implementations for _mm512_mask_load_epi32 and the like. Reading https://github.com/rust-lang/stdarch/blob/master/crates/core_arch/avx512f.md, I found the following:

  • [ ] _mm512_mask_load_epi32 //need i1
  • [ ] _mm512_maskz_load_epi32 //need i1

What is the "need i1" part? I have not found any explanation there.

Currently, I am tempted to implement masked loading like in (as an example)

/// Load packed 32-bit integers from memory into dst using writemask k (elements are copied from src when the corresponding mask bit is not set). mem_addr must be aligned on a 64-byte boundary or a general-protection exception may be generated.
///
/// [Intel's documentation](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_mask_load_epi32&expand=3305)
#[inline]
#[target_feature(enable = "avx512f")]
#[cfg_attr(test, assert_instr(vmovdqa32))]
pub unsafe fn _mm512_mask_load_epi32(src: __m512i, k: __mmask16, mem_addr: *const i32) -> __m512i {
    let loaded = ptr::read(mem_addr as *const __m512i).as_i32x16();
    let src = src.as_i32x16();
    transmute(simd_select_bitmask(k, loaded, src))
}

which follows how _mm512_maskz_mov_epi32 and _mm512_load_epi32 are implemented. If this sounds correct, I might make a PR in the next days.

Amanieu commented 3 years ago

This is incorrect since _mm512_mask_load_epi32 must not cause page faults on the parts of the vector that are masked off. Your version will still cause these page faults.

To support this properly we need to call an LLVM intrinsic directly. However this intrinsic uses a vector of i1 as argument, which we cannot represent with Rust types. We need additional support in the compiler to call LLVM intrinsics that take a vector of i1 as a parameter.

stopbystudent commented 3 years ago

Makes sense. Many thanks for the explanation.

jhorstmann commented 2 years ago

Another possible implementation for _mm512_mask_load_epi32 would using the asm feature. I have successfully used the following implementation:

#[inline]
pub unsafe fn _mm512_mask_loadu_epi32(src: __m512i, mask: __mmask16, ptr: *const i32) -> __m512i {
    let mut result: __m512i = src;

    asm!(
    "vmovdqu32 {io}{{{k}}}, [{p}]",
    p = in(reg) ptr,
    k = in(kreg) mask,
    io = inout(zmm_reg) result,
    options(nostack), options(pure), options(readonly)
    );

    result
}

If such an implementation would be ok maintenance wise I could try preparing a PR that adds the missing avx512f this way.

Amanieu commented 2 years ago

If such an implementation would be ok maintenance wise I could try preparing a PR that adds the missing avx512f this way.

Sounds good!

mert-kurttutan commented 1 month ago

Just coming from the discussion: https://github.com/rust-lang/portable-simd/issues/28.

Regarding the separation of avx512f intrinsics and and target_feature=avx512f, now, I have enough interest and time to investigate it. My particular case of interest is using zmm_reg for inline assembly (so need for avx512f intrinsics), but target_feature=avx512f is not stable yet. If it helps the stabilisation of target_feature, I am willing to work on it under some guidance. @Amanieu What do you think?

Amanieu commented 1 month ago

I expect that we will be stabilizing AVX-512 soon, thanks to the hard work of many people in implementing the full set of AVX-512 intrinsics in stdarch.