rust-lang / stdarch

Rust's standard library vendor-specific APIs and run-time feature detection
https://doc.rust-lang.org/stable/core/arch/
Apache License 2.0
605 stars 267 forks source link

Implement all ARM NEON intrinsics #148

Open gnzlbg opened 7 years ago

gnzlbg commented 7 years ago

Steps for implementing an intrinsic:

All unimplemented NEON intrinsics

gnzlbg commented 7 years ago
oconnor663 commented 5 years ago

Is there a blocker for these, or is it just finding time to do it? I'd like to help, but I'd need a more experienced compiler/SIMD person to point me in the right direction.

gnzlbg commented 5 years ago

I can mentor. Start by taking a look at some of the intrinsics in the coresimd/aarch64/neon.rs module :)

oconnor663 commented 5 years ago

Is there some upstream source that these all get copied from, or are they actually written by hand?

gnzlbg commented 5 years ago

I am not sure I understand the question ? The neon modules in this repository are written by hand, although @Amanieu has expressed interest into generating some parts of them automatically.

oconnor663 commented 5 years ago

Sorry yeah that was unclear. The other thing I wanted to ask was, what's the upstream Source Of Truth for defining these functions?

On Fri, Nov 16, 2018, 11:57 AM gnzlbg <notifications@github.com wrote:

I am not sure I understand the question ? The neon modules in this repository are written by hand, although @Amanieu https://github.com/Amanieu has expressed interest into generating some parts of them automatically.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/rust-lang-nursery/stdsimd/issues/148#issuecomment-439457484, or mute the thread https://github.com/notifications/unsubscribe-auth/AA0jBJQbXX6E4XFv28wWNuDfvUYpY6mDks5uvu5ugaJpZM4QEQB6 .

gnzlbg commented 5 years ago

Ah, I see, that would be the ARM NEON spec: https://developer.arm.com/technologies/neon/intrinsics

alexcrichton commented 5 years ago

Now might be a great time to help make some more progress on this! We've got tons of intrinsics already implemented (thanks @gnzlbg!), and I've just implemented automatic verification of all added intrinsics, so we know if they're added they've got the correct signature at least!

I've updated the OP of this issue with more detailed instructions about how to bind NEON intrinsics. Hopefully it's not too bad any more!

We'll probably want to reorganize modules so they're a bit smaller and more manageable over time, but for now if anyone's interested to add more intrinsics and needs some help let me know!

valpackett commented 5 years ago

more manageable

I have a proposal for this: using a macro to make definitions one-line e.g.:

neon_op!(binary vadd_s8 : int8x8_t == simd_add, assert vadd / add, doc "Vector add");
neon_op!(binary vaddl_s8 : int8x8_t -> int16x8_t == simd_add, assert vaddl / saddl, doc "Vector long add");
neon_op!(unary vmovn_s16 : int16x8_t -> int8x8_t == simd_cast, assert vmovn / xtn, doc "Vector narrow integer");

This will make adding new ones easier (scrolling through a bolierplate-filled file just feels awful), and I'll add a lot more simd_sub simd_mul simd_lt etc. ones. Would this be accepted?

macro definition I currently have ```rust macro_rules! neon_op { (binary $name:ident : $type:ident == $op:ident, assert $instr32:ident / $instr64:ident, doc $doc:literal) => { #[inline] #[doc = $doc] #[target_feature(enable = "neon")] #[cfg_attr(target_arch = "arm", target_feature(enable = "v7"))] #[cfg_attr(all(test, target_arch = "arm"), assert_instr($instr32))] #[cfg_attr(all(test, target_arch = "aarch64"), assert_instr($instr64))] pub unsafe fn $name(a: $type, b: $type) -> $type { $op(a, b) } }; (binary $name:ident : $type:ident -> $result_type:ident == $op:ident, assert $instr32:ident / $instr64:ident, doc $doc:literal) => { #[inline] #[doc = $doc] #[target_feature(enable = "neon")] #[cfg_attr(target_arch = "arm", target_feature(enable = "v7"))] #[cfg_attr(all(test, target_arch = "arm"), assert_instr($instr32))] #[cfg_attr(all(test, target_arch = "aarch64"), assert_instr($instr64))] pub unsafe fn $name(a: $type, b: $type) -> $result_type { let a: $result_type = simd_cast(a); let b: $result_type = simd_cast(b); $op(a, b) } }; (unary $name:ident : $type:ident -> $result_type:ident == $op:ident, assert $instr32:ident / $instr64:ident, doc $doc:literal) => { #[inline] #[doc = $doc] #[target_feature(enable = "neon")] #[cfg_attr(target_arch = "arm", target_feature(enable = "v7"))] #[cfg_attr(all(test, target_arch = "arm"), assert_instr($instr32))] #[cfg_attr(all(test, target_arch = "aarch64"), assert_instr($instr64))] pub unsafe fn $name(a: $type) -> $result_type { $op(a) } }; } ```
gnzlbg commented 5 years ago

For the definitions, I think that using macros is ok.

I am not sure I follow how does macros generate run-time tests for the intrinsics, that's usually the bulk of the work.

aloucks commented 4 years ago

What is the reasoning behind some intrinsics linking in the LLVM intrinsic directly while others are using the generic simd_XXX functions?

For example:

https://github.com/rust-lang/stdarch/blob/a3710694f94f13758799996e981dd2cc1069a514/crates/core_arch/src/arm/neon/generated.rs#L1416-L1430

Versus:

https://github.com/rust-lang/stdarch/blob/a3710694f94f13758799996e981dd2cc1069a514/crates/core_arch/src/aarch64/neon/generated.rs#L12-L18

Given the sheer volume of neon intrinsics, it seems rather daunting to implement them all by hand using the guide in the first post. I'm wondering if there's a deterministic data driven way to generate all of them using #[link_name = "llvm.*"] as done in the first example. Maybe the llvm c headers could be useful?

bjorn3 commented 4 years ago

What is the reasoning behind some intrinsics linking in the LLVM intrinsic directly while others are using the generic simd_XXX functions?

Not all intrinsics have a corresponding simd_* platform-intrinsic.

I'm wondering if there's a deterministic data driven way to generate all of them using #[link_name = "llvm.*"] as done in the first example. Maybe the llvm c headers could be useful?

Please don't. The simd_* platform intrinsics are much easier to implement in alternative codegen backends than the llvm intrinsics, as they are generic over vector types and they are backend independent.

alexcrichton commented 4 years ago

@aloucks most of the intrinsics (AFAIK) have been added piecemeal over time, so it's sort of expected that they're not 100% consistent. Otherwise though I'd imagine that whatever works best would be fine to add to this repository. Auto-generation sounds pretty reasonable to me, and for an implementation we strive to match what Clang does in its implementation of these intrinsics.

alexcrichton commented 4 years ago

Also, to be clear, this library is not designed for ease of implementation in alternate codegen backends. The purpose of this crate is to get the LLVM backend up and running with SIMD. Discussions and design constraints for alternate backends should be discussed in a separate issue.

Lokathor commented 4 years ago

Hey all, some friends and I have made a google sheet of all the Neon intrinsics, their inputs, output, and the ARM summary comment.

There could easily have been errors when copying around and manipulating thousands of entries of text, but I think that it's got all the bugs sorted out.

If you want to try some auto-generation, this is a good place to start. There's even a column where I've marked what we have in nightly so far, so if you just auto-gen all the functions that aren't checked you shouldn't hit any duplicate definitions.

I hope to find the time to actually contribute some functions, but for now this will have to do.

EDIT: also I just subscribed to the entire repo, so if there's any PRs that add more functions I'll try to check those boxes on the sheet and keep it up to date.

Lokathor commented 4 years ago

Working with @shnatsel, I described the "godbolt process" and they were kind enough to make it a bash script that you can run locally

#!/bin/bash
set -e
INTRINSIC_NAME="$1"
TEMP_DIR="$(mktemp -d)"
cleanup() {
    rm -r "$TEMP_DIR"
}
trap cleanup EXIT
(
cd "$TEMP_DIR"
echo "#include <arm_neon.h>
int test() {
  return (int) $INTRINSIC_NAME;
}" > ./in.c

clang -emit-llvm -O2 -S -target armv7-unknown-linux-gnueabihf -g0 in.c
ARM_NAME=$(grep --only-matching '@llvm.arm.neon.[A-Za-z0-9.]\+' ./*.ll | tr -d '@' | head -n 1)

clang -emit-llvm -O2 -S -target aarch64-unknown-linux-gnu -g0 in.c
AARCH64_NAME=$(grep --only-matching '@llvm.aarch64.neon.[A-Za-z0-9.]\+' ./*.ll | tr -d '@' | head -n 1)

echo "$INTRINSIC_NAME, $ARM_NAME, $AARCH64_NAME"
)

You will probably need the gcc-multilib package or similar installed so that the correct headers are available.

Note that many functions don't have an associated llvm intrinsic that can be as easily scrapped out this way, but maybe 1/4th or so of them do.

SparrowLii commented 3 years ago

@Lokathor Several instructions have been added recently: vaddhn, vbic, vorn, vceqz, vtst, vabd, vaba. Though some of them are not fully supported( like vceqzd). If you don’t have time to maintain this google sheet, I think I can help

nano-bot commented 3 years ago

@Lokathor Several instructions have been added recently: vaddhn, vbic, vorn, vceqz, vtst, vabd, vaba. Though some of them are not fully supported( like vceqzd). If you don’t have time to maintain this google sheet, I think I can help

Awesome, looking forward to this!

fzyzcjy commented 3 years ago

Any updates after a long time...? Thanks

bjorn3 commented 3 years ago

If you look at the pull request list you can see that there has been activity on this quite recently. For example https://github.com/rust-lang/stdarch/pull/1224 was opened yesterday.

fzyzcjy commented 3 years ago

@bjorn3 Thanks! Indeed I mostly want to know when can we see it in stable version. By the way do you suggest use nightly in production environment? If so I can use it now.

CryZe commented 3 years ago

@SparrowLii You marked the following instructions as completed (same for min):

https://i.imgur.com/OipjDCy.png

It doesn't seem like those instructions are actually part of your recent PR (nor were they on the master branch before that) so I unmarked them again.

SparrowLii commented 3 years ago

@CryZe They can be found in the master branch now: https://github.com/rust-lang/stdarch/blob/master/crates/core_arch/src/aarch64/neon/generated.rs#L8519-L8539 https://github.com/rust-lang/stdarch/blob/master/crates/core_arch/src/aarch64/neon/generated.rs#L8545-L8565 Sorry I marked them before #1230 merged, this is to prevent others from submitting duplicate PRs

CryZe commented 3 years ago

Welp, I'll mark them again then. Somehow the GitHub Pull Request UI doesn't show them as diffs at all: https://i.imgur.com/BsHR5in.gif

SparrowLii commented 3 years ago

Github’s comparison tool will always have problems when changing a large amount of code XD

SparrowLii commented 3 years ago

As in #1230, except for the following instructions and those use 16-bit floating-point, other instructions have been implemented:

  1. The following instructions are only available in aarch64 now, because the corresponding target_feature cannot be found in the available features of arm: vcadd_rotvcmlavdot

  2. The feature i8mm is not valid: vmmlavusmmla: https://rust.godbolt.org/z/8GbKW5ef4

  3. LLVM ERROR(Can be reproduced in godbolt): vsm4e: https://rust.godbolt.org/z/xhT1xvGTP

  4. LLVM ERROR(Normal in gotbolt, but LLVM ERROR: Cannot select: intrinsic raises at runtime) vsudotvusdot: https://rust.godbolt.org/z/aMnEvab3n vqshlu: https://rust.godbolt.org/z/hvGhrhdMT

  5. Not implmented in LLVM and cannot be implemented manually: vmull_p64(for arm)、vsm3vrax1q_u64vxarq_u64vrnd32vrnd64vsha512

Amanieu commented 3 years ago

As in #1230, except for the following instructions and those use 16-bit floating-point, other instructions have been implemented:

1. The following instructions are only available in aarch64 now, because the corresponding `target_feature` cannot be found in the available features of arm:
   `vcadd_rot`、`vcmla`、`vdot`

On LLVM's ARM backend, vcadd_rot and vcmla are under the v8.3a feature. vdot is under the dotprod feature. I got this information from llvm-project/llvm/lib/Target/ARM/ARMInstrNEON.td.

2. The feature `i8mm` is not valid:
   `vmmla`、`vusmmla`: [rust.godbolt.org/z/8GbKW5ef4](https://rust.godbolt.org/z/8GbKW5ef4)

Already discussed in https://github.com/rust-lang/rust/pull/90079.

3. LLVM ERROR(Can be reproduced in godbolt):
   `vsm4e`: [rust.godbolt.org/z/xhT1xvGTP](https://rust.godbolt.org/z/xhT1xvGTP)

Use llvm.aarch64.crypto.sm4ekey instead of llvm.aarch64.sve.sm4ekey.

4. LLVM ERROR(Normal in gotbolt, but `LLVM ERROR: Cannot select: intrinsic` raises at runtime)
   `vsudot`、`vusdot`: [rust.godbolt.org/z/aMnEvab3n](https://rust.godbolt.org/z/aMnEvab3n)
   `vqshlu`: [rust.godbolt.org/z/hvGhrhdMT](https://rust.godbolt.org/z/hvGhrhdMT)

You need to make you test function pub in godbolt, otherwise it will be optimized away as unreachable by rustc before LLVM.

vsudot/vusdot require the i8mm target feature. vqshlu seems to work fine in godbolt after changing the pub.

5. Not implmented in LLVM and cannot be implemented manually:
   `vmull_p64`(for arm)、`vsm3`、`vrax1q_u64`、`vxarq_u64`、`vrnd32`、`vrnd64`、`vsha512`

These all seem to exist in LLVM at least for AArch64. For ARM we can just leave these out for now.

SparrowLii commented 3 years ago

Hope someone can help implement the remaining instructions.

SparrowLii commented 2 years ago

@Amanieu v8.5a feature is non-runtime detected so we can't use #[simd_test(enable = "neon,v8.5a")]. So how do we add tests for instructions that use v8.5a, like vrnd32x and vrnd64x?

hkratz commented 2 years ago

@SparrowLii Shouldn't that work with the frintts feature?

SparrowLii commented 2 years ago

@SparrowLii Shouldn't that work with the frintts feature?

Looks useful: https://rust.godbolt.org/z/894W8cndG

Amanieu commented 2 years ago

LLVM only supports frintts on AArch64, so it's fine to not support this intrinsic on ARM.