rust-lang / stdarch

Rust's standard library vendor-specific APIs and run-time feature detection
https://doc.rust-lang.org/stable/core/arch/
Apache License 2.0
611 stars 269 forks source link

p64 load/store intrinsics not properly inlined on arm #1236

Open hkratz opened 3 years ago

hkratz commented 3 years ago

The following tests fail the inlining check on arm (but pass on aarch64):

    core_arch::arm_shared::neon::generated::assert_vld1_p64_x2_vld1
    core_arch::arm_shared::neon::generated::assert_vld1_p64_x3_nop
    core_arch::arm_shared::neon::generated::assert_vld1_p64_x4_nop
    core_arch::arm_shared::neon::generated::assert_vld1q_p64_x2_nop
    core_arch::arm_shared::neon::generated::assert_vld1q_p64_x3_nop
    core_arch::arm_shared::neon::generated::assert_vld1q_p64_x4_nop
    core_arch::arm_shared::neon::generated::assert_vld2_dup_p64_nop
    core_arch::arm_shared::neon::generated::assert_vld2_p64_nop
    core_arch::arm_shared::neon::generated::assert_vld3_dup_p64_nop
    core_arch::arm_shared::neon::generated::assert_vld3_p64_nop
    core_arch::arm_shared::neon::generated::assert_vld4_dup_p64_nop
    core_arch::arm_shared::neon::generated::assert_vld4_p64_nop
    core_arch::arm_shared::neon::generated::assert_vst1_p64_x2_vst1
    core_arch::arm_shared::neon::generated::assert_vst1_p64_x3_nop
    core_arch::arm_shared::neon::generated::assert_vst1_p64_x4_nop
    core_arch::arm_shared::neon::generated::assert_vst1q_p64_x2_nop
    core_arch::arm_shared::neon::generated::assert_vst1q_p64_x3_nop
    core_arch::arm_shared::neon::generated::assert_vst1q_p64_x4_nop
    core_arch::arm_shared::neon::generated::assert_vst2_p64_nop
    core_arch::arm_shared::neon::generated::assert_vst3_p64_nop
    core_arch::arm_shared::neon::generated::assert_vst4_p64_nop

Apparently we need an extra version of the load/store intrinsics we delegate those intrinsics to with matching feature flags, see https://godbolt.org/z/WxozeEPav

gendx commented 1 year ago

It looks like this is because the Rust compiler doesn't support inlining a function labeled with some target_feature(enable = "foo") into another function labeled with another set of enabled features (unless all the features in foo are enabled in compiler flags).

I've investigated this problem in this blog post, in particular this section on target_feature and this section on what it means for recursion. See also similar issues in https://github.com/rust-lang/rust/issues/54353 and https://github.com/rust-lang/rust/issues/53069.

Concretely, the problem is that you're trying to inline a function labeled neon,v7 within a function labeled neon,aes,v8. If I compile the first code with -C target-feature=+aes,+v7,+v8, then the intrinsic is inlined.

#[inline]  // This doesn't apply to a function labeled with other features.
#[target_feature(enable = "neon,v7")]
pub unsafe fn vld2_s64(a: *const i64) -> int64x1x2_t {
  ...
}

#[inline]
#[target_feature(enable = "neon,aes,v8"))]
pub unsafe fn vld2_p64(a: *const p64) -> poly64x1x2_t {
    transmute(vld2_s64(transmute(a)))
}

#[target_feature(enable = "neon,aes,v8"))]
#[inline(never)]
pub unsafe fn vld2_p64_testshim(ptr: *const p64) -> poly64x1x2_t {
    vld2_p64(ptr)
}