rust-lang / rustc_codegen_cranelift

Cranelift based backend for rustc
Apache License 2.0
1.52k stars 94 forks source link

unimplemented intrinsics for `matrixmultiply` #1405

Closed cuviper closed 7 months ago

cuviper commented 8 months ago

I have some code using ndarray dot products, which in turn calls matrixmultiply::sgemm or dgemm, and these trap when built with cranelift. Here's a reproducer:

cargo-features = ["codegen-backend"]

[package]
name = "dot"
edition = "2021"

[dependencies]
ndarray = "0.15.6"

[profile.dev]
codegen-backend = "cranelift"
#[test]
fn dot_f32() {
    let matrix = ndarray::Array2::<f32>::eye(10);
    let _ = matrix.dot(&matrix);
}

#[test]
fn dot_f64() {
    let matrix = ndarray::Array2::<f64>::eye(10);
    let _ = matrix.dot(&matrix);
}
$ cargo test
...
running 2 tests
trap at Instance { def: Item(DefId(2:14266 ~ core[53bd]::core_arch::x86::avx::_mm256_permute2f128_pd)), args: [3_i32] } (_ZN4core9core_arch3x863avx22_mm256_permute2f128_pd17h1ce919b2c8bdf956E): llvm.x86.avx.vperm2f128.pd.256
trap at Instance { def: Item(DefId(2:14264 ~ core[53bd]::core_arch::x86::avx::_mm256_permute2f128_ps)), args: [3_i32] } (_ZN4core9core_arch3x863avx22_mm256_permute2f128_ps17h5d0d25c7962691b9E): llvm.x86.avx.vperm2f128.ps.256
cuviper commented 8 months ago

FWIW, aarch64 also fails:

trap at Instance { def: Item(DefId(2:48675 ~ core[5761]::core_arch::aarch64::neon::generated::vfmaq_laneq_f32)), args: [0_i32] } (_ZN4core9core_arch7aarch644neon9generated15vfmaq_laneq_f3217h0dd8a28605cc03d6E): llvm.fma.v4f32
trap at Instance { def: Item(DefId(2:48683 ~ core[5761]::core_arch::aarch64::neon::generated::vfmaq_laneq_f64)), args: [0_i32] } (_ZN4core9core_arch7aarch644neon9generated15vfmaq_laneq_f6417ha68b5ef2dcd31872E): llvm.fma.v2f64
cuviper commented 8 months ago

Directly compiling matrixmultiply shows warnings about these intrinsics, but at least there are no more.

Aarch64:

warning: unsupported llvm intrinsic llvm.fma.v4f32; replacing with trap

warning: unsupported llvm intrinsic llvm.fma.v2f64; replacing with trap

x86_64:

warning: unsupported x86 llvm intrinsic llvm.x86.avx.vperm2f128.pd.256; replacing with trap

warning: unsupported x86 llvm intrinsic llvm.x86.avx.vperm2f128.ps.256; replacing with trap
bjorn3 commented 8 months ago

Implemented llvm.fma.v* in https://github.com/rust-lang/rustc_codegen_cranelift/commit/48ca2d9703742149aa33b3f84ae933d063213d19. On AArch64 with this fix the only remaining ndarray test failures are: insert_axis, insert_axis_f and test_multislice_intersecting. Based on the panic message for those remaining test failures I think there is a miscompilation of those tests though.

Edit: Seems those are actually tests that use catch_unwind, which doesn't work because of panic=abort.

bjorn3 commented 8 months ago

I wrote an entire comment about how I couldn't reproduce any crash on x86 and then I tried using the rustup version instead of the version built from this repo, which did indeed crash with this error message. I'm currently investigating what the difference between the two is that could have caused this.

cuviper commented 8 months ago

Ah, yes I'm using the rustup component, as of:

$ rustc +nightly -Vv
rustc 1.75.0-nightly (31bc7e2c4 2023-10-30)
binary: rustc
commit-hash: 31bc7e2c47e82798a392c770611975a6883132c8
commit-date: 2023-10-30
host: x86_64-unknown-linux-gnu
release: 1.75.0-nightly
LLVM version: 17.0.3
bjorn3 commented 8 months ago

It seems like is_x86_feature_detected!() is broken when using a cg_clif compiled libstd, causing matrixmultiply to disable some tests because it thinks AVX and FMA are not supported.

bjorn3 commented 8 months ago

I think I know the issue. std_detect::detect::os::x86::detect_features depends on _xgetbv() to see if the OS supports AVX. _xgetbv is implemented using the llvm.x86.xgetbv LLVM intrinsic rather than an asm!() block. Because it isn't supported natively by Cranelift, I implemented it using a dummy value of 1.

bjorn3 commented 8 months ago

Just a quick update. I have _xgetbv correctly implemented now. I've been working on implementing _mm256_permute2f128_ps and _mm256_permute2f128_pd and got a miscompilation right now that I need to fix.

bjorn3 commented 8 months ago

Got matrixmultiply working correctly in the implement_xgetbv branch. You can download a precompiled version from https://github.com/rust-lang/rustc_codegen_cranelift/actions/runs/6763047493 once it is done. I will probably work on implementing the rest of the reported missing intrinsics from other issues before opening a PR.

bjorn3 commented 7 months ago

Should be fixed in the latest nightly.

cuviper commented 7 months ago

Confirmed, thanks!