verpeteren / rust-simd-noise

SIMD noise library for Rust
https://crates.io/crates/simdnoise
250 stars 20 forks source link

Stack overflow with 3D noise functions with AVX2 (Windows) #38

Open jetp250 opened 3 years ago

jetp250 commented 3 years ago

Any attempt to generate 3D noise with AVX2 (which is selected by default, because my CPU supports AVX2) ends up in a stack overflow, no matter the number of noise values I try to generate.

The most minimalistic setup for reproducing the issue would be

use simdnoise::NoiseBuilder;

fn main() {
    let noise_gen = NoiseBuilder::gradient_3d(1,  1,  1).generate();
}

on an AVX2-enabled device, or if unsupported, the problem also occurs when calling the noise function directly:

use simdnoise::NoiseBuilder;

fn main() {
    let settings = NoiseBuilder::gradient_3d(1,  1,  1).wrap();
    let noise = unsafe { simdnoise::avx2::get_3d_noise(&settings) };
}

I tested several types of 3D noise, both with and without offsets and with varying dimensions, but all end up in a stack overflow - however, I couldn't find the actual source of the error despite skimming the code (looked at get_3d_noise_helper! especially). The issue only occurs when you actually request to generate the noise, so the noise builder I think can be ruled out :thinking:

Scariest of all though - neither LLDB nor GDB support MSVC debug symbols, so I compiled the same code with the x86_64-pc-windows-gnu toolchain, and there the same code does work! Kind of unfortunate as that means I seem to have no way to debug the source.


Environment: Windows 10 Home, version 20H2 (OS Build 19042.1165) Intel(R) Core(TM) i5-4460 CPU@ 3.20GHz rustc 1.56.0-nightly (2d2bc94c8 2021-08-15)

Cargo.toml:

[package]
name = "issue-testing"
version = "0.1.0"
edition = "2018"

[dependencies]
simdnoise = { git = "https://github.com/jackmott/rust-simd-noise", rev = "3a4f3e6f79608616b6ee186dc665b601d015dc1e" }

(the rev specified is the current latest commit)

Tested a commit from before the update of simdeez to 1.0.7, same result, so doesn't seem like updating simdeez was the cause. Tested current crates.io release (3.1.6) and did not have the problem. Something after that then!

jetp250 commented 3 years ago

I tracked the first commit to cause the issue to this. Investigating

jetp250 commented 3 years ago

I don't understand 😕 Calling

let noise = unsafe { simdnoise::avx2::get_3d_noise(&noise_gen) };

calls (link)

#[target_feature(enable = "avx2")]
pub unsafe fn get_3d_noise(noise_type: &NoiseType) -> (Vec<f32>, f32, f32) {
    noise_helpers::get_3d_noise::<Avx2>(noise_type)
}

which in turn calls this, which for gradient_3d matches GradientSettings and calls get_3d_noise_helper! with, in this case, simplex::simplex_3d<Avx2>:

NoiseType::Gradient(s) => get_3d_noise_helper!(s, simplex_3d::<S>, s.dim.seed),

I wanted to see how far the control flow gets inside the get_3d_noise_helper! macro, so I copied it and modified it to not use the GradientSettings (because fields are private, can't work around that). That only required me to specify the frequency, noise start and dimensions myself, but having done that I.. didn't get the error anymore. Not even with the MSVC toolchain. Also verified I'm still using the version of the library that has the problem (there's a commented-out line at the end which I used to make sure it still overflows the stack). I'm totally missing something here.

I also looked and looked in the commit for recursive function calls and stack arrays, but no luck, I'm no wiser about what's causing this yet. I'm pretty sure it is a recursive function call though, but not sure why I keep missing it.

Blatko1 commented 3 years ago

Same problem here.

parasyte commented 2 years ago

I cannot reproduce this, (edit: oh, yes I can! I was using the crates.io release. The current git rev has this bug.) but I am concerned that this bug is still an issue. My CPU is a Ryzen 9 5900X, which supports AVX2. I'm using Windows 11 and Visual Studio 2019.

Both minimal examples with an additional println!("noise: {:#?}", noise); just prints this, in both debug and release builds:

noise: (
    [
        0.0,
    ],
    0.0,
    0.0,
)

Scariest of all though - neither LLDB nor GDB support MSVC debug symbols, so I compiled the same code with the x86_64-pc-windows-gnu toolchain, and there the same code does work! Kind of unfortunate as that means I seem to have no way to debug the source.

FWIW, you should be able to debug the MSVC bin with WinDbg.


Analysis is provided below. The stack trace is:

 # Child-SP          RetAddr               Call Site
00 000000b6`1a0ff0f0 00007ff7`5717170c     noise_test!__chkstk(void)+0x37 [d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\misc\amd64\chkstk.asm @ 109] 
01 000000b6`1a0ff108 00007ff7`570c2fee     noise_test!simdnoise::avx2::get_3d_noise(union enum$<simdnoise::NoiseType> * noise_type = <Value unavailable error>)+0xc [C:\Users\jay\.cargo\git\checkouts\rust-simd-noise-59bcd94fdfcfddf8\3a4f3e6\src\avx2.rs @ 477] 
02 000000b6`1a0ff120 00007ff7`570c17b4     noise_test!simdnoise::GradientSettings::generate(struct simdnoise::GradientSettings * self = 0x000000b6`1a0ff530)+0x15e [C:\Users\jay\.cargo\git\checkouts\rust-simd-noise-59bcd94fdfcfddf8\3a4f3e6\src\lib.rs @ 1026] 
03 000000b6`1a0ff4b0 00007ff7`570c102b     noise_test!noise_test::main(void)+0x44 [C:\Users\jay\Desktop\noise-test\src\main.rs @ 7] 
04 000000b6`1a0ff5e0 00007ff7`570c12cb     noise_test!core::ops::function::FnOnce::call_once<void (void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\core\src\ops\function.rs @ 227] 
05 000000b6`1a0ff620 00007ff7`570c1751     noise_test!std::sys_common::backtrace::__rust_begin_short_backtrace<void (<function> * f = 0x00007ff7`570c1770)+0x1b [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\std\src\sys_common\backtrace.rs @ 126] 
06 000000b6`1a0ff660 00007ff7`570c8ccf     noise_test!std::rt::lang_start::closure$0<tuple$<> >(void)+0x11 [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\std\src\rt.rs @ 145] 
07 (Inline Function) --------`--------     noise_test!core::ops::function::impls::impl$2::call_once(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\core\src\ops\function.rs @ 259] 
08 (Inline Function) --------`--------     noise_test!std::panicking::try::do_call(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panicking.rs @ 485] 
09 (Inline Function) --------`--------     noise_test!std::panicking::try(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panicking.rs @ 449] 
0a (Inline Function) --------`--------     noise_test!std::panic::catch_unwind(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panic.rs @ 136] 
0b (Inline Function) --------`--------     noise_test!std::rt::lang_start_internal::closure$2(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\rt.rs @ 128] 
0c (Inline Function) --------`--------     noise_test!std::panicking::try::do_call(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panicking.rs @ 485] 
0d (Inline Function) --------`--------     noise_test!std::panicking::try(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panicking.rs @ 449] 
0e (Inline Function) --------`--------     noise_test!std::panic::catch_unwind(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panic.rs @ 136] 
0f 000000b6`1a0ff6a0 00007ff7`570c171f     noise_test!std::rt::lang_start_internal(void)+0x11f [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\rt.rs @ 128] 
10 000000b6`1a0ff780 00007ff7`570c1886     noise_test!std::rt::lang_start<tuple$<> >(<function> * main = 0x00007ff7`570c1770, int64 argc = 0n1, unsigned char ** argv = 0x0000022d`bd496f50)+0x2f [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\std\src\rt.rs @ 144] 
11 000000b6`1a0ff7e0 00007ff7`572ffb4c     noise_test!main+0x16
12 (Inline Function) --------`--------     noise_test!invoke_main(void)+0x22 [d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 78] 
13 000000b6`1a0ff810 00007ff9`7e3854e0     noise_test!__scrt_common_main_seh(void)+0x10c [d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288] 
14 000000b6`1a0ff850 00007ff9`7f30485b     KERNEL32!BaseThreadInitThunk+0x10
15 000000b6`1a0ff880 00000000`00000000     ntdll!RtlUserThreadStart+0x2b

No surprises here, this is exactly what you already pointed out above.

Because avx2::get_3d_noise() is not a leaf function and there is no recursion, that means the stack overflow is a result of overly aggressive inlining. (See https://codywu2010.wordpress.com/2010/10/04/__chkstk-and-stack-overflow/ which discusses the _chkstk function and https://docs.microsoft.com/en-us/windows/win32/devnotes/-win32-chkstk) These annotations are putting more than 1 page (8 KB on 64-bit Windows) of local variables into a single stack frame which triggers the stack guard page checks. It's not that the entirety of space allocated for the stack has been consumed, just one mega-sized stack frame that the runtime rejects as an error.

For instance, this simple patch fixes the minimal test case for me:

diff --git a/src/simplex.rs b/src/simplex.rs
index 1b7c9e7..224171d 100644
--- a/src/simplex.rs
+++ b/src/simplex.rs
@@ -482,7 +482,7 @@ pub unsafe fn simplex_3d<S: Simd>(x: S::Vf32, y: S::Vf32, z: S::Vf32, seed: i32)
 }

 /// Like `simplex_3d`, but also computes the derivative
-#[inline(always)]
+// #[inline(always)]
 pub unsafe fn simplex_3d_deriv<S: Simd>(
     x: S::Vf32,
     y: S::Vf32,

I was curious to see just what kind of affect these inline annotations have, so I updated the benchmark and ran a comparison with and without the annotations:

Benchmark patch ```patch diff --git a/benches/benchmarks.rs b/benches/benchmarks.rs index 2f714f5..b4c2ae7 100644 --- a/benches/benchmarks.rs +++ b/benches/benchmarks.rs @@ -1,142 +1,145 @@ -extern crate criterion; -extern crate simdeez; -extern crate simdnoise; use criterion::*; use simdnoise::*; use std::time::Duration; fn d4(c: &mut Criterion) { let setting = NoiseBuilder::fbm_4d(8, 8, 8, 8).wrap(); - c.bench( - "fbm4d", - Benchmark::new("scalar 4d", move |b| { - b.iter(|| unsafe { scalar::get_4d_noise(&setting) }) - }) - .with_function("sse2 4d", move |b| { - b.iter(|| unsafe { sse2::get_4d_noise(&setting) }) - }) - .with_function("sse41 4d", move |b| { - b.iter(|| unsafe { sse41::get_4d_noise(&setting) }) - }) - .with_function("avx2 4d", move |b| { - b.iter(|| unsafe { avx2::get_4d_noise(&setting) }) - }) + let mut group = c.benchmark_group("fbm4d"); + + group .sample_size(10) .warm_up_time(Duration::from_millis(1)) - .measurement_time(Duration::from_secs(5)), - ); + .measurement_time(Duration::from_secs(5)); + group.bench_function("scalar 4d", |b| { + b.iter(|| unsafe { scalar::get_4d_noise(&setting) }) + }); + group.bench_function("sse2 4d", |b| { + b.iter(|| unsafe { sse2::get_4d_noise(&setting) }) + }); + group.bench_function("sse41 4d", |b| { + b.iter(|| unsafe { sse41::get_4d_noise(&setting) }) + }); + group.bench_function("avx2 4d", |b| { + b.iter(|| unsafe { avx2::get_4d_noise(&setting) }) + }); + group.finish(); } fn d3(c: &mut Criterion) { let setting = NoiseBuilder::fbm_3d(64, 64, 64).wrap(); - c.bench( - "fbm3d", - Benchmark::new("scalar 3d", move |b| { - b.iter(|| unsafe { scalar::get_3d_noise(&setting) }) - }) - .with_function("sse2 3d", move |b| { - b.iter(|| unsafe { sse2::get_3d_noise(&setting) }) - }) - .with_function("sse41 3d", move |b| { - b.iter(|| unsafe { sse41::get_3d_noise(&setting) }) - }) - .with_function("avx2 3d", move |b| { - b.iter(|| unsafe { avx2::get_3d_noise(&setting) }) - }) + let mut group = c.benchmark_group("fbm3d"); + + group .sample_size(10) .warm_up_time(Duration::from_millis(1)) - .measurement_time(Duration::from_secs(5)), - ); + .measurement_time(Duration::from_secs(5)); + group.bench_function("scalar 3d", |b| { + b.iter(|| unsafe { scalar::get_3d_noise(&setting) }) + }); + group.bench_function("sse2 3d", |b| { + b.iter(|| unsafe { sse2::get_3d_noise(&setting) }) + }); + group.bench_function("sse41 3d", |b| { + b.iter(|| unsafe { sse41::get_3d_noise(&setting) }) + }); + group.bench_function("avx2 3d", |b| { + b.iter(|| unsafe { avx2::get_3d_noise(&setting) }) + }); + group.finish(); } fn d2(c: &mut Criterion) { let setting = NoiseBuilder::fbm_2d(3840, 2160).wrap(); - c.bench( - "fbm2d", - Benchmark::new("scalar 2d", move |b| { - b.iter(|| unsafe { scalar::get_2d_noise(&setting) }) - }) - .with_function("sse2 2d", move |b| { - b.iter(|| unsafe { sse2::get_2d_noise(&setting) }) - }) - .with_function("sse41 2d", move |b| { - b.iter(|| unsafe { sse41::get_2d_noise(&setting) }) - }) - .with_function("avx2 2d", move |b| { - b.iter(|| unsafe { avx2::get_2d_noise(&setting) }) - }) + let mut group = c.benchmark_group("fbm2d"); + + group .sample_size(10) .warm_up_time(Duration::from_millis(1)) - .measurement_time(Duration::from_secs(5)), - ); + .measurement_time(Duration::from_secs(5)); + group.bench_function("scalar 2d", |b| { + b.iter(|| unsafe { scalar::get_2d_noise(&setting) }) + }); + group.bench_function("sse2 2d", |b| { + b.iter(|| unsafe { sse2::get_2d_noise(&setting) }) + }); + group.bench_function("sse41 2d", |b| { + b.iter(|| unsafe { sse41::get_2d_noise(&setting) }) + }); + group.bench_function("avx2 2d", |b| { + b.iter(|| unsafe { avx2::get_2d_noise(&setting) }) + }); + group.finish(); } fn d1(c: &mut Criterion) { let setting = NoiseBuilder::fbm_1d(1024).wrap(); - c.bench( - "fbm1d", - Benchmark::new("scalar 1d", move |b| { - b.iter(|| unsafe { scalar::get_1d_noise(&setting) }) - }) - .with_function("sse2 1d", move |b| { - b.iter(|| unsafe { sse2::get_1d_noise(&setting) }) - }) - .with_function("sse41 1d", move |b| { - b.iter(|| unsafe { sse41::get_1d_noise(&setting) }) - }) - .with_function("avx2 1d", move |b| { - b.iter(|| unsafe { avx2::get_1d_noise(&setting) }) - }) + let mut group = c.benchmark_group("fbm1d"); + + group .sample_size(10) .warm_up_time(Duration::from_millis(1)) - .measurement_time(Duration::from_secs(5)), - ); + .measurement_time(Duration::from_secs(5)); + group.bench_function("scalar 1d", |b| { + b.iter(|| unsafe { scalar::get_1d_noise(&setting) }) + }); + group.bench_function("sse2 1d", |b| { + b.iter(|| unsafe { sse2::get_1d_noise(&setting) }) + }); + group.bench_function("sse41 1d", |b| { + b.iter(|| unsafe { sse41::get_1d_noise(&setting) }) + }); + group.bench_function("avx2 1d", |b| { + b.iter(|| unsafe { avx2::get_1d_noise(&setting) }) + }); + group.finish(); } fn d2_cell(c: &mut Criterion) { let setting = NoiseBuilder::cellular_2d(1024, 1024) .with_return_type(CellReturnType::CellValue) .wrap(); - c.bench( - "cellular2d", - Benchmark::new("scalar 2d", move |b| { - b.iter(|| unsafe { scalar::get_2d_noise(&setting) }) - }) - .with_function("sse2 2d", move |b| { - b.iter(|| unsafe { sse2::get_2d_noise(&setting) }) - }) - .with_function("sse41 2d", move |b| { - b.iter(|| unsafe { sse41::get_2d_noise(&setting) }) - }) - .with_function("avx2 2d", move |b| { - b.iter(|| unsafe { avx2::get_2d_noise(&setting) }) - }) + let mut group = c.benchmark_group("cellular2d"); + + group .sample_size(10) .warm_up_time(Duration::from_millis(1)) - .measurement_time(Duration::from_secs(5)), - ); + .measurement_time(Duration::from_secs(5)); + group.bench_function("scalar 2d", |b| { + b.iter(|| unsafe { scalar::get_2d_noise(&setting) }) + }); + group.bench_function("sse2 2d", |b| { + b.iter(|| unsafe { sse2::get_2d_noise(&setting) }) + }); + group.bench_function("sse41 2d", |b| { + b.iter(|| unsafe { sse41::get_2d_noise(&setting) }) + }); + group.bench_function("avx2 2d", |b| { + b.iter(|| unsafe { avx2::get_2d_noise(&setting) }) + }); + group.finish(); } fn d3_cell(c: &mut Criterion) { let setting = NoiseBuilder::cellular_3d(128, 128, 128).wrap(); - c.bench( - "cellular3d", - Benchmark::new("scalar 3d", move |b| { - b.iter(|| unsafe { scalar::get_3d_noise(&setting) }) - }) - .with_function("sse2 3d", move |b| { - b.iter(|| unsafe { sse2::get_3d_noise(&setting) }) - }) - .with_function("sse41 3d", move |b| { - b.iter(|| unsafe { sse41::get_3d_noise(&setting) }) - }) - .with_function("avx2 3d", move |b| { - b.iter(|| unsafe { avx2::get_3d_noise(&setting) }) - }) + let mut group = c.benchmark_group("cellular3d"); + + group .sample_size(10) .warm_up_time(Duration::from_millis(1)) - .measurement_time(Duration::from_secs(5)), - ); + .measurement_time(Duration::from_secs(5)); + group.bench_function("scalar 3d", |b| { + b.iter(|| unsafe { scalar::get_3d_noise(&setting) }) + }); + group.bench_function("sse2 3d", |b| { + b.iter(|| unsafe { sse2::get_3d_noise(&setting) }) + }); + group.bench_function("sse41 3d", |b| { + b.iter(|| unsafe { sse41::get_3d_noise(&setting) }) + }); + group.bench_function("avx2 3d", |b| { + b.iter(|| unsafe { avx2::get_3d_noise(&setting) }) + }); + group.finish(); } criterion_group!(benches, d4, d3, d2, d1, d2_cell, d3_cell); ```
fbm4d/scalar 4d         time:   [559.84 us 561.27 us 564.02 us]
                        change: [+7.3086% +7.9727% +8.7299%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm4d/sse2 4d           time:   [376.70 us 377.44 us 378.55 us]
                        change: [+12.348% +13.389% +14.533%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm4d/sse41 4d          time:   [191.96 us 192.67 us 193.38 us]
                        change: [+3.3190% +4.2364% +5.1170%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm4d/avx2 4d           time:   [144.58 us 144.78 us 145.10 us]
                        change: [+14.644% +15.349% +16.104%] (p = 0.00 < 0.05)
                        Performance has regressed.

fbm3d/scalar 3d         time:   [17.098 ms 17.119 ms 17.143 ms]
                        change: [+0.3933% +1.1009% +1.9784%] (p = 0.01 < 0.05)
                        Change within noise threshold.
fbm3d/sse2 3d           time:   [5.5870 ms 5.5975 ms 5.6081 ms]
                        change: [+8.1536% +8.8993% +9.6124%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm3d/sse41 3d          time:   [4.7534 ms 4.7637 ms 4.7759 ms]
                        change: [+15.874% +16.456% +17.040%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm3d/avx2 3d           time:   [2.8044 ms 2.8118 ms 2.8205 ms]
                        change: [+26.310% +27.090% +27.863%] (p = 0.00 < 0.05)
                        Performance has regressed.

fbm2d/scalar 2d         time:   [514.54 ms 515.40 ms 516.31 ms]
                        change: [-9.4867% -9.2846% -9.0827%] (p = 0.00 < 0.05)
                        Performance has improved.
fbm2d/sse2 2d           time:   [202.54 ms 202.87 ms 203.18 ms]
                        change: [+11.639% +11.962% +12.283%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm2d/sse41 2d          time:   [150.13 ms 150.32 ms 150.62 ms]
                        change: [+14.476% +15.026% +15.548%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm2d/avx2 2d           time:   [96.266 ms 96.594 ms 96.832 ms]
                        change: [+12.907% +13.376% +13.790%] (p = 0.00 < 0.05)
                        Performance has regressed.

fbm1d/scalar 1d         time:   [18.709 us 18.766 us 18.820 us]
                        change: [+7.0296% +7.8178% +8.6081%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm1d/sse2 1d           time:   [6.6314 us 6.6446 us 6.6641 us]
                        change: [+30.749% +31.958% +33.139%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm1d/sse41 1d          time:   [5.3461 us 5.3539 us 5.3688 us]
                        change: [+40.596% +41.364% +42.113%] (p = 0.00 < 0.05)
                        Performance has regressed.
fbm1d/avx2 1d           time:   [3.8761 us 3.8805 us 3.8881 us]
                        change: [+35.649% +36.677% +38.087%] (p = 0.00 < 0.05)
                        Performance has regressed.

cellular2d/scalar 2d    time:   [274.09 ms 274.46 ms 274.84 ms]
                        change: [+0.7918% +0.9703% +1.1463%] (p = 0.00 < 0.05)
                        Change within noise threshold.
cellular2d/sse2 2d      time:   [10.419 ms 10.461 ms 10.495 ms]
                        change: [+8.2019% +8.9270% +9.6900%] (p = 0.00 < 0.05)
                        Performance has regressed.
cellular2d/sse41 2d     time:   [8.1566 ms 8.1836 ms 8.2058 ms]
                        change: [+10.241% +10.956% +11.597%] (p = 0.00 < 0.05)
                        Performance has regressed.
cellular2d/avx2 2d      time:   [4.5678 ms 4.6034 ms 4.6396 ms]
                        change: [+15.234% +16.310% +17.391%] (p = 0.00 < 0.05)
                        Performance has regressed.

cellular3d/scalar 3d    time:   [1.6968 s 1.7002 s 1.7043 s]
                        change: [-0.2821% -0.0706% +0.1721%] (p = 0.58 > 0.05)
                        No change in performance detected.
cellular3d/sse2 3d      time:   [58.865 ms 58.965 ms 59.092 ms]
                        change: [+4.6021% +5.0566% +5.5566%] (p = 0.00 < 0.05)
                        Performance has regressed.
cellular3d/sse41 3d     time:   [44.762 ms 44.824 ms 44.932 ms]
                        change: [+9.2326% +9.9355% +10.525%] (p = 0.00 < 0.05)
                        Performance has regressed.
cellular3d/avx2 3d      time:   [21.976 ms 22.017 ms 22.072 ms]
                        change: [+4.5154% +5.0129% +5.4665%] (p = 0.00 < 0.05)
                        Performance has regressed.

Clearly, the inlining improves performance. There is no doubt about that. But I am concerned about a few things:

  1. The sample size is really small, so results like fbm1d/sse41 1d with a +41.36% regression are dubious. 5.3539 us is fast enough. The benchmark is 1.5639 us slower, but this is meaningless.
  2. In terms of absolute magnitude, the biggest regression looks like it was fbm2d/sse41 2d, which became 20 ms slower.
  3. I would prefer the noise computation in my app to not cause an unavoidable crash, but maybe that's just me. It honestly feels like this is a showstopper and making one of the SSE4.1 benchmarks 20 ms slower is probably worth actually being able to use this crate.

IMHO, delete all of the #[inline(always)] annotations (or at least audit these). Mostly just let LLVM inline functions when it knows it can.

parasyte commented 2 years ago

I am pretty confident this is also the cause of #25, based on the stack trace provided in that thread.