Open jetp250 opened 3 years ago
I don't understand 😕 Calling
let noise = unsafe { simdnoise::avx2::get_3d_noise(&noise_gen) };
calls (link)
#[target_feature(enable = "avx2")]
pub unsafe fn get_3d_noise(noise_type: &NoiseType) -> (Vec<f32>, f32, f32) {
noise_helpers::get_3d_noise::<Avx2>(noise_type)
}
which in turn calls this, which for gradient_3d matches GradientSettings and calls get_3d_noise_helper!
with, in this case, simplex::simplex_3d<Avx2>
:
NoiseType::Gradient(s) => get_3d_noise_helper!(s, simplex_3d::<S>, s.dim.seed),
I wanted to see how far the control flow gets inside the get_3d_noise_helper!
macro, so I copied it and modified it to not use the GradientSettings (because fields are private, can't work around that). That only required me to specify the frequency, noise start and dimensions myself, but having done that I.. didn't get the error anymore. Not even with the MSVC toolchain. Also verified I'm still using the version of the library that has the problem (there's a commented-out line at the end which I used to make sure it still overflows the stack). I'm totally missing something here.
I also looked and looked in the commit for recursive function calls and stack arrays, but no luck, I'm no wiser about what's causing this yet. I'm pretty sure it is a recursive function call though, but not sure why I keep missing it.
Same problem here.
I cannot reproduce this, (edit: oh, yes I can! I was using the crates.io release. The current git rev has this bug.) but I am concerned that this bug is still an issue. My CPU is a Ryzen 9 5900X, which supports AVX2. I'm using Windows 11 and Visual Studio 2019.
Both minimal examples with an additional println!("noise: {:#?}", noise);
just prints this, in both debug and release builds:
noise: (
[
0.0,
],
0.0,
0.0,
)
Scariest of all though - neither LLDB nor GDB support MSVC debug symbols, so I compiled the same code with the
x86_64-pc-windows-gnu
toolchain, and there the same code does work! Kind of unfortunate as that means I seem to have no way to debug the source.
FWIW, you should be able to debug the MSVC bin with WinDbg.
Analysis is provided below. The stack trace is:
# Child-SP RetAddr Call Site
00 000000b6`1a0ff0f0 00007ff7`5717170c noise_test!__chkstk(void)+0x37 [d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\misc\amd64\chkstk.asm @ 109]
01 000000b6`1a0ff108 00007ff7`570c2fee noise_test!simdnoise::avx2::get_3d_noise(union enum$<simdnoise::NoiseType> * noise_type = <Value unavailable error>)+0xc [C:\Users\jay\.cargo\git\checkouts\rust-simd-noise-59bcd94fdfcfddf8\3a4f3e6\src\avx2.rs @ 477]
02 000000b6`1a0ff120 00007ff7`570c17b4 noise_test!simdnoise::GradientSettings::generate(struct simdnoise::GradientSettings * self = 0x000000b6`1a0ff530)+0x15e [C:\Users\jay\.cargo\git\checkouts\rust-simd-noise-59bcd94fdfcfddf8\3a4f3e6\src\lib.rs @ 1026]
03 000000b6`1a0ff4b0 00007ff7`570c102b noise_test!noise_test::main(void)+0x44 [C:\Users\jay\Desktop\noise-test\src\main.rs @ 7]
04 000000b6`1a0ff5e0 00007ff7`570c12cb noise_test!core::ops::function::FnOnce::call_once<void (void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\core\src\ops\function.rs @ 227]
05 000000b6`1a0ff620 00007ff7`570c1751 noise_test!std::sys_common::backtrace::__rust_begin_short_backtrace<void (<function> * f = 0x00007ff7`570c1770)+0x1b [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\std\src\sys_common\backtrace.rs @ 126]
06 000000b6`1a0ff660 00007ff7`570c8ccf noise_test!std::rt::lang_start::closure$0<tuple$<> >(void)+0x11 [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\std\src\rt.rs @ 145]
07 (Inline Function) --------`-------- noise_test!core::ops::function::impls::impl$2::call_once(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\core\src\ops\function.rs @ 259]
08 (Inline Function) --------`-------- noise_test!std::panicking::try::do_call(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panicking.rs @ 485]
09 (Inline Function) --------`-------- noise_test!std::panicking::try(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panicking.rs @ 449]
0a (Inline Function) --------`-------- noise_test!std::panic::catch_unwind(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panic.rs @ 136]
0b (Inline Function) --------`-------- noise_test!std::rt::lang_start_internal::closure$2(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\rt.rs @ 128]
0c (Inline Function) --------`-------- noise_test!std::panicking::try::do_call(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panicking.rs @ 485]
0d (Inline Function) --------`-------- noise_test!std::panicking::try(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panicking.rs @ 449]
0e (Inline Function) --------`-------- noise_test!std::panic::catch_unwind(void)+0xb [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\panic.rs @ 136]
0f 000000b6`1a0ff6a0 00007ff7`570c171f noise_test!std::rt::lang_start_internal(void)+0x11f [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\/library\std\src\rt.rs @ 128]
10 000000b6`1a0ff780 00007ff7`570c1886 noise_test!std::rt::lang_start<tuple$<> >(<function> * main = 0x00007ff7`570c1770, int64 argc = 0n1, unsigned char ** argv = 0x0000022d`bd496f50)+0x2f [/rustc/6abb6385b2cb7249f67b9b3ce7522527767dd907\library\std\src\rt.rs @ 144]
11 000000b6`1a0ff7e0 00007ff7`572ffb4c noise_test!main+0x16
12 (Inline Function) --------`-------- noise_test!invoke_main(void)+0x22 [d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 78]
13 000000b6`1a0ff810 00007ff9`7e3854e0 noise_test!__scrt_common_main_seh(void)+0x10c [d:\a01\_work\6\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288]
14 000000b6`1a0ff850 00007ff9`7f30485b KERNEL32!BaseThreadInitThunk+0x10
15 000000b6`1a0ff880 00000000`00000000 ntdll!RtlUserThreadStart+0x2b
No surprises here, this is exactly what you already pointed out above.
Because avx2::get_3d_noise()
is not a leaf function and there is no recursion, that means the stack overflow is a result of overly aggressive inlining. (See https://codywu2010.wordpress.com/2010/10/04/__chkstk-and-stack-overflow/ which discusses the _chkstk
function and https://docs.microsoft.com/en-us/windows/win32/devnotes/-win32-chkstk) These annotations are putting more than 1 page (8 KB on 64-bit Windows) of local variables into a single stack frame which triggers the stack guard page checks. It's not that the entirety of space allocated for the stack has been consumed, just one mega-sized stack frame that the runtime rejects as an error.
For instance, this simple patch fixes the minimal test case for me:
diff --git a/src/simplex.rs b/src/simplex.rs
index 1b7c9e7..224171d 100644
--- a/src/simplex.rs
+++ b/src/simplex.rs
@@ -482,7 +482,7 @@ pub unsafe fn simplex_3d<S: Simd>(x: S::Vf32, y: S::Vf32, z: S::Vf32, seed: i32)
}
/// Like `simplex_3d`, but also computes the derivative
-#[inline(always)]
+// #[inline(always)]
pub unsafe fn simplex_3d_deriv<S: Simd>(
x: S::Vf32,
y: S::Vf32,
I was curious to see just what kind of affect these inline
annotations have, so I updated the benchmark and ran a comparison with and without the annotations:
fbm4d/scalar 4d time: [559.84 us 561.27 us 564.02 us]
change: [+7.3086% +7.9727% +8.7299%] (p = 0.00 < 0.05)
Performance has regressed.
fbm4d/sse2 4d time: [376.70 us 377.44 us 378.55 us]
change: [+12.348% +13.389% +14.533%] (p = 0.00 < 0.05)
Performance has regressed.
fbm4d/sse41 4d time: [191.96 us 192.67 us 193.38 us]
change: [+3.3190% +4.2364% +5.1170%] (p = 0.00 < 0.05)
Performance has regressed.
fbm4d/avx2 4d time: [144.58 us 144.78 us 145.10 us]
change: [+14.644% +15.349% +16.104%] (p = 0.00 < 0.05)
Performance has regressed.
fbm3d/scalar 3d time: [17.098 ms 17.119 ms 17.143 ms]
change: [+0.3933% +1.1009% +1.9784%] (p = 0.01 < 0.05)
Change within noise threshold.
fbm3d/sse2 3d time: [5.5870 ms 5.5975 ms 5.6081 ms]
change: [+8.1536% +8.8993% +9.6124%] (p = 0.00 < 0.05)
Performance has regressed.
fbm3d/sse41 3d time: [4.7534 ms 4.7637 ms 4.7759 ms]
change: [+15.874% +16.456% +17.040%] (p = 0.00 < 0.05)
Performance has regressed.
fbm3d/avx2 3d time: [2.8044 ms 2.8118 ms 2.8205 ms]
change: [+26.310% +27.090% +27.863%] (p = 0.00 < 0.05)
Performance has regressed.
fbm2d/scalar 2d time: [514.54 ms 515.40 ms 516.31 ms]
change: [-9.4867% -9.2846% -9.0827%] (p = 0.00 < 0.05)
Performance has improved.
fbm2d/sse2 2d time: [202.54 ms 202.87 ms 203.18 ms]
change: [+11.639% +11.962% +12.283%] (p = 0.00 < 0.05)
Performance has regressed.
fbm2d/sse41 2d time: [150.13 ms 150.32 ms 150.62 ms]
change: [+14.476% +15.026% +15.548%] (p = 0.00 < 0.05)
Performance has regressed.
fbm2d/avx2 2d time: [96.266 ms 96.594 ms 96.832 ms]
change: [+12.907% +13.376% +13.790%] (p = 0.00 < 0.05)
Performance has regressed.
fbm1d/scalar 1d time: [18.709 us 18.766 us 18.820 us]
change: [+7.0296% +7.8178% +8.6081%] (p = 0.00 < 0.05)
Performance has regressed.
fbm1d/sse2 1d time: [6.6314 us 6.6446 us 6.6641 us]
change: [+30.749% +31.958% +33.139%] (p = 0.00 < 0.05)
Performance has regressed.
fbm1d/sse41 1d time: [5.3461 us 5.3539 us 5.3688 us]
change: [+40.596% +41.364% +42.113%] (p = 0.00 < 0.05)
Performance has regressed.
fbm1d/avx2 1d time: [3.8761 us 3.8805 us 3.8881 us]
change: [+35.649% +36.677% +38.087%] (p = 0.00 < 0.05)
Performance has regressed.
cellular2d/scalar 2d time: [274.09 ms 274.46 ms 274.84 ms]
change: [+0.7918% +0.9703% +1.1463%] (p = 0.00 < 0.05)
Change within noise threshold.
cellular2d/sse2 2d time: [10.419 ms 10.461 ms 10.495 ms]
change: [+8.2019% +8.9270% +9.6900%] (p = 0.00 < 0.05)
Performance has regressed.
cellular2d/sse41 2d time: [8.1566 ms 8.1836 ms 8.2058 ms]
change: [+10.241% +10.956% +11.597%] (p = 0.00 < 0.05)
Performance has regressed.
cellular2d/avx2 2d time: [4.5678 ms 4.6034 ms 4.6396 ms]
change: [+15.234% +16.310% +17.391%] (p = 0.00 < 0.05)
Performance has regressed.
cellular3d/scalar 3d time: [1.6968 s 1.7002 s 1.7043 s]
change: [-0.2821% -0.0706% +0.1721%] (p = 0.58 > 0.05)
No change in performance detected.
cellular3d/sse2 3d time: [58.865 ms 58.965 ms 59.092 ms]
change: [+4.6021% +5.0566% +5.5566%] (p = 0.00 < 0.05)
Performance has regressed.
cellular3d/sse41 3d time: [44.762 ms 44.824 ms 44.932 ms]
change: [+9.2326% +9.9355% +10.525%] (p = 0.00 < 0.05)
Performance has regressed.
cellular3d/avx2 3d time: [21.976 ms 22.017 ms 22.072 ms]
change: [+4.5154% +5.0129% +5.4665%] (p = 0.00 < 0.05)
Performance has regressed.
Clearly, the inlining improves performance. There is no doubt about that. But I am concerned about a few things:
fbm1d/sse41 1d
with a +41.36% regression are dubious. 5.3539 us
is fast enough. The benchmark is 1.5639 us
slower, but this is meaningless.fbm2d/sse41 2d
, which became 20 ms
slower.20 ms
slower is probably worth actually being able to use this crate.IMHO, delete all of the #[inline(always)]
annotations (or at least audit these). Mostly just let LLVM inline functions when it knows it can.
I am pretty confident this is also the cause of #25, based on the stack trace provided in that thread.
Any attempt to generate 3D noise with AVX2 (which is selected by default, because my CPU supports AVX2) ends up in a stack overflow, no matter the number of noise values I try to generate.
The most minimalistic setup for reproducing the issue would be
on an AVX2-enabled device, or if unsupported, the problem also occurs when calling the noise function directly:
I tested several types of 3D noise, both with and without offsets and with varying dimensions, but all end up in a stack overflow - however, I couldn't find the actual source of the error despite skimming the code (looked at
get_3d_noise_helper!
especially). The issue only occurs when you actually request to generate the noise, so the noise builder I think can be ruled out :thinking:Scariest of all though - neither LLDB nor GDB support MSVC debug symbols, so I compiled the same code with the
x86_64-pc-windows-gnu
toolchain, and there the same code does work! Kind of unfortunate as that means I seem to have no way to debug the source.Environment: Windows 10 Home, version 20H2 (OS Build 19042.1165) Intel(R) Core(TM) i5-4460 CPU@ 3.20GHz rustc 1.56.0-nightly (2d2bc94c8 2021-08-15)
Cargo.toml:
(the rev specified is the current latest commit)
Tested a commit from before the update of simdeez to 1.0.7, same result, so doesn't seem like updating simdeez was the cause. Tested current crates.io release (3.1.6) and did not have the problem. Something after that then!