lamikr / rocm_sdk_builder


amd-fftw: problems with linking and performance #82

Closed jeroen-mostert closed 3 days ago

jeroen-mostert commented 6 days ago

Follow-up from #74 with a narrower focus.

On my system (Manjaro, latest rolling) building amd-fftw results in a broken libfftw3.so library: attempting to dlopen it causes a segfault. This can be fixed by changing the build so that the --enable-dynamic-dispatcher option is not passed, but while this produces a working library, it results in what appears to be a badly or completely unoptimized code path. For comparison, the results of a benchmark of this build on a Ryzen 5 7600:
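
For reference, the crash can be reproduced with nothing more than a small loader check along these lines (a sketch; the library path is assumed from the default SDK prefix and may differ on your system):

```c
/* Minimal dlopen check. With the broken build the process segfaults
 * inside dlopen() itself instead of reaching the error path. */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* RTLD_NOW resolves all symbols immediately at load time;
     * RTLD_LAZY would defer resolution until first use. */
    void *handle = dlopen("/opt/rocm_sdk_612/lib64/libfftw3.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    puts("library loaded fine");
    dlclose(handle);
    return 0;
}
```

Build it with something like `gcc check_load.c -o check_load -ldl` and point it at the freshly built library.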

> builddir/020_02_amd_fftw_double_precision/tests/bench -I -opatient 64 128 256 512 1024 2048 4096

(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.2.0 Build 20240622")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -mavx2 -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128 -mfma")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 41.08 ms, time: 416.05 ns, ``mflops'': 4614.8727
Problem: 128, setup: 99.77 ms, time: 1.15 us, ``mflops'': 3889.3769
Problem: 256, setup: 217.85 ms, time: 2.60 us, ``mflops'': 3942.3856
Problem: 512, setup: 427.52 ms, time: 5.58 us, ``mflops'': 4125.3646
Problem: 1024, setup: 874.12 ms, time: 12.23 us, ``mflops'': 4185.2638
Problem: 2048, setup: 1.98 s, time: 26.82 us, ``mflops'': 4199.1903
Problem: 4096, setup: 4.71 s, time: 58.95 us, ``mflops'': 4168.7358

Now compare this to a benchmark of vanilla fftw-3.3.10 built from source with GCC 14.1.1 with configure --enable-sse2 --enable-avx --enable-avx2 --enable-avx512 --enable-openmp --enable-shared:

(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(cc "gcc -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 39.41 ms, time: 36.59 ns, ``mflops'': 52475.262
Problem: 128, setup: 98.67 ms, time: 68.36 ns, ``mflops'': 65536
Problem: 256, setup: 209.93 ms, time: 144.46 ns, ``mflops'': 70883.405
Problem: 512, setup: 371.50 ms, time: 337.13 ns, ``mflops'': 68342.058
Problem: 1024, setup: 585.39 ms, time: 802.37 ns, ``mflops'': 63811.106
Problem: 2048, setup: 876.51 ms, time: 2.01 us, ``mflops'': 56134.985
Problem: 4096, setup: 1.34 s, time: 5.14 us, ``mflops'': 47839.224

These are both single CPU tests. It's obvious that the supposedly optimized AMD build is unfortunately nothing of the sort.

That raises two questions:

  1. Is this only due to omitting --enable-dynamic-dispatcher, or is it a more basic issue? Since enabling it gives me a library that won't load at all, I can't test this.
  2. How much faster is the AMD-optimized version of fftw really supposed to be? Can someone verify? It would be good to know whether it's worth the hassle of getting it to build, as opposed to just using the system fftw. If it doesn't consistently give big wins, it's at least worth considering making it optional to avoid any issues.
lamikr commented 6 days ago

As commented on issue #74, I was finally able to reproduce the issue on Ubuntu 24.04, and your finding fixed it there. I also tried changing the ulimit settings, but that at least did not help.

One possibility is that the root cause is some kind of alignment bug that happens when compiling the binaries on certain platforms. I suggest that we resolve issue #74 by removing the dynamic-dispatcher option, but keep this bug open.

jeroen-mostert commented 6 days ago

We can do that as a stopgap, sure. My thinking was that if we can't show that amd-fftw is bringing any actual advantage, or (as I've observed) is actively making things worse, then disabling it entirely would bypass all of this without the need for further analysis.

jeroen-mostert commented 6 days ago

Of note is that AMD itself only mentions EPYC as a target for this. It's unclear how much benefit consumer hardware gets (it's the same micro-arch, so you'd expect at least some benefit, but still, if it optimizes for the cache sizes on EPYC this could be a detriment). Also, this specifically targets CPU optimizations, nothing to do with the GPU, so I'd assume it is not relevant to anyone using an Intel CPU + AMD GPU combo (minority or not). All in all, I think this library has something to prove if it wants to be part of an SDK.

I've rebuilt the source AMD supplies from scratch (outside ROCm) with GCC, clang and hipcc, and while this eliminates the optimization problem, the resulting code is at best on par with the vanilla code, at least on my Ryzen 5 7600 in double precision. That said, I'm no expert in optimizing numerics, so it might be that there are modes that do get huge gains -- but if so I don't know how to bench them.

jeroen-mostert commented 5 days ago

The reason the library ends up unoptimized when building in ROCm is simply that we set CFLAGS to add includes; unfortunately, the way the fftw build works, these CFLAGS end up overwriting the project's own CFLAGS rather than being appended to them (the AMD-specific optimizations applied with --enable-amd-opt do get appended, but since the crucial -O3 is missing that doesn't help much).

We should double-check that all the other projects that use autoconf (including quite fundamental ones like OpenMPI) are not similarly affected and devise some sort of general fix, if possible. Appending things like -O3 and -mtune=arch to the CFLAGS for amd-fftw specifically works, of course, but it means the project doesn't get to define its own flags, which seems undesirable. Nor can we assume that such heavy optimizations are safe for every project. If we're only using CFLAGS to pass includes, then a general fix is to make sure that for configure-based projects we pass those via --includedir, as that's what it's there for.

Unfortunately, building the library with optimization, while reducing its size a bit, does not fix the segfault-on-dynamic-dispatch issue. Would have been nice. :P

jeroen-mostert commented 5 days ago

In other news, I've found the root cause of the segfault: the AMD version of the lib uses ifuncs (recently of xz backdoor fame :P) to switch between AVX implementations at runtime (but only for two functions, oddly enough -- some very specific optimization going on there, or maybe experimental stuff). Unfortunately, the way they use them is incompatible with loading the library with RTLD_NOW (which can happen in any number of ways, and may in fact be the result of hardening patches after the xz debacle). Immediate binding forces the ifunc resolvers to run at load time, and the ifuncs in question use external symbols that have not been resolved at that point (which causes the segfault), as well as static variables that haven't had a chance to be initialized (which won't fault, but means the optimization never applies).

This only happens if the library is loaded with immediate binding; if it is loaded lazily, no segfault occurs. However, ifuncs really ought to be written so that they're safe either way, and ideally not used at all, for these and other reasons. My current "fix" consists of crudely making static copies of the have_simd_avx and have_simd_avx512 functions, which eliminates the segfault and confirms the diagnosis. I will cogitate some more on ideas for proper fixes, which will likely need to go upstream. For the ROCm SDK specifically, we should be good with the dynamic dispatcher fix for now: the worst that can happen is a slight perf regression, and since nobody has noticed up to now that we're actually running without any optimizations at all, I think we're good on that front. :P
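
To illustrate the failure mode, here is a generic sketch of the pattern (not AMD's actual code; `query_cpu_feature` and the kernel names are invented for illustration):

```c
#include <stddef.h>

/* Function living in another shared object. Under RTLD_NOW the resolver
 * below runs while the library is still being relocated, so this symbol
 * may not be bound yet -> calling it segfaults. */
extern int query_cpu_feature(const char *name);

static void fft_kernel_avx512(double *x, size_t n)  { (void)x; (void)n; /* ... */ }
static void fft_kernel_generic(double *x, size_t n) { (void)x; (void)n; /* ... */ }

/* ifunc resolver: the dynamic linker calls this once to decide which
 * implementation the public symbol fft_kernel should point to. A safe
 * resolver only relies on things local to its own object, e.g.
 * __builtin_cpu_supports("avx512f"), never on unresolved external calls. */
static void (*resolve_fft_kernel(void))(double *, size_t)
{
    return query_cpu_feature("avx512f") ? fft_kernel_avx512
                                        : fft_kernel_generic;
}

/* The public symbol is declared without a body and bound to the resolver. */
void fft_kernel(double *x, size_t n)
    __attribute__((ifunc("resolve_fft_kernel")));
```

With lazy binding the resolver happens to run late enough that everything is already bound; with RTLD_NOW (or LD_BIND_NOW=1) it runs during relocation processing, which is exactly the crash described above.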

lamikr commented 5 days ago

I tried using clang instead of gcc (CC=clang CXX=clang++), and then I got linking errors related to the same ifuncs. It seems that clang optimized away the static functions referenced by the ifuncs, which causes linker failures that do not occur when building with gcc.

jeroen-mostert commented 5 days ago

The clang errors are a separate thing: clang apparently generates implicit ifuncs as a result of __attribute__((target_clones(..))). This goes right almost everywhere except where it doesn't, for unclear reasons. The implicit ifuncs the compiler defines should always be safe, in that they'll never call functions they're not supposed to call, and they should never be optimized out, since that would defeat the whole purpose. This may be a clang bug. I wonder how often AMD builds their own stuff with it. :P
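
For context, the construct involved looks something like this (a generic sketch, not the amd-fftw source):

```c
#include <stddef.h>

/* The compiler clones this function once per listed target and emits a
 * hidden ifunc resolver (keyed on CPU feature detection) behind the
 * single exported symbol "scale". No hand-written resolver is involved,
 * so there is nothing the library author can get wrong here. */
__attribute__((target_clones("avx512f", "avx2", "default")))
void scale(double *x, size_t n, double a)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
}
```

If the compiler then drops one of the clones during optimization, the generated resolver ends up referencing a symbol that no longer exists, and the link fails in the way described above.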

jeroen-mostert commented 4 days ago

Quick thought while I was struggling with this beast: I propose we simply build the library with gcc and CFLAGS="-O3 -fomit-frame-pointer -march=native -malign-double -fstrict-aliasing -fno-schedule-insns" (rather than the default -mtune=native) and no --enable-dynamic-dispatcher. This avoids all the dynamic runtime lookup nonsense (we can leave fixing that to upstream, for those who care) and builds a library that's simply optimized for the running system, which is what we're doing anyway (at least for the GPU), unless I'm missing some use case. In my tests this, at long last, actually succeeds at producing a library that's faster than just building the native fftw (albeit marginally, no more than 5% in the best case for the default benchmark). The only remaining issue is that the library will still not be optimized for Intel users, obviously. But then again, hopefully, hipFFT/rocFFT is going to be used for GPU offloading where possible anyway. Thoughts?

lamikr commented 4 days ago

I agree with you that the current patch in master dropping --enable-dynamic-dispatcher is fine, as with it the system works on all Linux distros. The addition of "-O3" is also important. On Mageia I was able to benchmark all the combinations as well as vanilla fftw-3.3.10, and it's hard to see the benefit of the dynamic dispatcher, at least on my 6900HS laptop. (It would be interesting to run these tests on EPYC.)

But I think -mtune=native is fine and no -march=native is needed; the results between those two seemed quite the same for me. (Can you double check?) Some people may want to run the system on a different machine than where it's built, and that may cause more incompatibilities.

Anyway, here are the results from one run of each configuration. I just ran the tests without reboots or temperature control between runs. When running multiple times the results can vary by a few percent, so these are more of a guideline than 100% scientific.

1) vanilla fftw-3.3.10

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(cc "gcc -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 43.91 ms, time: 44.86 ns, ``mflops'': 42799.02
Problem: 128, setup: 108.97 ms, time: 89.56 ns, ``mflops'': 50021.515
Problem: 256, setup: 231.00 ms, time: 190.40 ns, ``mflops'': 53781.747
Problem: 512, setup: 405.67 ms, time: 476.59 ns, ``mflops'': 48343.134
Problem: 1024, setup: 634.14 ms, time: 1.01 us, ``mflops'': 50585.588
Problem: 2048, setup: 957.68 ms, time: 2.65 us, ``mflops'': 42558.199
Problem: 4096, setup: 1.47 s, time: 6.52 us, ``mflops'': 37667.75

2) amd-fftw with enable-dynamic-dispatcher and without O3

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.1.0 Build 20240624")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 46.64 ms, time: 457.12 ns, ``mflops'': 4200.1843
Problem: 128, setup: 110.45 ms, time: 1.32 us, ``mflops'': 3390.0018
Problem: 256, setup: 235.03 ms, time: 2.91 us, ``mflops'': 3516.3514
Problem: 512, setup: 465.78 ms, time: 6.56 us, ``mflops'': 3514.5181
Problem: 1024, setup: 943.84 ms, time: 14.37 us, ``mflops'': 3563.6759
Problem: 2048, setup: 2.12 s, time: 31.82 us, ``mflops'': 3540.3118
Problem: 4096, setup: 5.06 s, time: 69.58 us, ``mflops'': 3531.9463

3) amd-fftw with enable-dynamic-dispatcher and with O3 (CFLAGS="$CFLAGS: -O3" added to top of configure file)

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.1.0 Build 20240624")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include: -O3 -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 43.91 ms, time: 41.45 ns, ``mflops'': 46324.572
Problem: 128, setup: 110.08 ms, time: 86.20 ns, ``mflops'': 51969.427
Problem: 256, setup: 235.29 ms, time: 185.65 ns, ``mflops'': 55156.459
Problem: 512, setup: 415.35 ms, time: 408.81 ns, ``mflops'': 56358.22
Problem: 1024, setup: 659.19 ms, time: 997.31 ns, ``mflops'': 51337.87
Problem: 2048, setup: 995.22 ms, time: 2.57 us, ``mflops'': 43827.628
Problem: 4096, setup: 1.55 s, time: 6.82 us, ``mflops'': 36018.068

4) amd-fftw without enable-dynamic-dispatcher and with O3 (CFLAGS="$CFLAGS: -O3" added to top of configure file) (configure --enable-sse2 --enable-avx --enable-avx2 --enable-avx512 --enable-mpi --enable-openmp --enable-shared --enable-amd-opt --enable-amd-mpifft --prefix=${INSTALL_DIR_PREFIX_SDK_ROOT} --libdir=${INSTALL_DIR_PREFIX_SDK_ROOT}/lib64)

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.1.0 Build 20240624")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include: -O3 -mavx2 -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128 -mfma")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 44.34 ms, time: 41.95 ns, ``mflops'': 45764.364
Problem: 128, setup: 110.85 ms, time: 86.59 ns, ``mflops'': 51740.467
Problem: 256, setup: 236.18 ms, time: 186.72 ns, ``mflops'': 54840.945
Problem: 512, setup: 417.41 ms, time: 413.21 ns, ``mflops'': 55758.842
Problem: 1024, setup: 663.19 ms, time: 1.01 us, ``mflops'': 50597.792
Problem: 2048, setup: 1.01 s, time: 2.67 us, ``mflops'': 42176.93
Problem: 4096, setup: 1.56 s, time: 6.65 us, ``mflops'': 36937.948

5) amd-fftw without enable-dynamic-dispatcher and with -O3 and -march=native (on top of the configure changes above: CFLAGS="$CFLAGS: -O3" AMD_ARCH=native)

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.1.0 Build 20240624")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include: -O3 -march=native -mavx2 -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128 -mfma")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 44.52 ms, time: 42.41 ns, ``mflops'': 45274.488
Problem: 128, setup: 110.73 ms, time: 86.25 ns, ``mflops'': 51941.845
Problem: 256, setup: 236.84 ms, time: 190.29 ns, ``mflops'': 53811.935
Problem: 512, setup: 420.15 ms, time: 415.28 ns, ``mflops'': 55480.212
Problem: 1024, setup: 665.51 ms, time: 1.02 us, ``mflops'': 50375.979
Problem: 2048, setup: 1.01 s, time: 2.58 us, ``mflops'': 43653.462
Problem: 4096, setup: 1.56 s, time: 6.54 us, ``mflops'': 37594.598
lamikr commented 4 days ago

If you have some dirty patches available for getting amd-fftw working on your machine with enable-dynamic-dispatcher, those could be worth saving.

jeroen-mostert commented 4 days ago

> But I think -mtune=native is fine and no -march=native is needed; the results between those two seemed quite the same for me. (Can you double check?) Some people may want to run the system on a different machine than where it's built, and that may cause more incompatibilities.

Yeah, I was just about to say I forgot that building on a VM is a semi-common use case where -march=native would not work. Too bad. :P

For perf comparisons I took the average of 10 runs because, as you've observed, the results wobble a bit and they're all within spitting distance of each other. I don't know if the native bench of fftw is necessarily representative of how real applications are going to be using it.

I'll try again on my machine (including some bigger sizes) to see how much difference the additional flags make, if any. There is no downside to keeping general options like -fomit-frame-pointer and -fstrict-aliasing, and I trust the fftw engineers' choice of them. -mtune=native is safe but might misfire (in optimization terms) for VM builds.

> If you have some dirty patches available for getting amd-fftw working on your machine with enable-dynamic-dispatcher, those could be worth saving.

I have a really dirty patch, and I think I can upgrade it to a not-so-dirty patch (in terms of code duplication) that would still keep AMD's ifuncs around. I would feel more comfortable with a bigger test suite that exercises actual scenarios and different architectures (especially downlevel ones with smaller instruction sets), but there are only so many hours in the day and this thing has taken up plenty already. I at least want to verify that we can see the difference between (say) AVX512 and non-AVX512, as we ought to be able to.

lamikr commented 4 days ago

I agree, the addition of the missing -O3 is a huge improvement in any case and we should apply it now. I did it just by patching the configure file; I do not know if you have a better way. Anyway, you noticed it, so can you make the patch?

I think one good thing could be to start a test/benchmark project that collects all kinds of existing tests for a fast check. Some of the tests I have run can take hours, so I'm looking for something faster that's easier to run often.

Sometimes I test with the GPU benchmark and with some other image-learning algorithms like ViT. Brian Pulfer has quite an interesting repo with all kinds of algorithms, about which he has written blog posts like this one:

https://medium.com/@brianpulfer/vision-transformers-from-scratch-pytorch-a-step-by-step-guide-96c3313c2e0c

Mika

jeroen-mostert commented 4 days ago

I think a better way is to change the .binfo to concatenate and export the new CFLAGS; this also needs a little fix to build.sh to use eval there. The best way would be to fix the fftw configure so that it accepts additional flags instead of replacing its own, but that would require regenerating the autoconf stuff and nobody wants to go there. :P I will cook up a patch when I'm done; still running benchmarks for the fftw configs.

For benchmarking, the main problem for me would be identifying scenarios where fftw (i.e. CPU-based FFT) actually gets stressed in a task that people fire up their notebooks for, so not the synthetic benchmark that comes with fftw. I'm basically a noob in this area; I'm still dickering around with ready-made docker images for my LLMs. :P

I agree that some sort of general benchmark framework would be handy. At the very least something that can show order-of-magnitude differences between vanilla builds and ROCm builds, so we can catch oopsies like this one where all optimization is actually missing. However, this would also need some way to check the vanilla builds, and since we don't, well, build those, it's a bit tricky. For some of the most common stuff we could probably use a popular ready-to-run container; it needn't support anything but CPU for sanity checks.

jeroen-mostert commented 4 days ago

Behold, the benchmark results none of us asked for, but here they are anyway. :P Notes:

| flavor | config | performance (relative) |
| --- | --- | --- |
| vanilla | gcc_O3_avx512 | 3.14 |
| amd | gcc_O3_native_avx512 | 3.13 |
| amd | gcc_O3_native_avx512_amdopts_dyn | 3.12 |
| amd | gcc_O3_avx512 | 3.11 |
| vanilla | gcc_O3_native_avx512_amdopts | 3.1 |
| amd | gcc_O3_avx512_amdopts | 3.1 |
| vanilla | gcc_O3_native_avx512 | 3.08 |
| amd | gcc_O3_avx512_amdopts_dyn | 3.05 |
| amd | gcc_O3_native_avx512_amdopts | 3.05 |
| vanilla | gcc_O3_avx512_amdopts | 2.96 |
| amd | gcc_O3_native_avx2_amdopts | 2.92 |
| amd | gcc_O3_native_avx2_amdopts_dyn | 2.89 |
| amd | gcc_O3_native_avx2 | 2.87 |
| amd | gcc_O3_avx2_amdopts_dyn | 2.86 |
| amd | gcc_O3_avx2 | 2.83 |
| amd | gcc_O3_avx2_amdopts | 2.82 |
| vanilla | gcc_O3_native_avx2 | 2.78 |
| vanilla | gcc_O3_native_avx2_amdopts | 2.68 |
| vanilla | gcc_O3_avx2_amdopts | 2.63 |
| vanilla | gcc_O3_avx2 | 2.59 |
| amd | gcc_O3_native | 1.22 |
| vanilla | gcc_O3_native | 1.21 |
| vanilla | gcc_O3 | 1.01 |
| amd | gcc_O3 | 1 |

The conclusions are:

I will provide a pull request tomorrow, if nothing intervenes, to add -O3 and to upgrade to 4.2 while we're at it (we are currently using 4.1). As this involves a (hopefully innocuous) change to the main build script, and I've been messing with other stuff, I want to retest the full build first.