lamikr / rocm_sdk_builder


amd-fftw: problems with linking and performance #82

Closed jeroen-mostert closed 3 days ago

jeroen-mostert commented 6 days ago

Follow-up from #74 with a narrower focus.

On my system (Manjaro, latest rolling) building amd-fftw results in a broken libfftw3.so library: attempting to dlopen it causes a segfault. This can be fixed by changing the build so that the --enable-dynamic-dispatcher option is not passed, but while this produces a working library, it results in what appears to be a badly or completely unoptimized code path. For comparison, the results of a benchmark of this build on a Ryzen 5 7600:
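
For reference, the crash can be reproduced with nothing more than a small loader check along these lines (a sketch; the library path is assumed from the default SDK prefix and may differ on your system):

```c
/* Minimal dlopen check. With the broken build the process segfaults
 * inside dlopen() itself instead of reaching the error path. */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* RTLD_NOW resolves all symbols immediately at load time;
     * RTLD_LAZY would defer resolution until first use. */
    void *handle = dlopen("/opt/rocm_sdk_612/lib64/libfftw3.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    puts("library loaded fine");
    dlclose(handle);
    return 0;
}
```

Build it with something like `gcc check_load.c -o check_load -ldl` and point it at the freshly built library.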

> builddir/020_02_amd_fftw_double_precision/tests/bench -I -opatient 64 128 256 512 1024 2048 4096

(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.2.0 Build 20240622")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -mavx2 -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128 -mfma")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 41.08 ms, time: 416.05 ns, ``mflops'': 4614.8727
Problem: 128, setup: 99.77 ms, time: 1.15 us, ``mflops'': 3889.3769
Problem: 256, setup: 217.85 ms, time: 2.60 us, ``mflops'': 3942.3856
Problem: 512, setup: 427.52 ms, time: 5.58 us, ``mflops'': 4125.3646
Problem: 1024, setup: 874.12 ms, time: 12.23 us, ``mflops'': 4185.2638
Problem: 2048, setup: 1.98 s, time: 26.82 us, ``mflops'': 4199.1903
Problem: 4096, setup: 4.71 s, time: 58.95 us, ``mflops'': 4168.7358

Now compare this to a benchmark of vanilla fftw-3.3.10 built from source with GCC 14.1.1 with configure --enable-sse2 --enable-avx --enable-avx2 --enable-avx512 --enable-openmp --enable-shared:

(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(cc "gcc -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 39.41 ms, time: 36.59 ns, ``mflops'': 52475.262
Problem: 128, setup: 98.67 ms, time: 68.36 ns, ``mflops'': 65536
Problem: 256, setup: 209.93 ms, time: 144.46 ns, ``mflops'': 70883.405
Problem: 512, setup: 371.50 ms, time: 337.13 ns, ``mflops'': 68342.058
Problem: 1024, setup: 585.39 ms, time: 802.37 ns, ``mflops'': 63811.106
Problem: 2048, setup: 876.51 ms, time: 2.01 us, ``mflops'': 56134.985
Problem: 4096, setup: 1.34 s, time: 5.14 us, ``mflops'': 47839.224

These are both single CPU tests. It's obvious that the supposedly optimized AMD build is unfortunately nothing of the sort.

That raises two questions:

  1. Is this only due to omitting --enable-dynamic-dispatcher, or is it a more basic issue? Since enabling it gives me a library that won't load at all, I can't test this.
  2. How much faster is the AMD-optimized version of fftw really supposed to be? Can someone verify? It would be good to know whether it's worth the hassle of getting it to build, as opposed to just using the system fftw. If it doesn't consistently give big wins, it's at least worth considering making it optional to avoid any issues.
lamikr commented 6 days ago

As commented on issue #74, I was finally able to reproduce the issue on Ubuntu 24.04, and your finding fixed it there. I also tried changing the ulimit settings, but that at least did not help.

One possibility is that the root cause is some kind of alignment bug that happens when compiling the binaries on certain platforms. I suggest that we resolve issue #74 by removing the dynamic-dispatcher option, but keep this bug open.

jeroen-mostert commented 6 days ago

We can do that as a stopgap, sure. My thinking was that if we can't show that amd-fftw is bringing any actual advantage, or (as I've observed) is actively making things worse, then disabling it entirely would bypass all of this without the need for further analysis.

jeroen-mostert commented 6 days ago

Of note is that AMD itself only mentions EPYC as a target for this. It's unclear how much benefit consumer hardware gets (it's the same micro-arch, so you'd expect at least some benefit, but still, if it optimizes for the cache sizes on EPYC this could be a detriment). Also, this specifically targets CPU optimizations, nothing to do with the GPU, so I'd assume it is not relevant to anyone using an Intel CPU + AMD GPU combo (minority or not). All in all, I think this library has something to prove if it wants to be part of an SDK.

I've rebuilt the source AMD supplies from scratch (outside ROCm) with GCC, clang and hipcc, and while this eliminates the optimization problem, the resulting code is at best on par with the vanilla code, at least on my Ryzen 5 7600 in double precision. That said, I'm no expert in optimizing numerics, so it might be that there are modes that do get huge gains -- but if so I don't know how to bench them.

jeroen-mostert commented 5 days ago

The reason the library ends up unoptimized when building in ROCm is simply that we set CFLAGS to add includes; unfortunately, the way the fftw build works, these CFLAGS end up overwriting the project's own CFLAGS rather than being appended to them (the AMD-specific optimizations applied with --enable-amd-opt do get appended, but since the crucial -O3 is missing that doesn't help much).

We should double-check that all the other projects that use autoconf (including quite fundamental ones like OpenMPI) are not similarly affected and devise some sort of general fix, if possible. Appending things like -O3 and -mtune=arch to the CFLAGS for amd-fftw specifically works, of course, but it means the project doesn't get to define its own flags, which seems undesirable. Nor can we assume that such heavy optimizations are safe for every project. If we're only using CFLAGS to pass includes, then a general fix is to make sure that for configure-based projects we pass those via --includedir, as that's what it's there for.

Unfortunately, building the library with optimization, while reducing its size a bit, does not fix the segfault-on-dynamic-dispatch issue. Would have been nice. :P

jeroen-mostert commented 5 days ago

In other news, I've found the root cause of the segfault: the AMD version of the lib uses ifuncs (recently of xz backdoor fame :P) to switch between AVX implementations at runtime (but only for two functions, oddly enough -- some very specific optimization going on there, or maybe experimental stuff). Unfortunately, the way they use them is incompatible with loading the library with RTLD_NOW (which can happen in any number of ways, and may in fact be the result of hardening patches after the xz debacle). Immediate binding forces the ifunc resolvers to run at load time, and the ifuncs in question use external symbols that have not been resolved at that point (which causes the segfault), as well as static variables that haven't had a chance to be initialized (which won't fault, but means the optimization never applies).

This only happens if the library is loaded with immediate binding; if it is loaded lazily, no segfault occurs. However, ifuncs really ought to be written so that they're safe either way, and ideally not used at all, for these and other reasons. My current "fix" consists of crudely making static copies of the have_simd_avx and have_simd_avx512 functions, which eliminates the segfault and confirms the diagnosis. I will cogitate some more on ideas for proper fixes, which will likely need to go upstream. For the ROCm SDK specifically, we should be good with the dynamic dispatcher fix for now: the worst that can happen is a slight perf regression, and since nobody has noticed up to now that we're actually running without any optimizations at all, I think we're good on that front. :P
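
To illustrate the failure mode, here is a generic sketch of the pattern (not AMD's actual code; `query_cpu_feature` and the kernel names are invented for illustration):

```c
#include <stddef.h>

/* Function living in another shared object. Under RTLD_NOW the resolver
 * below runs while the library is still being relocated, so this symbol
 * may not be bound yet -> calling it segfaults. */
extern int query_cpu_feature(const char *name);

static void fft_kernel_avx512(double *x, size_t n)  { (void)x; (void)n; /* ... */ }
static void fft_kernel_generic(double *x, size_t n) { (void)x; (void)n; /* ... */ }

/* ifunc resolver: the dynamic linker calls this once to decide which
 * implementation the public symbol fft_kernel should point to. A safe
 * resolver only relies on things local to its own object, e.g.
 * __builtin_cpu_supports("avx512f"), never on unresolved external calls. */
static void (*resolve_fft_kernel(void))(double *, size_t)
{
    return query_cpu_feature("avx512f") ? fft_kernel_avx512
                                        : fft_kernel_generic;
}

/* The public symbol is declared without a body and bound to the resolver. */
void fft_kernel(double *x, size_t n)
    __attribute__((ifunc("resolve_fft_kernel")));
```

With lazy binding the resolver happens to run late enough that everything is already bound; with RTLD_NOW (or LD_BIND_NOW=1) it runs during relocation processing, which is exactly the crash described above.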

lamikr commented 5 days ago

I tried using clang instead of gcc (CC=clang CXX=clang++), and then I got linking errors related to the same ifuncs. It seems that clang optimized away the static functions referenced by the ifuncs, which causes linker failures that do not occur when building with gcc.

jeroen-mostert commented 5 days ago

The clang errors are a separate thing: clang apparently generates implicit ifuncs as a result of __attribute__((target_clones(..))). This goes right almost everywhere except where it doesn't, for unclear reasons. The implicit ifuncs the compiler defines should always be safe, in that they'll never call functions they're not supposed to call, and they should never be optimized out, since that would defeat the whole purpose. This may be a clang bug. I wonder how often AMD builds their own stuff with it. :P
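
For context, the construct involved looks something like this (a generic sketch, not the amd-fftw source):

```c
#include <stddef.h>

/* The compiler clones this function once per listed target and emits a
 * hidden ifunc resolver (keyed on CPU feature detection) behind the
 * single exported symbol "scale". No hand-written resolver is involved,
 * so there is nothing the library author can get wrong here. */
__attribute__((target_clones("avx512f", "avx2", "default")))
void scale(double *x, size_t n, double a)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
}
```

If the compiler then drops one of the clones during optimization, the generated resolver ends up referencing a symbol that no longer exists, and the link fails in the way described above.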

jeroen-mostert commented 4 days ago

Quick thought while I was struggling with this beast: I propose we simply build the library with gcc and CFLAGS="-O3 -fomit-frame-pointer -march=native -malign-double -fstrict-aliasing -fno-schedule-insns" (rather than the default -mtune=native) and no --enable-dynamic-dispatcher. This avoids all the dynamic runtime lookup nonsense (we can leave fixing that to upstream, for those who care) and builds a library that's simply optimized for the running system, which is what we're doing anyway (at least for the GPU), unless I'm missing some use case. In my tests this, at long last, actually succeeds at producing a library that's faster than just building the native fftw (albeit marginally, no more than 5% in the best case for the default benchmark). The only remaining issue is that the library will still not be optimized for Intel users, obviously. But then again, hopefully, hipFFT/rocFFT is going to be used for GPU offloading where possible anyway. Thoughts?

lamikr commented 4 days ago

I agree with you that the current patch in master dropping --enable-dynamic-dispatcher is fine, as with it the system works on all Linux distros. The addition of "-O3" is also important. On Mageia I was able to benchmark all the combinations as well as vanilla fftw-3.3.10, and it's hard to see the benefit of the dynamic dispatcher, at least on my 6900HS laptop. (It would be interesting to run these tests on EPYC.)

But I think -mtune=native is fine and no -march=native is needed; the results between those two seemed quite the same for me. (Can you double check?) Some people may want to run the system on a different machine than where it's built, and that may cause more incompatibilities.

Anyway, here are the results from one run of each configuration. I just ran the tests without reboots or temperature control between runs. When running multiple times the results can vary by a few percent, so these are more of a guideline than 100% scientific.

1) vanilla fftw-3.3.10

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(cc "gcc -O3 -fomit-frame-pointer -mtune=native -malign-double -fstrict-aliasing -fno-schedule-insns")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 43.91 ms, time: 44.86 ns, ``mflops'': 42799.02
Problem: 128, setup: 108.97 ms, time: 89.56 ns, ``mflops'': 50021.515
Problem: 256, setup: 231.00 ms, time: 190.40 ns, ``mflops'': 53781.747
Problem: 512, setup: 405.67 ms, time: 476.59 ns, ``mflops'': 48343.134
Problem: 1024, setup: 634.14 ms, time: 1.01 us, ``mflops'': 50585.588
Problem: 2048, setup: 957.68 ms, time: 2.65 us, ``mflops'': 42558.199
Problem: 4096, setup: 1.47 s, time: 6.52 us, ``mflops'': 37667.75

2) amd-fftw with enable-dynamic-dispatcher and without O3

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.1.0 Build 20240624")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 46.64 ms, time: 457.12 ns, ``mflops'': 4200.1843
Problem: 128, setup: 110.45 ms, time: 1.32 us, ``mflops'': 3390.0018
Problem: 256, setup: 235.03 ms, time: 2.91 us, ``mflops'': 3516.3514
Problem: 512, setup: 465.78 ms, time: 6.56 us, ``mflops'': 3514.5181
Problem: 1024, setup: 943.84 ms, time: 14.37 us, ``mflops'': 3563.6759
Problem: 2048, setup: 2.12 s, time: 31.82 us, ``mflops'': 3540.3118
Problem: 4096, setup: 5.06 s, time: 69.58 us, ``mflops'': 3531.9463

3) amd-fftw with enable-dynamic-dispatcher and with O3 (CFLAGS="$CFLAGS: -O3" added to top of configure file)

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.1.0 Build 20240624")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include: -O3 -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 43.91 ms, time: 41.45 ns, ``mflops'': 46324.572
Problem: 128, setup: 110.08 ms, time: 86.20 ns, ``mflops'': 51969.427
Problem: 256, setup: 235.29 ms, time: 185.65 ns, ``mflops'': 55156.459
Problem: 512, setup: 415.35 ms, time: 408.81 ns, ``mflops'': 56358.22
Problem: 1024, setup: 659.19 ms, time: 997.31 ns, ``mflops'': 51337.87
Problem: 2048, setup: 995.22 ms, time: 2.57 us, ``mflops'': 43827.628
Problem: 4096, setup: 1.55 s, time: 6.82 us, ``mflops'': 36018.068

4) amd-fftw without enable-dynamic-dispatcher and with O3 (CFLAGS="$CFLAGS: -O3" added to top of configure file) (configure --enable-sse2 --enable-avx --enable-avx2 --enable-avx512 --enable-mpi --enable-openmp --enable-shared --enable-amd-opt --enable-amd-mpifft --prefix=${INSTALL_DIR_PREFIX_SDK_ROOT} --libdir=${INSTALL_DIR_PREFIX_SDK_ROOT}/lib64)

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.1.0 Build 20240624")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include: -O3 -mavx2 -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128 -mfma")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 44.34 ms, time: 41.95 ns, ``mflops'': 45764.364
Problem: 128, setup: 110.85 ms, time: 86.59 ns, ``mflops'': 51740.467
Problem: 256, setup: 236.18 ms, time: 186.72 ns, ``mflops'': 54840.945
Problem: 512, setup: 417.41 ms, time: 413.21 ns, ``mflops'': 55758.842
Problem: 1024, setup: 663.19 ms, time: 1.01 us, ``mflops'': 50597.792
Problem: 2048, setup: 1.01 s, time: 2.67 us, ``mflops'': 42176.93
Problem: 4096, setup: 1.56 s, time: 6.65 us, ``mflops'': 36937.948

5) amd-fftw without enable-dynamic-dispatcher and with -O3 and -march=native (on top of the configure changes above: CFLAGS="$CFLAGS: -O3" AMD_ARCH=native)

./bench -I -opatient 64 128 256 512 1024 2048 4096
(name "fftw3")
(version "fftw-3.3.10-sse2-avx-avx2-avx2_128-avx512")
(aocl-version "AOCL-FFTW 4.1.0 Build 20240624")
(cc "gcc -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include: -O3 -march=native -mavx2 -mno-avx256-split-unaligned-store -mno-avx256-split-unaligned-load -mno-prefer-avx128 -mfma")
(codelet-optim "")
(benchmark-precision "double")
Problem: 64, setup: 44.52 ms, time: 42.41 ns, ``mflops'': 45274.488
Problem: 128, setup: 110.73 ms, time: 86.25 ns, ``mflops'': 51941.845
Problem: 256, setup: 236.84 ms, time: 190.29 ns, ``mflops'': 53811.935
Problem: 512, setup: 420.15 ms, time: 415.28 ns, ``mflops'': 55480.212
Problem: 1024, setup: 665.51 ms, time: 1.02 us, ``mflops'': 50375.979
Problem: 2048, setup: 1.01 s, time: 2.58 us, ``mflops'': 43653.462
Problem: 4096, setup: 1.56 s, time: 6.54 us, ``mflops'': 37594.598
lamikr commented 4 days ago

If you have some dirty patches available for getting amd-fftw working on your machine with enable-dynamic-dispatcher, those could be worth saving.

jeroen-mostert commented 4 days ago

> But I think -mtune=native is fine and no -march=native is needed; the results between those two seemed quite the same for me. (Can you double check?) Some people may want to run the system on a different machine than where it's built, and that may cause more incompatibilities.

Yeah, I was just about to say I forgot that building on a VM is a semi-common use case where -march=native would not work. Too bad. :P

For perf comparisons I took the average of 10 runs because, as you've observed, the results wobble a bit and they're all within spitting distance of each other. I don't know if the native bench of fftw is necessarily representative of how real applications are going to be using it.

I'll try again on my machine (including some bigger sizes) to see how much difference the additional flags make, if any. There is no downside to keeping general options like -fomit-frame-pointer and -fstrict-aliasing, and I trust the fftw engineers' choice of them. -mtune=native is safe but might misfire (in optimization terms) for VM builds.

> If you have some dirty patches available for getting amd-fftw working on your machine with enable-dynamic-dispatcher, those could be worth saving.

I have a really dirty patch, and I think I can upgrade it to a not-so-dirty patch (in terms of code duplication) that would still keep AMD's ifuncs around. I would feel more comfortable with a bigger test suite that exercises actual scenarios and different architectures (especially downlevel ones with smaller instruction sets), but there are only so many hours in the day and this thing has taken up plenty already. I at least want to verify that we can see the difference between (say) AVX512 and non-AVX512, as we ought to be able to.

lamikr commented 4 days ago

I agree, the addition of the missing -O3 is a huge improvement in any case and we should apply it now. I did it just by patching the configure file; I do not know if you have a better way. Anyway, you noticed it, so can you make the patch?

I think one good thing could be to start a test/benchmark project that collects all kinds of existing tests for a fast check. Some of the tests I have run can take hours, so I'm looking for something faster that's easier to run often.

Sometimes I test with the GPU benchmark and with some other image-learning algorithms like ViT. Brian Pulfer has quite an interesting repo with all kinds of algorithms, about which he has written blog posts like this one:

https://medium.com/@brianpulfer/vision-transformers-from-scratch-pytorch-a-step-by-step-guide-96c3313c2e0c

Mika

jeroen-mostert commented 4 days ago

I think a better way is to change the .binfo to concatenate and export the new CFLAGS; this also needs a little fix to build.sh to use eval there. The best way would be to fix the fftw configure so that it accepts additional flags instead of replacing its own, but that would require regenerating the autoconf stuff and nobody wants to go there. :P I will cook up a patch when I'm done; still running benchmarks for the fftw configs.

For benchmarking, the main problem for me would be identifying scenarios where fftw (i.e. CPU-based FFT) actually gets stressed in a task that people fire up their notebooks for, so not the synthetic benchmark that comes with fftw. I'm basically a noob in this area; I'm still dickering around with ready-made docker images for my LLMs. :P

I agree that some sort of general benchmark framework would be handy. At the very least something that can show order-of-magnitude differences between vanilla builds and ROCm builds, so we can catch oopsies like this one where all optimization is actually missing. However, this would also need some way to check the vanilla builds, and since we don't, well, build those, it's a bit tricky. For some of the most common stuff we could probably use a popular ready-to-run container; it needn't support anything but CPU for sanity checks.

jeroen-mostert commented 4 days ago

Behold, the benchmark results none of us asked for, but here they are anyway. :P Notes:

| flavor | config | performance (relative) |
| --- | --- | --- |
| vanilla | gcc_O3_avx512 | 3.14 |
| amd | gcc_O3_native_avx512 | 3.13 |
| amd | gcc_O3_native_avx512_amdopts_dyn | 3.12 |
| amd | gcc_O3_avx512 | 3.11 |
| vanilla | gcc_O3_native_avx512_amdopts | 3.1 |
| amd | gcc_O3_avx512_amdopts | 3.1 |
| vanilla | gcc_O3_native_avx512 | 3.08 |
| amd | gcc_O3_avx512_amdopts_dyn | 3.05 |
| amd | gcc_O3_native_avx512_amdopts | 3.05 |
| vanilla | gcc_O3_avx512_amdopts | 2.96 |
| amd | gcc_O3_native_avx2_amdopts | 2.92 |
| amd | gcc_O3_native_avx2_amdopts_dyn | 2.89 |
| amd | gcc_O3_native_avx2 | 2.87 |
| amd | gcc_O3_avx2_amdopts_dyn | 2.86 |
| amd | gcc_O3_avx2 | 2.83 |
| amd | gcc_O3_avx2_amdopts | 2.82 |
| vanilla | gcc_O3_native_avx2 | 2.78 |
| vanilla | gcc_O3_native_avx2_amdopts | 2.68 |
| vanilla | gcc_O3_avx2_amdopts | 2.63 |
| vanilla | gcc_O3_avx2 | 2.59 |
| amd | gcc_O3_native | 1.22 |
| vanilla | gcc_O3_native | 1.21 |
| vanilla | gcc_O3 | 1.01 |
| amd | gcc_O3 | 1 |

The conclusions are:

I will provide a pull request tomorrow, if nothing intervenes, to add -O3 and to upgrade to 4.2 while we're at it (we are currently using 4.1). As this involves a (hopefully innocuous) change to the main build script, and I've been messing with other stuff, I want to retest the full build first.