mreineck / ducc

Fork of https://gitlab.mpcdf.mpg.de/mtr/ducc to simplify external contributions
GNU General Public License v2.0

Ducc0 does not compile on MacOS using gcc #39

Open DiamonDinoia opened 2 months ago

DiamonDinoia commented 2 months ago

Dear @mreineck,

I do not know whether you plan to support ducc0 on macOS. It works with the LLVM toolchain there but breaks with GCC, and it seems to be a small issue. Could you have a look? We would like to support ducc in finufft for macOS GCC users.

[1/75] Building CXX object CMakeFiles/ducc0.dir/_deps/ducc0-src/src/ducc0/math/gl_integrator.cc.o
[2/75] Building CXX object CMakeFiles/ducc0.dir/_deps/ducc0-src/src/ducc0/infra/mav.cc.o
FAILED: CMakeFiles/ducc0.dir/_deps/ducc0-src/src/ducc0/infra/mav.cc.o 
/usr/local/bin/g++-14  -I/Users/runner/work/finufft/finufft/build/_deps/ducc0-src/src -O3 -DNDEBUG -isysroot /Applications/Xcode_15.2.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.2.sdk -march=native -funroll-loops -ffp-contract=fast -fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math -fmerge-all-constants -ftree-vectorize -fimplicit-constexpr -fcx-limited-range -O3 -MD -MT CMakeFiles/ducc0.dir/_deps/ducc0-src/src/ducc0/infra/mav.cc.o -MF CMakeFiles/ducc0.dir/_deps/ducc0-src/src/ducc0/infra/mav.cc.o.d -o CMakeFiles/ducc0.dir/_deps/ducc0-src/src/ducc0/infra/mav.cc.o -c /Users/runner/work/finufft/finufft/build/_deps/ducc0-src/src/ducc0/infra/mav.cc
/Users/runner/work/finufft/finufft/build/_deps/ducc0-src/src/ducc0/infra/mav.cc: In function 'void ducc0::detail_mav::opt_shp_str(fmav_info::shape_t&, std::vector<std::vector<long int> >&)':
Error: /Users/runner/work/finufft/finufft/build/_deps/ducc0-src/src/ducc0/infra/mav.cc:74:25: error: 'min_element' was not declared in this scope
   74 |       auto dim = size_t(min_element(strcrit.begin(),strcrit.begin()+lastdim)
      |                         ^~~~~~~~~~~
[3/75] Building CXX object CMakeFiles/ducc0.dir/_deps/ducc0-src/src/ducc0/math/gridding_kernel.cc.o
[4/75] Building CXX object CMakeFiles/ducc0.dir/_deps/ducc0-src/src/ducc0/infra/string_utils.cc.o
[5/75] Building CXX object CMakeFiles/ducc0.dir/_deps/ducc0-src/src/ducc0/infra/threading.cc.o
[6/75] Building CXX object CMakeFiles/finufft_f32.dir/src/finufft.cpp.o
[7/75] Building CXX object CMakeFiles/finufft_f32.dir/src/fft.cpp.o
ninja: build stopped: subcommand failed.
Error: Process completed with exit code 1.

Please see: https://github.com/flatironinstitute/finufft/actions/runs/10167458503/job/28119953764

mreineck commented 2 months ago

I must admit that I don't understand what is going wrong here. min_element is defined in std in <algorithm>, and it should be visible at this location. Do you have any insights on this?

mreineck commented 2 months ago

Hmm ... I don't see any explicit flag for C++17 in the compiler call in the log. Perhaps this is the problem; min_element was introduced in C++17. (As far as I know, g++ should default to C++17 as the language standard, but I'm not sure whether it will include the proper headers for the standard library on Mac OS...)

DiamonDinoia commented 2 months ago

It seems fixed in master. v0.34 seems to have the issue; it is missing #include <algorithm>.
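
For reference, a minimal sketch (hypothetical, not the actual ducc0 source) of the pattern in mav.cc that fails when <algorithm> is not pulled in transitively, which is why the same code can compile with Apple's Clang toolchain but not with macOS GCC:

```cpp
// std::min_element is declared in <algorithm>; whether that header gets pulled
// in transitively by other standard headers differs between standard-library
// builds, so code that compiles with Apple Clang can fail with macOS GCC.
#include <algorithm>  // declares std::min_element; this is the missing include
#include <cstddef>
#include <vector>

std::size_t index_of_smallest(const std::vector<long> &strcrit, std::size_t lastdim)
{
  // Index of the smallest entry among the first 'lastdim' elements.
  return std::size_t(std::min_element(strcrit.begin(), strcrit.begin() + lastdim)
                     - strcrit.begin());
}
```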

DiamonDinoia commented 2 months ago

However, master fft breaks one of our accuracy tests by a small amount: https://github.com/flatironinstitute/finufft/actions/runs/10184292576/job/28174805483

mreineck commented 2 months ago

It seems fixed in master. v0.34 seems to have the issue; it is missing #include <algorithm>.

How do we address this? Should I release version 0.35 soon?

mreineck commented 2 months ago

However, master fft breaks one of our accuracy tests by a small amount:

OK, this is strange. I need to take a closer look at the details...

DiamonDinoia commented 2 months ago

It seems fixed in master. v0.34 seems to have the issue; it is missing #include <algorithm>.

How do we address this? Should I release version 0.35 soon?

v35 still has the issue of the missing include. I can point finufft to master, but we should check why the accuracy is slightly off without fast-math.

mreineck commented 2 months ago

Version 0.35 is not out yet ... when you say "v35 still has the issue of the missing include", what did you test against?

It is possible that FFT accuracy is a tiny little bit worse without fast math, since then FMA instructions are disabled, which incurs two rounding errors instead of one for a multiply-and-add. Still, the difference is so small that it really shouldn't make the difference between pass and fail in a NUFFT test...
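
As a standalone illustration of the one-rounding-versus-two point (a toy sketch, unrelated to the actual finufft test data):

```cpp
// A fused multiply-add rounds once, while a separate multiply followed by an
// add rounds twice, so the two results can differ in the last bits.
#include <cmath>
#include <cstdio>

int main() {
  double a = 1.0 + 1e-8, b = 1.0 - 1e-8, c = -1.0;
  double separate = a * b + c;          // two roundings (unless the compiler contracts it)
  double fused    = std::fma(a, b, c);  // a*b kept exact internally, then one rounding
  std::printf("separate: %.17g\nfused:    %.17g\n", separate, fused);
  return 0;
}
```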

DiamonDinoia commented 2 months ago

Version 0.35 is not out yet ... when you say "v35 still has the issue of the missing include", what did you test against?

The commit that updated the changelog here: https://github.com/mreineck/ducc/commit/920155c23631d131c30d4127ce6e16fad3e05533

It is possible that FFT accuracy is a tiny little bit worse without fast math, since then FMA instructions are disabled, which incurs two rounding errors instead of one for a multiply-and-add. Still, the difference is so small that it really shouldn't make the difference between pass and fail in a NUFFT test...

We manually add FMA flags to ducc and finufft to avoid that issue. But it seems that in single precision, for type 3 (which does multiple FFTs), the test does not pass. We do a NUFFT with requested precision 1e-5 and we allow the result to be off by 2e-4 but we get an error of 0.000275 when doing a big FFT and checking one resulting point against the analytical formula.

It is a tiny difference but worth investigating.

mreineck commented 2 months ago

The commit that updated the changelog here: https://github.com/mreineck/ducc/commit/920155c23631d131c30d4127ce6e16fad3e05533

Ah, OK, thanks! I update the ChangeLog for the upcoming version from time to time, to keep track of the new features. As long as the version number in setup.py doesn't change, this doesn't mean much :-)

We do a NUFFT with requested precision 1e-5 and we allow the result to be off by 2e-4 but we get an error of 0.000275 when doing a big FFT and checking one resulting point against the analytical formula.

I'm looking at roughly line 188 in finufft3d_test.cpp. Is that the correct place? If so, I'm not entirely sure that the error calculation does the correct thing. Imagine having a single nonuniform output point, which (let's assume) happens to have value F=0. In that situation, everything explodes, and for small F, measured errors increase a lot. I don't think that measuring the relative error on the output side will be the correct way of testing; you probably need to compare to absolute magnitudes on the input side, but it's definitely not trivial.
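
A toy sketch of that concern, with made-up numbers (not the finufft test itself):

```cpp
// With a single near-zero reference value, the pointwise relative error
// explodes even though the absolute error is tiny; normalizing by the
// infinity norm over all outputs keeps the measure bounded.
#include <cmath>
#include <cstdio>

int main() {
  double Ftrue   = 1e-12;         // a nearly vanishing "exact" output value
  double F       = Ftrue + 1e-7;  // computed value with absolute error 1e-7
  double infnorm = 1.0;           // suppose some other output has magnitude ~1
  std::printf("pointwise relative error:  %g\n", std::fabs(F - Ftrue) / std::fabs(Ftrue));
  std::printf("inf-norm normalized error: %g\n", std::fabs(F - Ftrue) / infnorm);
  return 0;
}
```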

DiamonDinoia commented 2 months ago

@ahbarnett, @lu1and10 what do you think?

lu1and10 commented 2 months ago

@ahbarnett, @lu1and10 what do you think?

I'm not sure what the test check tolerance should be; @ahbarnett may have more insight. In the failing single-precision test, the code with ducc0 gives an error of 2.75e-4 without the fp:fast flag, while the test check tolerance is 2e-4. They are of the same order of magnitude, though, and the test would pass if we adjusted the tolerance to 3e-4.

@ahbarnett The failing error of 2.75e-4 comes from this line: https://github.com/flatironinstitute/finufft/blob/893bf6f2acf8afd4a716d13f534f517c547ccb02/test/finufft3d_test.cpp#L188

lu1and10 commented 2 months ago

How do we address this? Should I release version 0.35 soon?

@mreineck it would be good to release 0.35 with the header fix at some point. The latest commit on the default branch seems to work, but ducc0_0_34_0 still has the issue of GCC on macOS not finding the min_element header. @ahbarnett the makefile tries to pick ducc0_0_34_0 (https://github.com/flatironinstitute/finufft/blob/7ecc0c1460e27de731d8dffb7220ccd4055bf879/makefile#L69), so it may fail with macOS GCC.

ahbarnett commented 2 months ago

Several issues here. Firstly, I thought ducc/CMakeLists.txt and cmake/setupDUCC.cmake enforced this via target_compile_features(ducc0 PRIVATE cxx_std_17), so I don't understand why C++17 wasn't being used in all DUCC builds.

Now to math tests in FINUFFT: indeed I have been relying on pseudorandom numbers (with fixed seed), and testing both i) the error at one output point (relative to the maximum across all outputs, ie the infinity-norm), and ii) the relative l2 error for the vector of outputs. So, firstly, Martin's concern about "divide by a small number" cannot happen (I already thought of that in 2017...). You'll see the division by the infnorm here: https://github.com/flatironinstitute/finufft/blob/893bf6f2acf8afd4a716d13f534f517c547ccb02/test/finufft3d_test.cpp#L188

The point of having both i) and ii) available is that both tests have their place. For small NM, as in our CI (e.g. test/check_finufft.sh SINGLE), it makes sense to test both.

Now, errors in FINUFFT should be dominated by the kernel choice, not by the FFT. And the pseudorandom variation in the FINUFFT test codes cannot be affected by the choice of FFT. So it is concerning that a change to the DUCC0 FFT flags causes a change in FINUFFT errors. Libin, is there a way you could drill down numerically to see what effect the fast flag has on the FINUFFT type 1 output vector (test could be done in matlab, since convenient) ? Or make sure this isn't a change in the thread number in CI? (see below).

For now I am OK with the idea that the random generator differs between OSX and Linux, so the FINUFFT single-precision threshold could be raised from 2e-4 to 3e-4. I just noticed that the seed is set to the omp_thread_num, so if the test is run with a different number of threads the results will differ. I am now thinking of removing test i) above from the exit code in finufft?d_test.cpp, to make it less stochastic.
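
A sketch of why seeding from the thread index makes the generated test data depend on the thread count (hypothetical code, not the actual finufft test generator):

```cpp
// Each thread seeds its own generator with its thread id, and the parallel
// loop splits the index range by thread count, so both the seeds and the
// mapping of array entries to generators change with the number of threads.
#include <omp.h>
#include <random>
#include <vector>

void fill_random(std::vector<double> &x) {
#pragma omp parallel
  {
    std::mt19937 gen(omp_get_thread_num());  // seed depends on the thread id
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
#pragma omp for
    for (long long i = 0; i < (long long)x.size(); ++i)
      x[i] = dist(gen);
  }
}
```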

lu1and10 commented 2 months ago

Libin, is there a way you could drill down numerically to see what effect the fast flag has on the FINUFFT type 1 output vector (test could be done in matlab, since convenient) ? Or make sure this isn't a change in the thread number in CI? (see below).

When @DiamonDinoia and I looked at the failing CI runs, it turned out not to be deterministic. I recall that sometimes, if we rerun once, the error is less than 2e-4 and the test passes; sometimes it is 2.75e-4. All of this only happens with Windows MSVC; Linux and Mac were fine. Maybe it is related to the multithreading in ducc. With FFTW, though, it is always less than 2e-4 on all OSes.

I'm not sure how to drill down into which sub-flags of MSVC's fp:fast make the error smaller. @DiamonDinoia, do you know how to debug in MSVC which specific optimization is causing this? Normally I would expect fp:fast to increase the error, but here fp:fast reduced the error with ducc?

DiamonDinoia commented 2 months ago

/fp:fast can make the error smaller. It might rearrange the instructions so that an FMA is possible by converting divisions to multiplications.

How to debug? Not sure, I cannot trigger the issue on my windows laptop.

Given that it is not deterministic, it could be caused by the RNG; MSVC likely uses a different implementation than the other compilers.

lu1and10 commented 2 months ago

How to debug? Not sure, I cannot trigger the issue on my windows laptop.

I guess we see the error more often because of the multiple MSVC runs in the CI matrix. If it were deterministic with one thread and the same RNG seed, it would be easier to debug.

DiamonDinoia commented 2 months ago

Part of the FINUFFT de-macroization process includes using C++ RNGs; some of them, like MT32_64, have the same implementation across platforms, which will allow consistent tests everywhere. So we can delay this, as it seems to be a testing issue more than a FINUFFT/DUCC issue.
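
A sketch of the portability point, assuming MT32_64 refers to the standard Mersenne Twister engines (std::mt19937 / std::mt19937_64):

```cpp
// The <random> engines are fully specified by the C++ standard, so a fixed
// seed yields the same raw sequence with MSVC, GCC, and Clang. Note that the
// std:: distributions are not guaranteed to be bit-identical across standard
// libraries, so fully portable tests should draw from the engine directly or
// ship their own distribution code.
#include <cinttypes>
#include <cstdio>
#include <random>

int main() {
  std::mt19937_64 gen(42);  // fixed, portable seed
  for (int i = 0; i < 3; ++i)
    std::printf("%" PRIu64 "\n", (std::uint64_t)gen());
  return 0;
}
```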

mreineck commented 2 months ago

Sorry for the delay, I took a few days off ...

So, firstly, Martin's concern about "divide by a small number" cannot happen (I already thought of that in 2017...). You'll see the division by the infnorm here:

It can happen if the number of outputs is 1, and I think that is the case here. (It could also happen for more than one output, but I agree that this becomes extremely improbable with increasing number of outputs.)

[Edit: the reason I think this is Marco's statement further above that the error appears "when doing a big FFT and checking one resulting point against the analytical formula", but I might be misinterpreting this.]

mreineck commented 2 months ago

Regarding reproducibility of ducc.fft results with multithreading: the design is such that the results should be fully deterministic as long as the requested number of threads does not change, but results from runs with different numbers of threads may be very slightly different.

DiamonDinoia commented 2 months ago

Hi Martin, before releasing ducc I would like to improve the CMake file. Can we also add a makefile so that we can include it in finufft?

Also, may I ask why you reverted to -march=native instead of the flags I proposed? I noticed the error gets smaller, but that comes with some issues.

mreineck commented 2 months ago

The issue was that -fcx-limited-range is not recognized by some versions of clang (v16 and v17 give an error, v18 can deal with it), and so some CI runs failed.

We can add a Makefile as well, but I'm not sure how much value that will add. In the end, compiler flags will always depend on the "calling" application.

DiamonDinoia commented 2 months ago

Indeed, LLVM does not support it yet. That flag can be removed. All it does is make complex division/multiplication faster by skipping NaN checking. One can overload std::complex or write one's own class that skips these checks if that is crucial.
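
As a hedged sketch of that idea (a hand-rolled multiply, not ducc0's actual code):

```cpp
// Plain complex multiplication without the extra Inf/NaN recovery steps that
// the usual C/C++ rules (C Annex G) require; this is essentially what
// -fcx-limited-range allows the compiler to emit. Where one operand is
// infinite, this version may return NaN instead of the correct infinity.
#include <complex>

template <typename T>
std::complex<T> mul_limited_range(const std::complex<T> &a, const std::complex<T> &b) {
  return {a.real() * b.real() - a.imag() * b.imag(),
          a.real() * b.imag() + a.imag() * b.real()};
}
```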

In CMake, library flags are usually not set by the user; each library is compiled with its own flags.

mreineck commented 2 months ago

I don't think I will go to such lengths just to avoid the -ffast-math flag. As long as it is not used in the linking step, this is not problematic. And yes, extra checks within the complex multiplication routine would be quite bad for some parts of ducc0.

ahbarnett commented 2 months ago

OK. We have added ducc0 to our makefile, so ducc fft is fully switchable now. We have also tweaked the test/finufft?d_test executables to exclude the one-output relative error test from the exit code (when the full direct output error is available), which will reduce random variation. We are getting 2.3 out, and then we will be able to think more about templating and kernel-function tweaks; happy to have you join meetings if you like. Best, Alex

DiamonDinoia commented 2 months ago

I don't think I will go to such lengths just to avoid the -ffast-math flag. As long as it is not used in the linking step, this is not problematic. And yes, extra checks within the complex multiplication routine would be quite bad for some parts of ducc0.

I think using -ffinite-math-only together with the other flags would achieve the same effect.

The problem with -ffast-math is that a Python library compiled with it (Python loads shared libraries when loading C/C++ extensions) broke my code by changing the floating-point environment of my CPU, which broke code that ran afterwards.

mreineck commented 2 months ago

I fully agree that -ffast-math can break things, but again (and the links confirm that, as far as I can tell) only if you link the shared library with that flag, not when you compile the object files for the shared library with the flag.

Is it possible to distinguish between these two situations with CMake? I'm doing this in my setup.py approach, and I'm fairly sure the resulting .so is safe to load.

mreineck commented 2 months ago

g++ -ffast-math -c a.cc should be fine and does not have a bad influence on other code

g++ -ffast-math a.o -shared -o libfoo.so is dangerous and should be avoided at all costs.

DiamonDinoia commented 2 months ago

In CMake there are:

target_compile_options( [target] [PRIVATE|PUBLIC|INTERFACE] [flags])  # compile only
target_link_options( [target] [PRIVATE|PUBLIC|INTERFACE] [flags]) # link options
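
For example, a sketch of keeping the flag compile-only (assuming the target is called ducc0; not necessarily the flags ducc's own build uses):

```cmake
# Hypothetical sketch: apply -ffast-math when compiling ducc0's objects only,
# and never put it on the link line. Linking with -ffast-math pulls in
# crtfastmath.o, which flips the FP environment (FTZ/DAZ) for every process
# that loads the resulting shared library.
target_compile_options(ducc0 PRIVATE $<$<COMPILE_LANGUAGE:CXX>:-ffast-math>)
# Deliberately no matching -ffast-math entry in target_link_options(ducc0 ...).
```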
mreineck commented 2 months ago

Then I think having -ffast-math in target_compile_options should be safe.