gwastro / pycbc

Core package to analyze gravitational-wave data, find signals, and study their parameters. This package was used in the first direct detection of gravitational waves (GW150914), and is used in the ongoing analysis of LIGO/Virgo data.
http://pycbc.org
GNU General Public License v3.0
313 stars 347 forks source link

PyCBC threshold code is broken on intel mac with openMP #4828

Open spxiwh opened 1 month ago

spxiwh commented 1 month ago

See discussion here https://github.com/conda-forge/pycbc-feedstock/pull/84. There is some issue with the openmp code used for the threshold code, which appears to manifest on old intel macs. This is covered (thankfully) by our unit tests, but it is possible that whatever is causing this will appear in other contexts in the future. We should understand and fix ... But we need to be able to reproduce it first!

josh-willis commented 1 month ago

I unfortunately don't have time to really look into this, but I will record here my initial suggestion, especially as I think we now know that this only occurs with OpenMP enabled.

If someone can reproduced this, I would recommend seeing if we can change these lines to instead read:

            #pragma omp ordered
            {
                t+=c;
            }
            #pragma omp ordered
            {
                memmove(outl+t-c, outl+start, sizeof(unsigned int)*c);
                memmove(outv+t-c, outv+start, sizeof(std::complex<float>)*c);
            }

That is, enure that the memmove are performed inside a barrier that enforces they happen sequentially. It's just a suggestion, but if anyone ends up able to reproduce the problem, it should be easy to test if that fixes it. But note that first you also have to ensure that the total number of all points above threshhold (t) has been found.

spxiwh commented 1 month ago

Let me ping @duncanmmacleod on this as well, as this will affect conda builds until this is resolved.

spxiwh commented 1 month ago

I've been trying to reproduce this on my ARM mac, using it's builtin x86_64 emulation. I could not reproduce the failure.

One question: @duncanmmacleod in your conda patch, you only include the omp header files in the compilation. You do not add -fopenmp to the linking stage back in setup.py, which I think is needed to actually use openmp (at least the .so files produced are not linked to libomp unless I add this ... But I don't see the failrues in any case).

Noting that this is x86_64 bit compiled:

(intel_env) [iwharry@harrymac events]$ file simd_threshold_cython.cpython-312-darwin.so 
simd_threshold_cython.cpython-312-darwin.so: Mach-O 64-bit bundle x86_64
spxiwh commented 1 month ago

@duncanmmacleod This is possibly version dependent (OMP version, compiler version etc.). You said you had reproduced this on an old mac, is there anything I can use from that to reproduce here? (Is this a fully isolated conda env for e.g., could I have the package list ... If using compilers installed system-wide, what versions are they? )

spxiwh commented 2 weeks ago

@duncanmmacleod Just a poke to ask again if you have any more details on how you reproduced this? I've been unable to test Josh's proposed fix, without being able to reproduce the error!

titodalcanton commented 2 weeks ago

I have an Intel Mac, I am testing to see if I can reproduce the problem. I am having trouble compiling/linking against OpenMP however:

clang: error: unsupported option '-fopenmp'

and I seem to have

$ clang --version
Apple clang version 11.0.0 (clang-1100.0.33.16)
Target: x86_64-apple-darwin23.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

[some minutes later]

Ok, by installing a local version of clang like so:

conda install clang_osx-64 clangxx_osx-64

I got PyCBC to compile and fail the test!

$ clang --version              
clang version 14.0.6
Target: x86_64-apple-darwin23.6.0
Thread model: posix
InstalledDir: /Users/tito/miniconda3/envs/pycbc-dev/bin
$ python test/test_threshold.py 
========================================================================
Running CPU unit tests for Threshold:
test_threshold (__main__.TestThreshold.test_threshold) ... 14215 14215
FAIL

======================================================================
FAIL: test_threshold (__main__.TestThreshold.test_threshold)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/tito/Documents/projects/pycbc/pycbc/test/test_threshold.py", line 56, in test_threshold
    self.assertTrue((vals == self.vals).all())
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 0.054s

FAILED (failures=1)

[some minutes later]

After some extra prints I get that locs is identical, but vals has 4281 differences out of 14026. The differences happen because the SIMD version returns vals = 0 when loc happens to be in two ranges, [65545, 131031] and [589840, 655289] (though I imagine the change actually happens at power of 2 boundaries).

After some more investigation with several runs (the random seed changes each time), it seems that the discrepancies happen within ranges of loc of length 65536, and they only happen for the "odd blocks":

loc 65442 equal, loc 65549 different: 65536 = 65536 * 1 lies in between
loc 131064 different, loc 131248 equal: 131072 = 65536 * 2 lies in between
loc 196546 equal, loc 196645 different: 196608 = 65536 * 3 lies in between
loc 261930 different, loc 262146 equal: 262144 = 65536 * 4 lies in between
loc 327564 equal, loc 327847 different: 327680 = 65536 * 5 lies in between
loc 393192 different, loc 393341 equal: 393216 = 65536 * 6 lies in between
loc 458694 equal, loc 459019 different: 458752 = 65536 * 7 lies in between
loc 524226 different, loc 524342 equal: 524288 = 65536 * 8 lies in between
loc 589774 equal, loc 589870 different: 589824 = 65536 * 9 lies in between
loc 655345 different, loc 655370 equal: 655360 = 65536 * 10 lies in between

Not sure what this is telling me without going through the code.

Perhaps this way of visualizing it is a bit clearer:

locs in [0, 65536) are all equal (961 values)
locs in [65536, 131072) are all different (876 values)
locs in [131072, 196608) are all equal (855 values)
locs in [196608, 262144) are all different (848 values)
locs in [262144, 327680) are all equal (920 values)
locs in [327680, 393216) are all equal (830 values)
locs in [393216, 458752) are all equal (849 values)
locs in [458752, 524288) are all different (926 values)
locs in [524288, 589824) are all equal (862 values)
locs in [589824, 655360) are all different (847 values)
locs in [655360, 720896) are all equal (885 values)
locs in [720896, 786432) are all equal (853 values)
locs in [786432, 851968) are all equal (908 values)
locs in [851968, 917504) are all different (839 values)
locs in [917504, 983040) are all equal (833 values)
locs in [983040, 1048576) are all equal (889 values)

and apparently the odd/even alternance is not repeatable, here is another run:

locs in [0, 65536) are all equal (859 values)
locs in [65536, 131072) are all different (861 values)
locs in [131072, 196608) are all equal (872 values)
locs in [196608, 262144) are all different (904 values)
locs in [262144, 327680) are all equal (869 values)
locs in [327680, 393216) are all different (881 values)
locs in [393216, 458752) are all equal (929 values)
locs in [458752, 524288) are all equal (909 values)
locs in [524288, 589824) are all equal (865 values)
locs in [589824, 655360) are all equal (876 values)
locs in [655360, 720896) are all equal (913 values)
locs in [720896, 786432) are all different (887 values)
locs in [786432, 851968) are all equal (883 values)
locs in [851968, 917504) are all different (892 values)
locs in [917504, 983040) are all equal (810 values)
locs in [983040, 1048576) are all equal (862 values)

So it seems that the discrepancy is confined to ranges in locs that have length 65536, and inside each range either all are consistent or all are different, but whether a given block is consistent or different is random.

I am going to stop here for today, but I can do more tests if needed.