manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

special handling of linear binning #258

Closed - adematti closed this 2 years ago

adematti commented 2 years ago

Hi,

As discussed with Lehman for DESI applications, this is a first attempt to implement special handling of linear binning, which helps Corrfunc run faster with a large number of bins. There is no speed gain at a low number of bins (~10), but a speed gain of more than 3x for ~200 bins. I added a flag bin_type to the Python wrapper to choose between 'auto', 'lin', and 'custom': 'lin' selects linear binning, and 'auto' (the default) automatically detects linear binning (which happens here: https://github.com/adematti/Corrfunc/blob/c500d2c9137ff7f66d34eb4189bc19f9325d84ff/utils/utils.c#L85).

The following pair counters have been updated:

- mocks: DDrppi_mocks, DDsmu_mocks
- theory: DD, DDrppi, DDsmu, xi, wp

I did not update DDtheta_mocks (yet), as there are implementation choices to be made:

a) should I consider the binning to be linear in theta or in acos? (I guess the former, as the pair counts for the latter can be obtained from DD if the maximum separation < \pi)

b) should I use a fast acos to compute the bin for a given pair? In that case, a pair can fall in the wrong bin due to numerical approximation.

Also to be discussed:

1) should I pass rstep (the bin width) and rmin to the kernels directly? (currently I take the square root of sqr_rmin, sqr_rmax, which is slightly suboptimal)

2) should I print the chosen binning type? If so, where - in each countpairs_DOUBLE function?

3) should I keep the 'auto' option? It checks for linear binning with an absolute and relative tolerance of 1e-12 for both doubles and floats. I am not sure how it behaves in practice for the latter case (floats); see remark ii) below.

I also have a couple of remarks:

i) this line https://github.com/manodeep/Corrfunc/blob/596fe77078d59b296b34608927e301c427331919/Corrfunc/utils.py#L461 seems unnecessary (array updates are done in place, and callers of this function do not use the returned arrays)

ii) I don't think setup_bins_float is used anywhere in the code, and setup_bins_double is only used here: https://github.com/manodeep/Corrfunc/blob/74c6fc29f9a0236eaebbc1830a1f59fd9a53bfc8/mocks/DDtheta_mocks/countpairs_theta_mocks_impl.c.src#L534. Is there any reason not to use a single version of this function, setup_bins?

iii) Some time ago I had binning errors with Corrfunc after calling matplotlib. I figured out this was due to how the decimal separator for floats was handled (matplotlib changed the environment variables specifying this convention; export LC_NUMERIC=C solved it). To prevent this from happening to other people, I guess the simplest solution would be to pass bins as an array rather than a file read from disk. Is there any reason not to do so?
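For reference, the 'auto' detection described above can be sketched in Python. This is only an illustration of the idea (equal bin widths within the stated 1e-12 tolerances), not the actual C implementation; the function name detect_bin_type is hypothetical:

```python
import numpy as np

def detect_bin_type(bins, atol=1e-12, rtol=1e-12):
    """Hypothetical sketch of 'auto' bin-type detection: return 'lin'
    if all bin widths are equal within tolerance, else 'custom'."""
    widths = np.diff(np.asarray(bins, dtype=np.float64))
    if np.allclose(widths, widths[0], rtol=rtol, atol=atol):
        return 'lin'
    return 'custom'

# np.linspace produces equal-width bins -> detected as linear
print(detect_bin_type(np.linspace(0.1, 7.0, 8)))   # 'lin'
# log-spaced bins have growing widths -> custom
print(detect_bin_type(np.logspace(0.0, 1.0, 8)))   # 'custom'
```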

Thanks! Best, Arnaud

manodeep commented 2 years ago

I can't easily test with icc because there's no available Python installed that was built with icc. Trying through conda - will report back

edit: This strongly suggests to me that we need to offload the testing into C - something like the INTEGRATION_TESTS macro, but specifically for the binning.

lgarrison commented 2 years ago

Can you try disabling -march=native entirely? And/or try without -O3?

manodeep commented 2 years ago

Can you try disabling -march=native entirely? And/or try without -O3?

Still fails with gcc/9 with both -march=native and -O3 off

lgarrison commented 2 years ago

That's really interesting. Maybe we are triggering undefined behavior or a thread race. But it sounds like it's probably failing even single-threaded?

manodeep commented 2 years ago

The supercomputer admin (Robin) pointed me to a combination of python + numpy modules to use with the Intel compiler. Once I loaded those up along with the intel-2018 compiler, the tests passed. So this is looking more like a compiler bug at the moment

I have a ticket open with the supercomputer helpdesk, and will report back on what I find.

lgarrison commented 2 years ago

I tried to reproduce @manodeep's issue using gcc/9.3.0 (I don't have easy access to 9.2.0), but wasn't able to. I ran the linear binning under the clang address/undefined/memory sanitizers, but I didn't find any issues. However, the executable I tested was theory/tests/test_periodic (modified with options.bin_type=BIN_LIN;), and I'm not sure if that executable exhibits @manodeep's issue. Sanitizing the Python extensions is hard because of all the false positives from Python!

I did find a few warnings from the thread sanitizer, but none of them seem related to this issue. I couldn't make sense of all of the messages, but at least one of them did seem to be real, albeit probably not in a way that affected anything. I've pushed a fix for that, just in case.

So... any ideas on the next steps? @manodeep, just to confirm, the issue is deterministic and happens for any number of threads?

manodeep commented 2 years ago

Yup the failure is deterministic - fails the first time every time. I just tried with nthreads=1, 2, 3 and the tests failed in all cases

lgarrison commented 2 years ago

@manodeep to aid the debugging, I've created a stripped-down test in theory/tests/test_periodic_lin.c. It runs linear binning and compares against the custom-binning answer. It passes for me, but my hope is that it will fail for you and that you might be able to try simple variations of it to narrow down the cause. It's also simple enough that the sanitizers may yield useful information. Would you mind trying it out? You can run it with make test_periodic_lin.

manodeep commented 2 years ago

That test passes! Uses the AVX512 kernel too (at least according to output)

-------COMPILE SETTINGS------------
         MAKE            = ["make"]
         CC              = ["gcc"]
         OPT             = ["-DPERIODIC -DENABLE_MIN_SEP_OPT  -DCOPY_PARTICLES  -DUSE_OMP"]
         CFLAGS          = [" -DVERSION=\"2.4.0\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4  -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2  -Wpacked  -Wnested-externs -Wpointer-arith  -Wredundant-decls  -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes  -Wnested-externs -Wstrict-prototypes   -Wno-unused-local-typedefs "]
         CLINK           = [" -lrt  -fopenmp -lm"]
         PYTHON          = ["python"]
         GSL_CFLAGS      = ["-I/apps/skylake/software/GSL/2.5-GCC-9.2.0/include"]
         GSL_LINK        = ["-L/apps/skylake/software/GSL/2.5-GCC-9.2.0/lib -lgsl -lgslcblas -lm -Xlinker -rpath -Xlinker /apps/skylake/software/GSL/2.5-GCC-9.2.0/lib"]
         PYTHON_CFLAGS   = ["-isystem/apps/skylake/software/Python/3.8.5-gni-2020.0/include/python3.8 -isystem /apps/skylake/software/numpy/1.19.2-gni-2020.0-Python-3.8.5/lib/python3.8/site-packages/numpy-1.19.2-py3.8-linux-x86_64.egg/numpy/core/include/numpy/"]
-------END OF COMPILE SETTINGS------------

./test_periodic_lin
WARNING: File was written in a different precision than requested (file precision = 4 requested precision = 8)
Linear binning
Running with points in [xmin,xmax] = 0.000400,419.999817 with periodic wrapping = 420.000000
Running with points in [ymin,ymax] = 0.000100,419.999512 with periodic wrapping = 420.000000
Running with points in [zmin,zmax] = 0.000300,419.999207 with periodic wrapping = 420.000000
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 52,52,52.  Time taken =   0.107 sec
Using AVX512 kernel
0%.........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done. Time taken =  0.620 secs
PASSED: Mr19 DD (periodic). Time taken =     0.87 seconds 
PASSED: ALL 1 tests. Total time =     0.93 seconds 

lgarrison commented 2 years ago

Darn... do you think you can try some variations to get it to fail? E.g. different kernels, number of bins, number of data points, auto vs cross-corr? Maybe the fastest way to get it to fail would be to write the exact particles used by the Python tests.py script to disk, and have this test_periodic_lin test load them instead of the Mr19 particles.

manodeep commented 2 years ago

Yesterday, I tested with -DINTEGRATION_TESTS and I got an illegal instruction signal (SIGILL) for the SSE kernel (the fallback kernel works fine) - that should not have happened. I can't reproduce that SIGILL when I add -DINTEGRATION_TESTS to the top of the source file test_periodic_lin.c, rather than to common.mk. From what I can tell, that's happening because of the sanitize options that get added to the compile line.

To check my sanity and any potential for user-error: make tests_bin passes but running python -m Corrfunc.tests immediately after, fails.

manodeep commented 2 years ago

@lgarrison Did you want me to merge in my updates from this PR into the linearbinning-rounding branch and then we can work on that branch? Otherwise, I can create a new branch and push to that

lgarrison commented 2 years ago

Actually I think we can just work in this branch (as long as @adematti doesn't mind!). I think the "Allow edits by maintainers" option is on; I've been able to push to this branch with no problems.

But if you have updates that logically should go in another branch then that's totally fine too.

manodeep commented 2 years ago

Alright changes pushed. The new tests fail on the supercomputer - I am curious to see what happens on the CI

lgarrison commented 2 years ago

The new tests do fail on the CI, but only by 4 pairs. I think this is an "ordinary" failure due to floating-point precision, and not the gcc/9.2 failure, which was by thousands of pairs. When you run these tests locally, how many pairs do they fail by?

lgarrison commented 2 years ago

Thanks to the excellent local scientific computing staff, I was able to get a gcc/9.2 binutils/2.33.1 toolchain to test with. Unfortunately, it did not exhibit the bug.

One suggestion that came up in conversation with them: gcc might be emitting some SIMD instructions even without -march=native. Can you try -march=x86-64 or something equally innocuous?

lgarrison commented 2 years ago

Okay, I've made some progress! I was able to reproduce a large failure and narrow it down to 3 particles. From there, I was able to find a bug in the sse42 kernel.

In the custom binning version, if a pair separation falls outside the bins, we mark this by setting a bit mask and setting separation to rmax. This happens here: https://github.com/adematti/Corrfunc/blob/eb3424a699771c077be900c78b1f7f7b8ead86d3/theory/DD/countpairs_kernels.c.src#L817-L825

But then in linear binning, we don't look at the bit mask, and just look at the separations. This causes the rmax separation to potentially fall in the last bin, depending on the exact floating point value of the inverse separation. This happens here: https://github.com/adematti/Corrfunc/blob/eb3424a699771c077be900c78b1f7f7b8ead86d3/theory/DD/countpairs_kernels.c.src#L835-L838
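The failure mode can be illustrated with a scalar Python sketch (illustrative only; the floor-based index follows the linear-binning scheme discussed in this PR, not the exact kernel code). Whether the clamped value lands at index nbin (rejected) or nbin - 1 (wrongly counted) depends on the rounding of the division:

```python
import math

def lin_bin_index(r, rmin, rmax, nbin):
    # linear binning: index = floor((r - rmin) / dr)
    dr = (rmax - rmin) / nbin
    return math.floor((r - rmin) / dr)

rmin, rmax, nbin = 0.1, 7.0, 7
# A rejected pair whose separation was clamped to exactly rmax:
# mathematically (rmax - rmin) / dr == nbin, i.e. out of range, but
# depending on floating-point rounding it can come out as nbin - 1
# and be counted in the last bin.
idx = lin_bin_index(rmax, rmin, rmax, nbin)
print(idx, "out of range" if idx >= nbin else "WRONGLY lands in last bin")
```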

I admit I'm not sure I'm following the SSE code correctly. It's not immediately obvious why we both set the bit mask and set the separation to rmax. @manodeep, do you know why?

I'm also not sure if this is @manodeep's issue. It's definitely a bug (the reproducer is below), but it doesn't seem to depend on the compiler version, and it happens for SSE, not fallback. But I haven't yet checked if the fallback/AVX/AVX512 kernels use the same pattern.

```python
#!/usr/bin/env python3
import Corrfunc
import numpy as np
from astropy.table import Table

pos = np.array(Corrfunc.io.read_catalog())
pos = pos.T
#p = pos[:10]
p = pos[[7,8,9]]
print(p)
for i in range(len(p)):
    for j in range(i+1,len(p)):
        dist = ((p[i] - p[j])**2).sum(axis=-1)**0.5
        print(f'{i},{j} {dist}')

nbin = 7
bins = np.linspace(0.1, 7, nbin+1)
isa = 'sse42'

res = Corrfunc.theory.DD(1, 1, bins, *p.T, boxsize=420., isa=isa, verbose=False, bin_type='custom')
res = Table(res)
reslin = Corrfunc.theory.DD(1, 1, bins, *p.T, boxsize=420., isa=isa, verbose=False, bin_type='lin')
reslin = Table(reslin)
print(res)
print(reslin)
```

```
(forge) lgarrison@worker1012:~/abacus-analysis/Corrfunc/gh258$ ./repro.py
[[ 8.06210041  0.4235      4.89410019]
 [11.9282999   4.38660002  4.54409981]
 [11.95542526  4.32622337  4.51485252]]
0,1 5.547626567065066
0,2 5.5256725303853615
1,2 0.07236385736491963
        rmin               rmax        ravg npairs weightavg
------------------ ------------------ ---- ------ ---------
               0.1 1.0857142857142859  0.0      0       0.0
1.0857142857142859 2.0714285714285716  0.0      0       0.0
2.0714285714285716 3.0571428571428574  0.0      0       0.0
3.0571428571428574  4.042857142857143  0.0      0       0.0
 4.042857142857143 5.0285714285714285  0.0      0       0.0
5.0285714285714285  6.014285714285714  0.0      4       0.0
 6.014285714285714                7.0  0.0      0       0.0
        rmin               rmax        ravg npairs weightavg
------------------ ------------------ ---- ------ ---------
               0.1 1.0857142857142859  0.0      0       0.0
1.0857142857142859 2.0714285714285716  0.0      0       0.0
2.0714285714285716 3.0571428571428574  0.0      0       0.0
3.0571428571428574  4.042857142857143  0.0      0       0.0
 4.042857142857143 5.0285714285714285  0.0      0       0.0
5.0285714285714285  6.014285714285714  0.0      4       0.0
 6.014285714285714                7.0  0.0      2       0.0   # <<< this pair should not be here
```
manodeep commented 2 years ago

Thanks to the excellent local scientific computing staff, I was able to get a gcc/9.2 binutils/2.33.1 toolchain to test with. Unfortunately, it did not exhibit the bug.

One suggestion that came up in conversation with them: gcc might be emitting some SIMD instructions even without -march=native. Can you try -march=x86-64 or something equally innocuous?

Was this on a CPU with AVX512F?

manodeep commented 2 years ago

The new tests do fail on the CI, but only by 4 pairs. I think this is an "ordinary" failure due to floating-point precision, and not the gcc/9.2 failure, which was by thousands of pairs. When you run these tests locally, how many pairs do they fail by?

Yup - agreed. This is likely a floating-point round-off issue rather than a more severe underlying bug. I was interested to see if the CI (on different hardware + software infra) failed on these new tests. Now that it looks like the CI fails, I will let the tests run through and see how many of the tests fail and by how much in each of the bins (i.e., are these floating point issues or something else)

lgarrison commented 2 years ago

Thanks to the excellent local scientific computing staff, I was able to get a gcc/9.2 binutils/2.33.1 toolchain to test with. Unfortunately, it did not exhibit the bug. One suggestion that came up in conversation with them: gcc might be emitting some SIMD instructions even without -march=native. Can you try -march=x86-64 or something equally innocuous?

Was this on a CPU with AVX512F?

Yes, I tried multiple CPUs, including those with and without AVX512.

manodeep commented 2 years ago

Thanks to the excellent local scientific computing staff, I was able to get a gcc/9.2 binutils/2.33.1 toolchain to test with. Unfortunately, it did not exhibit the bug. One suggestion that came up in conversation with them: gcc might be emitting some SIMD instructions even without -march=native. Can you try -march=x86-64 or something equally innocuous?

Was this on a CPU with AVX512F?

Yes, I tried multiple CPUs, including those with and without AVX512.

Huh - I am using gcc/9.2.0 and binutils/2.33.1. Wonder why we are seeing different results on effectively the same CPU + toolchain.

lgarrison commented 2 years ago

Yeah, it's weird, but also possible I messed up something in my tests, or that the failure is coming in elsewhere, like a linked library.

Any thoughts on if my read of the SSE bug is correct? And what the best way to fix it might be? I guess we could just overwrite r2 with a bigger value than rmax.

manodeep commented 2 years ago

I am still puzzled by the failure in the FALLBACK kernel, and the mis-match between the python results (as in the actual number of pairs, not just between the custom and linear bin types within a testing harness) and the C results.

Done reading the data - time taken =        1.2 seconds
Beginning Theory Correlation functions calculations
binfile = [ 5.1  6.1  7.1  8.1  9.1 10.1 11.1 12.1 13.1 14.1 15.1 16.1 17.1 18.1
 19.1 20.1 21.1 22.1 23.1 24.1 25.1 26.1 27.1 28.1 29.1 30.1]
Running 3-D correlation function DD(r) (isa=fallback)
Custom binning  
Running with points in [xmin,xmax] = 0.000400,419.999817 with periodic wrapping = 420.000000
Running with points in [ymin,ymax] = 0.000100,419.999512 with periodic wrapping = 420.000000
Running with points in [zmin,zmax] = 0.000300,419.999207 with periodic wrapping = 420.000000
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 27,27,13.  Time taken =   0.164 sec
Using fallback kernel
0%.........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done. Time taken = 44.157 secs
Linear binning
Running with points in [xmin,xmax] = 0.000400,419.999817 with periodic wrapping = 420.000000
Running with points in [ymin,ymax] = 0.000100,419.999512 with periodic wrapping = 420.000000
Running with points in [zmin,zmax] = 0.000300,419.999207 with periodic wrapping = 420.000000
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 27,27,13.  Time taken =   0.148 sec
0%.........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done. Time taken = 26.584 secs
Mis-match for npairs: (a, b) = [(62877472, 62754784), (78053816, 77985808), (95087768, 95001016), (113898272, 113848184), (134886224, 134795296), (157769872, 157678056), (181821432, 181754832), (207652120, 207602624), (235347640, 235250760), (264986328, 264816112), (296809304, 296635800), (330338968, 330253816), (365333792, 365246240), (403218408, 403068456), (442352288, 442215208), (483228064, 483199000), (526676640, 526695184), (571982752, 572006992), (619704272, 619652040), (669305888, 669225568), (721172136, 721062376), (774778408, 774635120), (830188704, 830089960), (888092192, 887990376), (946930304, 946695208)] 
Mis-match for weightavg: (a, b) = [(0.562825815712476, 0.5628193730680126), (0.5627846481782354, 0.5627211027502464), (0.5624910928362716, 0.5624627003045992), (0.5625675596678956, 0.5625419184184755), (0.5626204591622173, 0.5625463179074152), (0.5626152639512979, 0.5625515815364374), (0.5624795476034595, 0.5624025335691824), (0.5625979690855286, 0.5625576307882932), (0.5625224962095966, 0.5624903745251197), (0.5626959192348553, 0.5626562965625453), (0.5626557659138799, 0.5626197098970304), (0.5627131464760827, 0.5626851580675043), (0.5625921173936103, 0.5625494022708031), (0.562640937769052, 0.5626023279874809), (0.5626230714190515, 0.5625917733991326), (0.5626902686220372, 0.562672483557968), (0.5626678796933742, 0.5626525397298503), (0.5626999385007042, 0.5626957858102613), (0.5626380544624989, 0.5626038471708693), (0.5626336402866137, 0.5626080653046808), (0.5626252511128291, 0.5625942514881577), (0.5626175708534336, 0.5625871978209532), (0.5625824122697372, 0.5625531941158517), (0.5626055249865471, 0.5625807147083433), (0.5626742075638334, 0.5626401289302878)]
Traceback (most recent call last):
  File "/apps/skylake/software/Python/3.8.5-gni-2020.0/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/apps/skylake/software/Python/3.8.5-gni-2020.0/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 184, in <module>
    test_linear_binning_theory(isa=isa)
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 147, in test_linear_binning_theory
    test_bin_type(DD, autocorr, nthreads, binfile, x, y, z, weights1=w,
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 143, in test_bin_type
    assert allclose(*res)
AssertionError  
[rmin, rmax, nbins] = [5.10, 30.10, 25], isa =   FALLBACK Custom binning
Running with points in [xmin,xmax] = 0.000400,419.999817 with periodic wrapping = 420.000000
Running with points in [ymin,ymax] = 0.000100,419.999512 with periodic wrapping = 420.000000
Running with points in [zmin,zmax] = 0.000300,419.999207 with periodic wrapping = 420.000000
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 27,27,13.  Time taken =   0.231 sec
Using fallback kernel
0%.........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done. Time taken = 12.521 secs
Linear binning
Running with points in [xmin,xmax] = 0.000400,419.999817 with periodic wrapping = 420.000000
Running with points in [ymin,ymax] = 0.000100,419.999512 with periodic wrapping = 420.000000
Running with points in [zmin,zmax] = 0.000300,419.999207 with periodic wrapping = 420.000000
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 27,27,13.  Time taken =   0.155 sec
0%.........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done. Time taken =  7.410 secs
[nbin = 0] reference npairs = 15719370 computed npairs = 15719370
[nbin = 1] reference npairs = 19513442 computed npairs = 19513442
[nbin = 2] reference npairs = 23771964 computed npairs = 23771964
[nbin = 3] reference npairs = 28474552 computed npairs = 28474552
[nbin = 4] reference npairs = 33721542 computed npairs = 33721542
[nbin = 5] reference npairs = 39442494 computed npairs = 39442494
[nbin = 6] reference npairs = 45455368 computed npairs = 45455368
[nbin = 7] reference npairs = 51913026 computed npairs = 51913026
[nbin = 8] reference npairs = 58836836 computed npairs = 58836836
[nbin = 9] reference npairs = 66246674 computed npairs = 66246674
[nbin = 10] reference npairs = 74202316 computed npairs = 74202316
[nbin = 11] reference npairs = 82584722 computed npairs = 82584722
[nbin = 12] reference npairs = 91333442 computed npairs = 91333442
[nbin = 13] reference npairs = 100804634 computed npairs = 100804634
[nbin = 14] reference npairs = 110588016 computed npairs = 110588016
[nbin = 15] reference npairs = 120807052 computed npairs = 120807052
[nbin = 16] reference npairs = 131669182 computed npairs = 131669182
[nbin = 17] reference npairs = 142995722 computed npairs = 142995722
[nbin = 18] reference npairs = 154926102 computed npairs = 154926102
[nbin = 19] reference npairs = 167326390 computed npairs = 167326390
[nbin = 20] reference npairs = 180293032 computed npairs = 180293032
[nbin = 21] reference npairs = 193694562 computed npairs = 193694562
[nbin = 22] reference npairs = 207547282 computed npairs = 207547282
[nbin = 23] reference npairs = 222022944 computed npairs = 222022944
[nbin = 24] reference npairs = 236732684 computed npairs = 236732684
PASSED
ntests run = 1 nfailed = 0

I have checked that the bins are identical between C and python.

(A related question: why does the python interface take so much longer? When running the python tests, the output ceases for quite a while after reaching 100%. And it looks like the number of Python pairs is 4x the number of C pairs - which is suspiciously close to the number of threads.)

manodeep commented 2 years ago

Okay, I've made some progress! I was able to reproduce a large failure and narrow it down to 3 particles. From there, I was able to find a bug in the sse42 kernel.

In the custom binning version, if a pair separation falls outside the bins, we mark this by setting a bit mask and setting separation to rmax. This happens here: https://github.com/adematti/Corrfunc/blob/eb3424a699771c077be900c78b1f7f7b8ead86d3/theory/DD/countpairs_kernels.c.src#L817-L825

But then in linear binning, we don't look at the bit mask, and just look at the separations. This causes the rmax separation to potentially fall in the last bin, depending on the exact floating point value of the inverse separation. This happens here: https://github.com/adematti/Corrfunc/blob/eb3424a699771c077be900c78b1f7f7b8ead86d3/theory/DD/countpairs_kernels.c.src#L835-L838

To keep the same pattern as the other kernels, the computed bin indices will have to be blended with 0 wherever the bin mask is set; bin 0 is for separations < rmin or >= rmax. Adding something like m_rpbin = SSE_BLEND_FLOATS_WITH_MASK(m_zero, m_rpbin, m_mask_left); after line 836 should do the trick.
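In scalar/NumPy form, the effect of that blend is roughly the following. This is a hedged sketch of the logic, not the actual SIMD code; the helper name lin_bin_counts is made up for illustration:

```python
import numpy as np

def lin_bin_counts(r, rmin, rmax, nbin):
    # Sketch of linear binning with explicit out-of-range masking:
    # pairs outside the half-open range [rmin, rmax) are dropped
    # instead of being clamped to rmax, which is what the SSE
    # blend-with-zero fix achieves in masked form.
    r = np.asarray(r, dtype=np.float64)
    dr = (rmax - rmin) / nbin
    in_range = (r >= rmin) & (r < rmax)           # half-open bins
    idx = np.floor((r[in_range] - rmin) / dr).astype(np.int64)
    idx = np.clip(idx, 0, nbin - 1)               # guard against edge rounding
    return np.bincount(idx, minlength=nbin)

# a separation of exactly rmax is excluded, unlike in the buggy kernel
print(lin_bin_counts([0.5, 3.55, 7.0], 0.1, 7.0, 7))
```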

I admit I'm not sure I'm following the SSE code correctly. It's not immediately obvious why we both set the bit mask and set the separation to rmax. @manodeep, do you know why?

From what I recall, I chose rmax because that way I would always have to make sure I did not accidentally have a <= rmax condition somewhere. (I am pedantic that bins are half-open -- lower limits are inclusive, and upper-limits are exclusive, and that behaviour should not change for the last bin)

I'm also not sure if this is @manodeep's issue. It's definitely a bug (the reproducer is below), but it doesn't seem to depend on the compiler version, and it happens for SSE, not fallback. But I haven't yet checked if the fallback/AVX/AVX512 kernels use the same pattern.

I suspect this is different.

lgarrison commented 2 years ago

I am still puzzled by the failure in the FALLBACK kernel, and the mis-match between the python results (as in the actual number of pairs, not just between the custom and linear bin types within a testing harness) and the C results.

(A related question: why does the python interface take so much longer? When running the python tests, the output ceases for quite a while after reaching 100%. And it looks like the number of Python pairs is 4x the number of C pairs - which is suspiciously close to the number of threads.)

Great catch. I think this strongly suggests this is an OpenMP library/linking issue. For example, the Corrfunc C extension is linked against one OpenMP library, but another OpenMP library gets loaded when a shared library is dlopen()'d by Python during a package import. Maybe this could happen via Numpy, especially if it is backed by a multi-threaded BLAS. Or maybe via GSL (although that would just be a linking issue and not happen via dlopen()). I'm not clear on exactly the conditions under which this will cause a problem, but maybe it's something of this flavor.

Similar issues were reported in #197.

I would suggest disabling OpenMP in common.mk then retesting; I bet that will fix the issue. As to how to fix the issue with OpenMP enabled, I'm not sure what we can do from the Corrfunc side; it seems to depend on the compiler/other libraries. But let's confirm that it is an OpenMP issue first.
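One hedged way to probe the two-runtimes hypothesis from within Python is to look at which OpenMP-like shared libraries are mapped into the process after importing numpy and the Corrfunc extension. This is a Linux-only sketch (it reads /proc/self/maps), and the helper name is made up:

```python
import re

def loaded_openmp_runtimes():
    """List shared libraries that look like OpenMP runtimes (e.g.
    libgomp, libomp, libiomp5) mapped into the current process.
    Seeing more than one distinct runtime would support the
    library-mismatch hypothesis above."""
    libs = set()
    with open('/proc/self/maps') as f:        # Linux-specific
        for line in f:
            m = re.search(r'(lib(?:g|i)?omp[\w.]*\.so[\w.]*)', line)
            if m:
                libs.add(m.group(1))
    return sorted(libs)

print(loaded_openmp_runtimes())
```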

manodeep commented 2 years ago

Yup - after disabling OPENMP and reinstalling the python extension, the python tests pass! Hooorraay!

This is (potentially) very worrying! There are no tests on the python side - the assumption is that if the C tests are passing, then so should the python tests. We need to turn the call_correlation_function* codes into real tests. I think the actual calls already mirror what the C tests do; we just need to add comparisons against the known correct results.

lgarrison commented 2 years ago

@adematti is this something you could help with? The change is simply to modify this function (and the mocks one) to read in the correct results from the test directory (just like the bins) and assert that the results match.

This should be the last piece needed to close this PR.

adematti commented 2 years ago

Yes! I'll do that this weekend. Thank you so much for all the debugging!

lgarrison commented 2 years ago

Great -- thanks so much. @manodeep, can you confirm that these tests detect your issue? Then I think we are ready to merge.

manodeep commented 2 years ago

Sorry to delay this PR further, but I would prefer to separate the additional testing harness out into a different PR. We don't have any tests for the python interface - the call_correlation_functions scripts were meant to show how to directly call the python extensions and were not intended to be actual tests. The python API is quite stable, and it would be better to add tests following the traditional python conventions (say, pytest) and with filenames/function names that conform to the usual conventions.

Relatedly, we should deprecate and remove these call_correlation_functions scripts - they are no longer necessary and potentially expose the user to additional options that are not suitable for general use - perhaps as quickly as the next major release, i.e., v2.5.

manodeep commented 2 years ago

Should have added - @adematti, please feel free to say that you would prefer someone else to take on the testing PR

manodeep commented 2 years ago

An update on the race condition: switching to static OpenMP scheduling in countpairs_impl.c.src seems to reduce the severity of the issue - no longer 4x counts. Sadly, the tests still fail from python, and the number counts do not match what I see from C tests

lgarrison commented 2 years ago

Thanks, would be a nice bonus to get to the bottom of this issue! I can do the testing PR probably on Friday if you or @adematti don't have time to get to it before then. I think it won't be that much work now that the Python tests are hooked up to compare against the reference.

manodeep commented 2 years ago

@lgarrison My knowledge of python is really not good enough to pull off the testing PR - you can almost certainly do that way faster than me!

One thing I would recommend is to keep the testing code intentionally very simple - the python testing setup in this PR is quite complex

lgarrison commented 2 years ago

Okay, there was a bug in the C tests that took a while to track down. The bins were being written to file with the %g specifier, which uses only 6 digits by default even for doubles. More precision is needed to match the linear binning. Switching to an exact representation with %a fixed the issue.
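The precision loss is easy to demonstrate in Python (Python's % formatting has no %a, but float.hex()/float.fromhex() give the same exact round-trip as C's hex-float notation):

```python
x = 1.0 / 3.0

# '%g' keeps only 6 significant digits by default, so the value does
# not survive a write/read round trip through the bins file
lossy = float('%g' % x)
print(lossy == x)                 # False

# hex float notation (like C's %a) represents the double exactly
exact = float.fromhex(x.hex())
print(exact == x)                 # True
```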

So now the C tests pass (including the linear binning integration tests), as does test_lin.py. I tried to support test_lin.py under Python 2.7, but the deeper I got, the more issues appeared. So while linear binning will likely run under Python 2.7, I think we'll just have to say the tests won't.

@manodeep, can you review test_lin.py and let me know if we are ready to merge?

manodeep commented 2 years ago

Okay, there was a bug in the C tests that took a while to track down. The bins were being written to file with the %g specifier, which uses only 6 digits by default even for doubles. More precision is needed to match the linear binning. Switching to an exact representation with %a fixed the issue.

Awesome - makes sense!

So now the C tests pass (including the linear binning integration tests), as does test_lin.py. I tried to support test_lin.py under Python 2.7, but the deeper I got, the more issues appeared. So while linear binning will likely run under Python 2.7, I think we'll just have to say the tests won't.

Do you have details about the issues you encountered? I am happy to try and take a look...

If we can't test with python2.7, what do you think about only supporting other bin types in >= python3? That way, we also signal that we are planning to drop python2 support

lgarrison commented 2 years ago

Do you have details about the issues you encountered? I am happy to try and take a look...

One problem was that Python 2.7 didn't like some of our test function signatures. I'm sure it's possible to fix, but the changes would touch all the test functions.

If we can't test with python2.7, what do you think about only supporting other bin types in >= python3? That way, we also signal that we are planning to drop python2 support

Sure, we could turn off linear binning just for Python 2.7, but it probably works fine and I tend to see all Python 2.7 usage these days as "use at your own risk" since it's been discontinued for so long! We do have the warning of Python 2.7 Corrfunc deprecation in place, too.

manodeep commented 2 years ago

Ohh good point about the deprecation warning - yup that is a reasonable warning that using python2.7 is in the "user-beware" territory.

Okay - I am going to look through the code in a little more detail. I remember having some thoughts about how the code could be structured, naming conventions etc

lgarrison commented 2 years ago

Sounds good! Let me know if I can help.

manodeep commented 2 years ago

It is embarrassing how long this PR is taking me to review - just commenting to say that I am doing my best to get the review done

manodeep commented 2 years ago

@adematti I am happy to review the remaining code - would that be useful?

manodeep commented 2 years ago

@lgarrison Not sure if @adematti has notifications enabled or perhaps is slammed. How do you want to proceed on this PR? Seems like there might be a good use-case and the speedups are fairly reasonable from what I recall

adematti commented 2 years ago

Hi @manodeep, sorry for the delay (saw your message, then got extremely busy). I think this is up to you and Lehman! (the Corrfunc version we are using within DESI has evolved significantly to address new weighting schemes & other line-of-sight definitions).

lgarrison commented 2 years ago

I'm happy to merge this, it's a great feature as-is! @adematti, are there any bugs you found in the linear binning that might have been addressed in the DESI fork but not here?

adematti commented 2 years ago

Yes, I was just looking at this. The only bug I remember was due to this line: https://github.com/adematti/Corrfunc/blob/6ae2b9f4e470558ceb0b1d59fb0c2f62ff4bdbdc/mocks/DDtheta_mocks/countpairs_theta_mocks_kernels.c.src#L924, which should be: if (need_rpavg || bin_type == BIN_LIN) { (see https://github.com/adematti/Corrfunc/commit/8d95155dfb9da1f350cd8e318f6358e55f3a0018)

manodeep commented 2 years ago

Ahh thanks @adematti :)

@lgarrison Do you want to merge this PR into a separate branch (we have one already right?) and then sort it out from there?

manodeep commented 2 years ago

Yes, I was just looking at this. The only bug I remember was due to this line: https://github.com/adematti/Corrfunc/blob/6ae2b9f4e470558ceb0b1d59fb0c2f62ff4bdbdc/mocks/DDtheta_mocks/countpairs_theta_mocks_kernels.c.src#L924, which should be: if (need_rpavg || bin_type == BIN_LIN) { (see adematti@8d95155)

@adematti This fix is already in the PR, right?

lgarrison commented 2 years ago

@manodeep I think this branch still needs to be fixed; I'll do that in the next day or two.

adematti commented 2 years ago

(I found myself a couple of minutes to fix the bug)

lgarrison commented 2 years ago

@manodeep Checks are passing; let's merge?