manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

BUG DDrppi buggy for cross-corr in periodic spaces #210

Closed beckermr closed 1 year ago

beckermr commented 4 years ago

General information

Issue description

I am computing the pairs by hand and not getting what corrfunc reports.

Expected behavior

It should match the brute force computation.

Actual behavior

It does not match the brute-force computation.

What have you tried so far?

Several checks to see whether the code is missing a single particle. It is not, so the bug is more subtle; it does not appear to be a bin-edge effect.

Minimal failing example

import numpy as np
import Corrfunc

lbox = 120
zmax = 60
n_pi = int(zmax)
rmin = 40
rmax = 45
n_r = 2
rbins = np.array([0, rmin, rmax])

lbox_2 = lbox / 2

npts = 11
rng = np.random.RandomState(seed=42)
x1 = rng.uniform(low=0, high=lbox, size=npts)
y1 = rng.uniform(low=0, high=lbox, size=npts)
z1 = rng.uniform(low=0, high=lbox, size=npts)
w1 = rng.uniform(size=npts) * 0 + 1

nparts = 100
px1 = rng.uniform(low=0, high=lbox, size=nparts)
py1 = rng.uniform(low=0, high=lbox, size=nparts)
pz1 = rng.uniform(low=0, high=lbox, size=nparts)

print("\n")

dd_cf = Corrfunc.theory.DDrppi(
    False,
    1,
    zmax,
    rbins,
    x1,
    y1,
    z1,
    weights1=w1,
    X2=px1,
    Y2=py1,
    Z2=pz1,
    weights2=np.ones_like(px1),
    periodic=True,
    boxsize=lbox,
    weight_type="pair_product",
    verbose=True,
    isa="fallback",
)
dd_cf = dd_cf["weightavg"].reshape((n_r, n_pi)) * dd_cf["npairs"].reshape(
    (n_r, n_pi)
)

print("\n")
print("testing particles one by one...")

# brute-force pair count with minimum-image periodic wrapping
dd = np.zeros(n_r)
for hind, (hx, hy, hz) in enumerate(zip(x1, y1, z1)):

    for px, py, pz in zip(px1, py1, pz1):
        dx = np.abs(hx - px)
        dy = np.abs(hy - py)
        dz = np.abs(hz - pz)

        # minimum-image convention: wrap each separation into [0, lbox/2]
        if dx > lbox_2:
            dx = lbox - dx
        if dy > lbox_2:
            dy = lbox - dy
        if dz > lbox_2:
            dz = lbox - dz

        if dz <= zmax:
            r = np.sqrt(dx * dx + dy * dy)

            if r < rmin:
                dd[0] += w1[hind]
                if dd_cf[0, int(dz)] == 0:
                    print(
                        "    def missed part:",
                        r,
                        dz,
                        int(dz),
                        dd_cf[0, int(dz)],
                    )

            elif r >= rmin and r < rmax:
                dd[1] += w1[hind]
                if dd_cf[1, int(dz)] == 0:
                    print(
                        "    def missed part:",
                        r,
                        dz,
                        int(dz),
                        dd_cf[1, int(dz)],
                    )

print("dd corrfunc:", np.sum(dd_cf, axis=1))
print("dd brute   :", dd)

dd_cf_sum = np.sum(dd_cf, axis=1)
assert np.array_equal(dd_cf_sum, dd)

This script outputs

(anl) clarence:Desktop beckermr$ python corrfunc_test.py 

[Warning] The CPU supports AVX2 but the compiler does not.  Can you try another compiler?
ND1 =           11 [xmin,ymin,zmin] = [2.470139,16.739263,5.574050], [xmax,ymax,zmax] = [114.085717,116.389182,94.221115]
ND2 =          100 [xmin,ymin,zmin] = [0.662654,0.607390,1.727219], [xmax,ymax,zmax] = [118.426432,118.278054,118.806462]
Running with points in [xmin,xmax] = 0.662654,118.426432 with periodic wrapping = 120.000000
Running with points in [ymin,ymax] = 0.607390,118.278054 with periodic wrapping = 120.000000
Running with points in [zmin,zmax] = 1.727219,118.806462 with periodic wrapping = 120.000000
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 5,5,2.  Time taken =   0.000 sec
countpairs_rp_pi_double> gridlink seems inefficient. nmesh = (5, 5, 2); avg_np = 0.22. Boosting bin refine factor - should lead to better performance
xmin = 0.662654 xmax=118.426432 rpmax = 45.000000
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 8,8,2.  Time taken =   0.000 sec
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 8,8,2.  Time taken =   0.000 sec
Using fallback kernel
0%.........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done. Time taken =  0.001 secs

testing particles one by one...
    def missed part: 41.087356477185 48.94941175529149 48 0.0
dd corrfunc: [362.  93.]
dd brute   : [387.  96.]
Traceback (most recent call last):
  File "corrfunc_test.py", line 99, in <module>
    assert np.array_equal(dd_cf_sum, dd)
AssertionError
beckermr commented 4 years ago

@aphearin @manodeep I have been banging my head against this for a week. Here is my most minimal example of something I don't understand. Hopefully, it is my code!

manodeep commented 4 years ago

Thanks for the bug report @beckermr - let me check whether I can reproduce it. I did have a couple of questions though - i) Does the bug still appear if you reduce zmax to a value smaller than Lbox/2, say 55 ? and ii) Can you reproduce the bug with Corrfunc installed from source/pip? (Related, I thought that the conda installed Corrfunc was called corrfunc - is that not the case?)

lgarrison commented 4 years ago

Just confirming that I can run your script and get the same output; thanks for making the minimal reproducer! Haven't had a chance to track down the bug yet.


beckermr commented 4 years ago

@manodeep yes I can reproduce the bug in both of those cases.

manodeep commented 4 years ago

I suspect this is connected to the relative size of rmax and zmax to Lbox. For instance, if I change the (xbin, ybin, zbin) refine-factors to (1, 1, 1), then we have many mismatches. And if I increase all the bin-refine-factors to >= 3, then the bug disappears. Since allowing such large relative separations is a new feature introduced with #189 (via commit https://github.com/manodeep/Corrfunc/commit/e53bce3050139337dae5723a3601e707f1f37920), maybe a bug was introduced there.

@lgarrison At minimum, we will need a separate test case (from a brute force output) to test these sorts of large separations.

manodeep commented 4 years ago

If I disable the duplicate cell-checking in here, then the bug disappears. However, we need to work through the logic and make sure we are not double-counting under any circumstances...

lgarrison commented 4 years ago

@manodeep is it possible that we are missing particles when we do duplicate cell elimination because we are only considering one "wrapping" direction? I think we need to consider both; the question is how to do so without double counting pairs. Probably the answer lies in some restriction like Rmax < CellSize/2...

aphearin commented 4 years ago

Once this is resolved, do y'all have a brute-force pair-counter test in the unit tests to protect against regression? I find it's really hard to fool those kinds of tests. Let me know if it would be useful for me to pass one along your way.

manodeep commented 4 years ago

@lgarrison That is definitely what is happening here. Consider zmax = Lbox/2 with 2 bins along z. In that case, the top half of the iz=0 cell can be within zmax of the bottom half of the iz=1 cell, and the bottom half of iz=0 can be within zmax of the top half of iz=1 after periodic wrapping. However, we will only associate (iz=0, iz=1) once, and without periodic wrapping. Essentially, the duplicate check has to make sure both directions of the cell-pairs are included. For this specific scenario, no double-counting should occur, but I am not sure about all the possible combinations of bin-refs, min-sep-opt, and zmax/Lbox ratios ranging between [0.3-0.5] (i.e., between 2 <= nzbins <= 3). I am reasonably confident that once nzbins > 3, Corrfunc works correctly. We would need to run the tests with the COMPREHENSIVE_INTEGRATION_TESTS option enabled.
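The mechanism above can be sketched in plain Python (an illustrative 1D toy, not Corrfunc's actual gridlink code; all names here are made up). With only two cells along z and zmax close to Lbox/2, cells 0 and 1 are neighbours in both periodic directions, so a traversal that visits the unordered cell pair (0, 1) once and checks only the direct separation misses the pairs that are close through the wrap:

```python
import numpy as np

lbox, zmax, ncell = 120.0, 55.0, 2
rng = np.random.RandomState(0)
z = rng.uniform(0, lbox, size=200)
cell = (z // (lbox / ncell)).astype(int)   # cell index along z

def count(z1, z2, wrap):
    # number of ordered pairs with 1D separation <= zmax
    dz = np.abs(z1[:, None] - z2[None, :])
    if wrap:                               # minimum-image convention
        dz = np.minimum(dz, lbox - dz)
    return int(np.sum(dz <= zmax))

# reference: brute force over all ordered pairs, always wrapped
brute = count(z, z, wrap=True)

# cell-based count, visiting each unordered cell pair exactly once
buggy = correct = 0
for ci in range(ncell):
    for cj in range(ci, ncell):
        z1, z2 = z[cell == ci], z[cell == cj]
        mult = 1 if ci == cj else 2        # (ci,cj) and (cj,ci) orderings
        # "buggy": distinct cell pairs checked in one direction only (no wrap)
        buggy += mult * count(z1, z2, wrap=(ci == cj))
        # "correct": wrapped separation checked for every cell pair
        correct += mult * count(z1, z2, wrap=True)

print("brute =", brute, "correct =", correct, "buggy =", buggy)
```

The "correct" traversal reproduces the brute-force count, while the one-direction version undercounts exactly the pairs that are only close through the periodic wrap.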

@aphearin Thanks for the offer - we will absolutely need to add tests for large Rmax/zmax for all the pair-counters.

@lgarrison In the meantime, should we revert the previous change (introduced in v2.3.2) that allows large ratios of (Rmax, zmax)/Lbox?

lgarrison commented 4 years ago

@manodeep Yes, I think since we happen to be making a release right now anyways we should revert this change while we consider the best way to fix this.  Let's double-check with @beckermr that it fixes his problem before releasing!

manodeep commented 4 years ago

@lgarrison If we revert the change, then Corrfunc will report an error whenever any of the (nx, ny, nz) bin counts is < 3, and @beckermr will not be able to run this example.

manodeep commented 1 year ago

@beckermr This should be fixed now with #277. As long as Rmax < Lbox/2, Corrfunc should correctly compute the pairs.
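One way to see why the Rmax < Lbox/2 condition matters (an illustrative 1D sketch, not Corrfunc code): under the minimum-image convention, a pair can be within Rmax through at most one periodic image only when Rmax < Lbox/2; once Rmax reaches Lbox/2, some separations are within Rmax through two images, and the pair count becomes ambiguous.

```python
import numpy as np

lbox = 120.0
dz = np.linspace(0.0, lbox, 1001)              # direct 1D separations in [0, lbox]
images = np.stack([dz, dz - lbox, dz + lbox])  # the three nearest periodic images

def max_images_within(rmax):
    # largest number of periodic images of any single pair that land within rmax
    return int(np.max(np.sum(np.abs(images) <= rmax, axis=0)))

print(max_images_within(55.0))  # 1: rmax < lbox/2, each pair has a unique close image
print(max_images_within(61.0))  # 2: rmax > lbox/2, some pairs are close both ways
```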

Closing the issue but please feel free to reopen if you are still facing any issues

beckermr commented 1 year ago

Thank you!