manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License
163 stars 50 forks source link

WIP: enable larger Rmax (up to half the box size) #277

Closed lgarrison closed 1 year ago

lgarrison commented 1 year ago

In looking at the gridlink/cell pair code for #276, I was reminded that we are explicitly computing the offset between pairs of cells e.g. here: https://github.com/manodeep/Corrfunc/blob/c3e260d70888c2cdc9e772b5cd97e59068d2522e/utils/gridlink_impl.c.src#L504

Which made me think: if we know the offset of each cell pair, and that offset is applied in the kernels, then what's the harm in adding duplicate cell pairs? They will not find duplicate particle pairs, because a cell pair can be added at most once with a positive offset, and once with a negative. So as long as we enforce Rmax < boxsize/2, there ought not to be duplicate counts.

So allowing duplicate cell pairs, the usual tests seem to be passing (but those have small Rmax). I did add a simple large Rmax test with two particles, which seems to work.

This is also related to the old issue #210. I did test that reproducer, which fails with an "Rmax too large" error in master but succeeds with this PR.

@manodeep Am I missing something? Is there a counterexample where this doesn't work?

This is relatively low priority, but it does relate to the question of the conditions to avoid duplicates in https://github.com/manodeep/Corrfunc/pull/276#issuecomment-1189176506 (or rather, removes that question entirely).

manodeep commented 1 year ago

I don’t think we can add duplicate cell pairs.

Consider a set of points - one that is close to the left edge and another that is close to the right edge (and where Rmax << Lbox). If we add duplicate cell-pairs (with a positive and a negative offsets), then that particle pair would be counted twice.

lgarrison commented 1 year ago

But isn't the particle distance with the negative offset very small, and it is thus counted, and the distance with the positive offset very large, and thus not counted?

I implemented it as a test here: https://github.com/manodeep/Corrfunc/pull/277/files#diff-c7d4836bf94aa48cd9d4e4410ddab869be6ec08d83dfc0999ee5eac26cd8d30cR158

and it seems to work correctly. I may have misunderstood you, though. Could you look and see if the test is actually checking the scenario you had in mind?

(adding a milestone just to make the CI happy)

manodeep commented 1 year ago

The distance between the two points would be the same in both the positive and negative offset cases. In one case the particle close to the right edge is shifted by xwrap towards the left edge and therefore ends up to the left of the second particle.

For a more concrete 1D example, say two particles A and B are located at x_A = 0 + alpha, and x_B = Lbox - alpha.

This new distance with the negative offset would be the same as computed for the positive offset case. Therefore, for an autocorr with duplicated cell-pairs, particle pairs would get repeated

lgarrison commented 1 year ago

Sorry, I think I confused the issue by saying "positive offset" and "negative offset". In the case of an autocorrelation, the negative offset is actually "no offset". Offsets are only applied to achieve periodicity (as all particle positions are global), and one of the cell pairs crosses the wrap (the positive offset) and the other does not (no offset).

So in the two-particle example you gave, we have cell A and cell B. The cell pair generation goes as:

  1. We look for cell pairs using cell A as the primary, and find none because of the usual autocorrelation optimization to only consider half the cell pairs (icell < icell2 in the gridlink code).
  2. We look for cell pairs with cell B as the primary, and find two:
    • One at index offset -1, which is not a negative positional offset but is zero offset, because we did not have to cross the wrap
    • One at index offset +1, which does have a positive offset because it cross the wrap.

Only the second cell pair brings the particles adjacent to each other, so only one particle pair is found, and the count is doubled at the end (the usual doubling since the autocorr considers half the particle pairs).

The cross-correlation case is the same, except we find two cell pairs that generate counts, and the counts are not doubled at the end.

So I think this all works because we know whether we had to cross the wrap to make a cell pair. Thus "duplicate" cell pairs don't have the same positional offset.

I've stepped through the cell pair generation pretty closely, and I think the min-sep-opt will need to be re-validated for this PR to be production-ready. But the more I look at this, the more I think the basic principle is sound.

manodeep commented 1 year ago

Hmm - I may have misunderstood - let me back up. I thought you were saying that adding duplicate cell-pairs was okay (i.e., adding both (cellA, cellB) and (cellB, cellA) as two separate cell-pairs) and would not result in duplicated particle pairs. Is that the case?

lgarrison commented 1 year ago

Yes, because any time we add a duplicate cell pair, it is with a different offset (xwrap value).

manodeep commented 1 year ago

Not following then how you could avoid duplicate pair-counts if you add duplicate cell-pairs. Let's pretend that we manually add a duplicate cell-pair corresponding to a real one for an auto-corr calculation. In the case without any offsets (i.e., cell-pairs where no domain edges are crossed), how could you avoid duplicating the pair-counts that were already counted in the first real cell-pair?

In the extreme case, where there is only one cell in the domain, then duplicating the cell-pair (i.e., the self-cell-pair), would certainly result in duplicate particle pair counts.

I must be missing something :(

lgarrison commented 1 year ago

Oh, I definitely agree that if you add a truly identical cell pair—same icell & icell2, and same xwrap—you get duplicate particle counts! My point is just that generate_cell_pairs does not produce such truly identical pairs.

In fact, maybe this suggests the most conservative change is to adjust the duplicate detection to take into account the x/y/zwrap values. I've implemented this in the latest commit. I've also added tests for multiple combinations of bin refinement, max_cells_per_dim, min_sep_opt, and autocorr. What do you think about this approach?

lgarrison commented 1 year ago

Hey @manodeep, have you had a chance to look at the new approach to this problem?

manodeep commented 1 year ago

Thanks for your patience @lgarrison - on parental leave at the moment, so time is extremely limited.

I took another look at the PR - I am not really sure that I understand the improvement. Thus far, Corrfunc has required a minimum of 3 cells for any calculation - otherwise, it fails with "Rmax too large" error.

I am not seeing how a duplicate cell-pair could be added with different offsets. For an autocorr, a cell-pair (icell, icell2) should always be added with the same set of three offsets. The duplicate cell check macro only checks for the same cell-pair with the first cell as the primary cell - so the offsets should always be the same (i.e., there is no need to pass the xwrap into the macro). Now, if we consider the opposite cell-pair, i.e., (icell2, icell) then the relevant offsets would be negated but then would fail the icell2 > icell test. Once again, I feel that I am missing something ...and now I can blame the sleep-deprived brain :)

lgarrison commented 1 year ago

Thanks for your patience @lgarrison - on parental leave at the moment, so time is extremely limited.

Oh, no worries! There's no rush.

I took another look at the PR - I am not really sure that I understand the improvement. Thus far, Corrfunc has required a minimum of 3 cells for any calculation - otherwise, it fails with "Rmax too large" error.

The improvement is to remove this 3-cell minimum requirement, and thus allow all Rmax < L/2. It also simplifies the Rmax logic when we have per-axis boxsizes, as in #276, which is why I started thinking about this.

I am not seeing how a duplicate cell-pair could be added with different offsets. For an autocorr, a cell-pair (icell, icell2) should always be added with the same set of three offsets. The duplicate cell check macro only checks for the same cell-pair with the first cell as the primary cell - so the offsets should always be the same (i.e., there is no need to pass the xwrap into the macro). Now, if we consider the opposite cell-pair, i.e., (icell2, icell) then the relevant offsets would be negated but then would fail the icell2 > icell test.

Consider the case of two cells, evenly dividing the box. A particle pair could (1) straddle the boundary in the middle of the box, or (2) straddle the periodic wrap. Thus, we want to add the pair (icell2, icell) in duplicate, with two different offsets: zero, and -L. The zero offset will detect (1); the -L offset will detect (2).

(Recall the gridlink "offset" is the adjustment applied to the particle separations, rather than the distance between cell centers)

So that's why we want to add duplicate cell-pairs with different offsets: to detect both those cases. As you point out, icell2 will be the primary in both (1) and (2) because of the icell2 > icell test.

If you're asking if the duplicate macro check does anything at all, I think the answer is no! But it sounded like in https://github.com/manodeep/Corrfunc/pull/277#issuecomment-1195394749 you were concerned about adversarial examples, hence keeping the check in place.

This same example is stepped through more carefully in https://github.com/manodeep/Corrfunc/pull/277#issuecomment-1192830279. A version of this is also implemented here for concreteness.

manodeep commented 1 year ago

@lgarrison Think I have finally understood where the update is relevant - the check for duplicate cells was throwing me off. If I understand correctly, the fix is relevant when the number of bins along any dimension is smaller than the 2*bin_refine_factor + 1 . Specifically, this would be relevant for cases with bin_refs = [1, 1, 1] but rmax such that only two cells are generated. In such a case, you want to add the cell-pair [0, 1] with no positional offset, and then add the (duplicate) cell-pair [0,1] but with a negative positional offset. That's why the duplicate check requires the xwrap/ywrap/zwrap. Have I understood correctly?

If so, then perhaps the check for duplicate cells could be modified to only run when nxbins <= 2*xbinref || nybins <= 2*ybinref || nzbins <= 2*zbinref?

lgarrison commented 1 year ago

@lgarrison Think I have finally understood where the update is relevant - the check for duplicate cells was throwing me off. If I understand correctly, the fix is relevant when the number of bins along any dimension is smaller than the 2*bin_refine_factor + 1 . Specifically, this would be relevant for cases with bin_refs = [1, 1, 1] but rmax such that only two cells are generated. In such a case, you want to add the cell-pair [0, 1] with no positional offset, and then add the (duplicate) cell-pair [0,1] but with a negative positional offset. That's why the duplicate check requires the xwrap/ywrap/zwrap. Have I understood correctly?

Yes, exactly!

If so, then perhaps the check for duplicate cells could be modified to only run when nxbins <= 2*xbinref || nybins <= 2*ybinref || nzbins <= 2*zbinref?

I think this is already the case. The check_for_duplicate_ngb_cells variable does exactly this test: https://github.com/manodeep/Corrfunc/pull/277/files#diff-96f43764787ed164aaf40bc5cb9f61d347d18e748d51a6b29346f3865392eb94R484-R487

So do you think this means this PR is ready to go? I think we can also close #276 pretty quickly after this is merged.

manodeep commented 1 year ago

If so, then perhaps the check for duplicate cells could be modified to only run when nxbins <= 2*xbinref || nybins <= 2*ybinref || nzbins <= 2*zbinref?

I think this is already the case. The check_for_duplicate_ngb_cells variable does exactly this test: https://github.com/manodeep/Corrfunc/pull/277/files#diff-96f43764787ed164aaf40bc5cb9f61d347d18e748d51a6b29346f3865392eb94R484-R487

Can't tell what that code fragment points to and reading the macro definition didn't show any signs of the if condition I mentioned. What am I missing?

lgarrison commented 1 year ago

Now that I (mostly) understand what's going on, the changes seems good. Perhaps, add one more test with ~100 particle positions, say 50 each around [$0 \pm \delta$, $Lbox/2 \pm \delta$] mod Lbox and check that the pair counts from Corrfunc match those from a brute-force.

Good idea; I've added this test, comparing against brute force, and it passes.

Can't tell what that code fragment points to and reading the macro definition didn't show any signs of the if condition I mentioned. What am I missing?

Hmm, looks like that link is broken. Here is one that should work: https://github.com/manodeep/Corrfunc/blob/e45f5db100dcf4150854834454b9f4a7030e4f1f/utils/gridlink_impl.c.src#L484-L487

The macro is only called if this condition holds.

manodeep commented 1 year ago

Thanks for your patience - yup, looks good to me now! Once the CI passes, this can be merged

lgarrison commented 1 year ago

Great, thanks! I'll revisit #276 next, once I remind myself what I was doing there 😅