manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License
163 stars 50 forks source link

unnecessary vpf check? #237

Closed lgarrison closed 3 years ago

lgarrison commented 3 years ago

There's a check in vpf to make sure the total volume in spheres doesn't exceed the box volume: https://github.com/manodeep/Corrfunc/blob/master/Corrfunc/theory/vpf.py#L207

But I'm not sure that condition is necessary? Spheres are allowed to overlap in vpf, right?

@manodeep do you know why this is there?

manodeep commented 3 years ago

it's a sanity check to make sure that the total number of spheres requested is sensible. If the total volume exceeds the box volume, then there must be some non-independent data-points, and therefore, the total number of spheres used is no longer a fair representation of the amount of scatter expected. In those cases, it is betterr to use a smaller number of random spheres (which might overlap). Does that make sense?

lgarrison commented 3 years ago

I guess I was thinking about it like this. The VPF tells you what the probability is of having 0 (or N) particles in a sphere thrown down at random in a box. So if we throw down as many such random spheres as possible, isn't the fraction with N exactly what we want to know?

I agree a single box is only going to be a single sampling of the "true" VPF, but I would think limiting the number of spheres just gives you a noisier estimate of the VPF, not a less biased one.

Perhaps the real issue is that one shouldn't trust the large-radius VPF measurements from one box, but maybe one wants to be able to get a low-noise estimate of it, even if it is biased, so one can average over multiple boxes, for example. And at small scales, maybe it's useful to be able to "oversample" to beat down noise some more when it's at a radius that you do trust the measurement.

manodeep commented 3 years ago

I am not sure how repeatedly re-sampling the same volume would give a less noisy estimate of the (large-R) VPF. If anything, the larger number of spheres will make you underestimate the error on the VPF. Of course, the only way to really get the scatter in the VPF is by looking at the scatter in the measurement over many different realisations.

Having said all that, I have never tested the bias-variance - seems like an easy enough test to run and check whether limiting the total volume is meaningful. And I can certainly see an use-case for a keyword parameter that allows the user to over-ride the volume check.

lgarrison commented 3 years ago

For the large-R VPF, if one is going out to 20% of the box, for example, then that's only 30 spheres. So the granularity of the VPF is only 3%! That's a pretty extreme case, but perhaps we should let the user make this choice. What do you think about changing the error to a warning? That's slightly less overhead than a new parameter, but I'm also fine with that.

manodeep commented 3 years ago

If the user really wants to probe Rmax to 20% of Lbox, then restricting the total number of spheres seems like a good idea!

Though I like the idea of changing to a warning, I have found that people just ignore warnings. It would be a little more work to put in a override_volume_check keyword, but at least that will ensure that users have to consciously make that choice.

lgarrison commented 3 years ago

Well, even a much more reasonable case, like 5% of the box, is only 2000 spheres, which may not be enough to capture rare small-scale events that may nonetheless hold interesting information.

If the user wants the vpf across a range of Rmax scales, we might say from a computational perspective that it makes sense for the user to run vpf many times, decreasing nspheres as one increases Rmax. But the structure of the API doesn't really suggest that, since we are providing nbins to allow simultaneously measuring across many scales. So... given that the vpf is cheap, I am in favor of lifting the restriction on spheres altogether and making it a warning! What do you think?

On Tue, Nov 10, 2020 at 5:08 PM Manodeep Sinha notifications@github.com wrote:

If the user really wants to probe Rmax to 20% of Lbox, then restricting the total number of spheres seems like a good idea!

Though I like the idea of changing to a warning, I have found that people just ignore warnings. It would be a little more work to put in a override_volume_check keyword, but at least that will ensure that users have to consciously make that choice.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/manodeep/Corrfunc/issues/237#issuecomment-724996707, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABLA7S77LXC5TQCUKYZPVDLSPG2WTANCNFSM4TNEVXUA .

-- Lehman Garrison lgarrison@flatironinstitute.org Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

manodeep commented 3 years ago

Are you concerned with the extra code required to over-ride the default behaviour? To me, it seems like adding a boolean keyword override_volume_check or allow_oversampling would not be that much code but would let the user get the desired behaviour. Regardless, I do agree that the behaviour might be desirable - I am trying to avoid making any default choices that can impact the fidelity of the results.

I am not sure how you would capture a rare, small-scale event with the vpf? The measurement (i.e., the mean of nspheres) is unlikely to be affected much, the standard deviation might be - is that the kind of effect you are thinking about?

lgarrison commented 3 years ago

Yes, it's a combination of not wanting to add more surface area to the API, and a lingering feeling that this check shouldn't be there at all. The condition forces you to miss a lot of sphere configurations!

By "rare events", I mean low-amplitude counts-in-cells. I think for any radius (doesn't have to be small), if you go to high enough (or small enough) counts, the probability p(N) will be very low. And then you probably want at least 10/p(N) spheres to get above Poisson noise. But the volume condition doesn't know anything about p(N), so in many cases it won't let you use the appropriate number of spheres!

manodeep commented 3 years ago

Seems like you feel fairly strongly about this :). Changing to a warning falls under the same scope as checking for the ratio - Rmax/Lbox in the correlation function caiculations - something we do not do currently.

Since the change will alter the default behaviour, should we milestone for a minor release (i.e., 2.4.0) rather than a patch release?

lgarrison commented 3 years ago

Sorry! I swear I wasn't planning on dying on this hill... I haven't worked with the VPF before, so I'm just trying to think it through and make sure I've wrapped my head around what's going on. And hopefully doing so will help newcomers to the code, especially since I think many students use corrfunc!

So do you think you agree with my assessment? If so, then yes, 2.4.0 sounds good to me. Do you think we could also do the boxsize change (#199) in that release?

manodeep commented 3 years ago

Hehehe - definitely seemed like the hill :). All good - let's milestone this change for 2.4.0. Did you want to PR this up?

Seems like the boxsize PR has some merge conflicts - are you able to fix those? Otherwise, I think that one is good to go but I will take one final look (it's been a while)

I am assuming this means that the next release will be v2.4.0. There are a lot of issues that are milestoned for 2.4.0 - we might have to go through and check what's feasible

lgarrison commented 3 years ago

Yes, sounds good, I'll take a look tomorrow.

laperezNYC commented 3 years ago

Hi @manodeep! I'm Lucia, and I'm the one who brought up this issue to @lgarrison while using CORRFUNC for my current projects. I was throwing CORRFUNC at much smaller volumes that I'm sure you wrote this for, and was struggling with the code stumbling for radii where I knew the VPF could still be trustworthy and measured, and finding it was giving inconsistent results because of the undersampling the check imposed. (So if anyone is dying on this hill, it's me, lol.)

My paper that's in press right now is very relevant here, and @lgarrison 's mentioned some things that I brought up to him from my work: https://arxiv.org/abs/2011.03556. Specifically, Section 3.3 discusses some common-sense limits of when the VPF can be measured given a sample's density and volume, which quantifies the concern of when a radius is too large. Still, regardless of whether the radius you're measuring the VPF for is valid or can give accurate results, one should still oversample with the dropped test spheres to at least guarantee that the act of measuring the VPF isn't adding additional uncertainty.

I'm about to PR a branch with the updated vpf.py that completely omits the check, though I did draft one earlier that gave folks the option to override the check with a flag. CORRFUNC is awesome and I'm happy to contribute a tiny improvement to it!

manodeep commented 3 years ago

Hi @laperezNYC - my name is Manodeep - nice to e-meet you! Thank you for using Corrfunc and even more thanks for contributing an improvement.