manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License
165 stars 50 forks source link

Combining DDs when calculating a jackknife #125

Closed tmcclintock closed 7 years ago

tmcclintock commented 7 years ago

Hello again! I would like to use Corrfunc to calculate correlations between jackknife subregions in a snapshot. This would result in a whole bunch of calls the DD, both autocorr for all subregions and crosscorrelations between subregions. I am fine resumming DDs appropriately, but my question is - is there a function similar to convert_3d_counts_to_cf that only needs DDs and doesn't also need DRs and RRs?

In my case, since I'm doing a jackknife, I need to do a "leave-one-out" calculation for each subregion. This means the geometry isn't a perfect cube, so I can't call xi() directly. Instead I have to find all the DDs individually between the remaining subregions. If I have to I can generate randoms and find all the DRs and RRs, but if there is a way where I don't have to that would be very helpful. Thanks!

manodeep commented 7 years ago

There is no pre-canned routine in Corrfunc that will solve the issue for you. But I am fairly sure halotools has the what you want. @aphearin ?

aphearin commented 7 years ago

@tmcclintock - yes, there is a function in halotools that does what you need, it's called tpcf_jackknife.

>>> from halotools.mock_observables import tpcf_jackknife

You can find the documentation here. The function is written so that repeated passes over the data are not necessary to do the "leave-one-out" calculation - only a single pass is made. I think the API is the simple one that you want: you just pass in your data and instructions for how to do the subvolume division, and you are returned the appropriate covariance matrix.

tmcclintock commented 7 years ago

Great, thank you so much!

lgarrison commented 7 years ago

Just noting that halotools implements jackknifing with a tag scheme to define whether a particle is inside or outside the jackknife region. If someone were interested in having this functionality in Corrfunc, they could write a custom weighting function where one of the weight fields is interpreted as a tag and proceed as halotools does.

tmcclintock commented 7 years ago

Is that necessary though? Admittedly, I don't know all the intricacies of Corrfunc, but don't you split the particles into subregions of size rmax, and then couldn't you jackknife with those subregions? Obviously this would mean that the covariance for the largest bins aren't trustable at all, but I think it would require minimal additions to the code as is.

lgarrison commented 7 years ago

It's true that we do already have a grid decomposition set by rmax (up to the bin refine factors) and that we could re-use that grid for jackknifing. However, this means the size of your jackknife regions is set by rmax and the bin refine factors, which may not always be what you want. The weight/tag approach is nice because it allows for completely arbitrary jackknife regions, and the code only has to be modified with an additional weight function, which Corrfunc was designed to support.

aphearin commented 7 years ago

There's also a performance issue with the proposed workaround of forcing rmax to be tied to subvol_size. If you wanted to jackknife using, say, subvolumes that are 1/3 the size of the box, but you're only interested in the tpcf on ~10Mpc scales, you'll take a giant performance hit from counting pairs out to such large scales. The tagging approach allows you to maintain Corrfunc's optimized choices for cf cell sizes.

Halotools uses the exact same algorithm as used in Corrfunc, just implemented in Cython and not as hyper-optimized, so in principle this is a straightforward addition to the framework.

tmcclintock commented 7 years ago

@lgarrison perhaps I'll try to implement it myself next week. Thanks for all your help.

tmcclintock commented 7 years ago

@lgarrison so implementing the weight function that does the jackknifing seems reasonable enough, but the interesting thing from the user standpoint is the covariance matrix. How would this be returned to the user when they call, for instance, xi()?

lgarrison commented 7 years ago

I don't think there's any way to do it without adding an additional return value. At the Python level I think that's okay though; we already conditionally return api_time only if c_api_timer is set, for example. The C API packs the results into a struct called results_countpairs_xi (using xi as an example), so the covariance matrix could be added as an additional field in that struct. The Python interface would separate that matrix into its own return value.

Probably the best thing would be to add a separate argument like jackknife_nsub to the relevant interfaces (C and Python), in analogy with the Nsub argument to halotools tcpf_jackknife. This would then automatically generate jackknife weights/tags for the particles. Alternatively, the user could specify weight_type=jackknife and provide their own tags. In either case, a jackknife covariance matrix will be returned.

tmcclintock commented 7 years ago

I had started implementing the latter, but I actually don't think it's the right way to go. Specifically because if we used weight_type=jackknife then there is no way to have actual weights on the particles when doing the jackknife.

lgarrison commented 7 years ago

A quick hack would be to implement the weight scheme jackknife_pair_product, which accepts two weights per particle instead of one, where the first weight is the tag and the second weight is the real weight value. For pair_product this is fine, but ultimately it would require writing a jackknife weight scheme corresponding to every new weight scheme that came along.

The ideal solution would be to accept lists of Jtags that are separate from the weights, just like halotools does. The jackknifing would then be applied as "post-processing" of whatever value the weight function returned. The code doesn't have a mechanism in-place for this kind of post-processing, and we'd need to think about how flexible this would need to be (i.e. can we hard-code halotools-style jackknife post-processing, or does it need to be user-customizable like the weights?).

@manodeep?

manodeep commented 7 years ago

I think this will require a little bit of code but is do-able. The biggest challenge is to get the particle level sub-volume tags onto the grid, and then a desired covariance matrix as an output (while maintaining compatibility and minimum API breaks).

As @lgarrison mentioned, we do attach an arbitrary set of weights per particle. You would need to define multiple weights per particle (I think the default max weight per particle is ~10) and then within your custom jack-knife weights function, return the appropriate pair-counts.

manodeep commented 7 years ago

I like the (completely) user-customizable weights within Corrfunc.

Another idea is:

aphearin commented 7 years ago

Just a qualitative comment to keep in mind here. I fully appreciate that performance is a core principle of Corrfunc design, but do remember that there are zero use-cases where a jackknife correlation function is done inside a likelihood analysis. So, given finite human wall-clock time, the jackknife calculation doesn't need to be shined down to the bone the way that, for example, wp has been shined. I just wanted to share that perspective before this rabbit hole gets too deep ;-)

manodeep commented 7 years ago

@aphearin Agreed.

Though, in this particular case, the jack-knife features can be implemented without re-jiggering any of the kernel level internals of Corrfunc. You should still get the ~ same raw speeds of a weighted correlation function.

@tmcclintock Do you have to do the jack-knife multiple times? And is halotools not sufficient for the purpose?

manodeep commented 7 years ago

@tmcclintock Just to be clear, we would love to have a jack-knife implementation within Corrfunc and will happily assist in the design. Simply be mindful of the time-commitments...

manodeep commented 7 years ago

@tmcclintock Checking in. Did you need any help on this? Or have you solved the jack-knife issue (one way or another)?

tmcclintock commented 7 years ago

Ah my apologies. I ended up using different tools, so I never got to implementing this. To be honest I doubt I will have time to :(.

manodeep commented 7 years ago

@tmcclintock No worries at all; we only have so much time :). Glad that you got it sorted out.