manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License

special handling of linear binning #258

Closed adematti closed 2 years ago

adematti commented 2 years ago

Hi,

As discussed with Lehman for DESI applications, this is a first attempt to implement special handling of linear binning, which helps Corrfunc run faster with a large number of bins. There is no speed gain at a low number of bins (~10), but a speed gain of > 3x for ~200 bins.

I added a flag bin_type to the Python wrapper, to choose between 'auto', 'lin' and 'custom': 'lin' for linear binning, and 'auto' (default) to automatically detect linear binning (which happens here: https://github.com/adematti/Corrfunc/blob/c500d2c9137ff7f66d34eb4189bc19f9325d84ff/utils/utils.c#L85).

The following pair counters have been updated:

- mocks: DDrppi_mocks, DDsmu_mocks
- theory: DD, DDrppi, DDsmu, xi, wp

I did not update DDtheta_mocks (yet), as there are implementation choices to be made:

a) Should I consider the binning to be linear in theta or in acos? (I guess the former, as the pair counts for the latter can be obtained from DD if the maximum separation is < \pi.)
b) Should I use fast acos to compute the bin for a given pair? In this case, a pair can fall in the wrong bin due to numerical approximation.

To be discussed, also:

1) Should I pass rstep (bin width) and rmin to the kernels directly? (I am currently taking the square root of sqr_rmin and sqr_rmax, which is slightly suboptimal.)
2) Should I print the chosen binning type? If so, where? In each countpairs_DOUBLE function?
3) Should I keep the 'auto' option? It checks for linear binning with an absolute and relative tolerance of 1e-12 for both doubles and floats. I am not sure how it behaves in practice for the latter case (floats); see remark ii) below.

I also have a couple of remarks:

i) This line https://github.com/manodeep/Corrfunc/blob/596fe77078d59b296b34608927e301c427331919/Corrfunc/utils.py#L461 seems unnecessary (as array updates are done in place, and calls to this function do not retrieve the returned arrays)?
ii) I don't think setup_bins_float is used anywhere in the code, and setup_bins_double is only used here: https://github.com/manodeep/Corrfunc/blob/74c6fc29f9a0236eaebbc1830a1f59fd9a53bfc8/mocks/DDtheta_mocks/countpairs_theta_mocks_impl.c.src#L534 Is there any reason not to use a single version of this function, setup_bins?
iii) Some time ago I had binning errors with Corrfunc after calling matplotlib. I figured out this was due to how the string separator for floats (the decimal separator) was handled: matplotlib changed the environment variables specifying this convention, and export LC_NUMERIC=C solved it. To keep other people from hitting this issue, I guess the simplest solution would be to pass the bins as an array rather than as a file to be read from disk. Is there any reason not to do so?
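To illustrate why linear binning pays off mainly at large bin counts, here is a minimal Python sketch of the two ideas described above: detecting evenly spaced edges within a tolerance, and computing a pair's bin directly from rmin and rstep instead of scanning the edges. The function names (is_linear_binning, bin_index_scan, bin_index_linear) are hypothetical and are not part of Corrfunc; the real detection lives in utils/utils.c and the real kernels are written in C, so this shows only the principle, not the actual implementation.

```python
import math


def is_linear_binning(edges, atol=1e-12, rtol=1e-12):
    """Hypothetical check: True if all edges lie on an evenly spaced grid.

    Each edge is compared with edges[0] + i * step, using the absolute and
    relative tolerances quoted in the PR description (1e-12).
    """
    nbins = len(edges) - 1
    if nbins < 1:
        return False
    step = (edges[-1] - edges[0]) / nbins
    return all(
        math.isclose(edges[i], edges[0] + i * step, rel_tol=rtol, abs_tol=atol)
        for i in range(nbins + 1)
    )


def bin_index_scan(r, edges):
    """Generic lookup for arbitrary edges: cost grows with the number of bins."""
    for i in range(len(edges) - 1):
        if edges[i] <= r < edges[i + 1]:
            return i
    return -1  # separation falls outside all bins


def bin_index_linear(r, rmin, rstep, nbins):
    """Linear-binning lookup: constant cost, independent of the number of bins."""
    if r < rmin:
        return -1
    ibin = int((r - rmin) / rstep)
    return ibin if ibin < nbins else -1


# Usage: four linear bins of width 0.5 between 0 and 2.
edges = [0.0, 0.5, 1.0, 1.5, 2.0]
rmin, rstep, nbins = edges[0], edges[1] - edges[0], len(edges) - 1
if is_linear_binning(edges):
    print(bin_index_linear(0.75, rmin, rstep, nbins))
else:
    print(bin_index_scan(0.75, edges))
```

With only a handful of bins the scan ends after a few comparisons, so the direct formula gains little; with hundreds of bins the scan dominates, which is consistent with the speed-ups reported above. The division can also place a pair lying very close to an edge in the neighbouring bin because of floating-point rounding, which is the same kind of concern raised in point b) for fast acos.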

Thanks! Best, Arnaud

manodeep commented 2 years ago

@lgarrison I was thinking of merging to a different branch - there are some naming conventions etc. that I would like to alter. Is linearbinning-rounding appropriate, or should we just create a new branch?

lgarrison commented 2 years ago

If you're feeling motivated to do another editing pass, then sure, let's merge to a new branch. But the PR has been open for 8 months, and I think functionally it's ready to go! It would be great to get this released so our users can experience the benefits.

manodeep commented 2 years ago

@adematti I have merged this PR into the repo - many many thanks for such a massive effort!

@lgarrison Yeah, I know - but now that I am back at ~full-time, the minor refinements should be much easier to get done.

lgarrison commented 1 year ago

Hey @manodeep, have you had a chance to make these refinements? If not, I suggest we go ahead and merge this into master before the branches diverge even more.