special handling of linear binning

adematti commented 2 years ago

Hi,

As discussed with Lehman for DESI applications, this is a first attempt to implement a special handling of linear binning, which helps corrfunc run faster with a large number of bins. There is no speed gain at low number of bins ~ 10, but a speed gain > 3x for bins ~ 200. I added a flag bin_type to the Python wrapper, to choose between: 'auto', 'lin', 'custom'; with 'lin' for linear binning, 'auto' (default) to automatically detect linear binning (which happens here: https://github.com/adematti/Corrfunc/blob/c500d2c9137ff7f66d34eb4189bc19f9325d84ff/utils/utils.c#L85). The following pair counters have been updated: mocks: DDrppi_mocks, DDsmu_mocks theory: DD, DDrppi, DDsmu, xi, wp I did not update DDtheta_mocks (yet), as there are implementation choices to be made: a) should I consider the binning to be linear in theta or acos? (I guess the former, as the pair counts for the latter can be obained from DD if maximum separation < \pi) b) should I use fast acos to compute the bin for a given pair? in this case, a pair can fall in the wrong bin due to numerical approximation To be discussed, also: 1) should I pass rstep (bin width), rmin to the kernels directly (currently taking the square root of sqr_rmin, sqr_rmax, which is slightly suboptimal) 2) should I print the chosen binning type; in this case, where? in each countpairs_DOUBLE function? 3) should I leave option 'auto'? it checks for linear binning, with absolute and relative tolerance of 1e-12 for both double and floats. Not sure how it behaves in practice for the latter case (floats), see remark ii) below. I also have a couple of remarks: i) this line https://github.com/manodeep/Corrfunc/blob/596fe77078d59b296b34608927e301c427331919/Corrfunc/utils.py#L461 seems unnecessary? (as array updates are done in place and calls to this functions do not retrieve the returned arrays) ii) I don't think setup_bins_float is used anywhere in the code, and setup_bins_double is only used here: https://github.com/manodeep/Corrfunc/blob/74c6fc29f9a0236eaebbc1830a1f59fd9a53bfc8/mocks/DDtheta_mocks/countpairs_theta_mocks_impl.c.src#L534 Is there any reason not to use a single version of this function, setup_bins? iii) Some time ago I had binning errors with Corrfunc after calling matplotlib. I figured out this was due to how the string separation for floats was handled (matplotlib changed the environment variables specifying this convention; export LC_NUMERIC=C solved this.). For this issue not to happen to other people, I guess the simplest solution would be to pass bins as an array rather than a file to be read from disk. Is there any reason not to do so?

Thanks! Best, Arnaud

pep8speaks commented 2 years ago

Hello @adematti! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file Corrfunc/mocks/DDrppi_mocks.py:

Line 225:80: E501 line too long (80 > 79 characters) Line 233:80: E501 line too long (80 > 79 characters) Line 236:80: E501 line too long (81 > 79 characters) Line 346:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/mocks/DDsmu_mocks.py:

Line 235:80: E501 line too long (80 > 79 characters) Line 238:80: E501 line too long (81 > 79 characters) Line 267:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/mocks/DDtheta_mocks.py:

Line 202:80: E501 line too long (82 > 79 characters) Line 212:80: E501 line too long (80 > 79 characters) Line 215:80: E501 line too long (81 > 79 characters) Line 297:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/tests/test_lin.py:

Line 15:1: E302 expected 2 blank lines, found 1 Line 17:26: E231 missing whitespace after ',' Line 17:38: E231 missing whitespace after ',' Line 23:80: E501 line too long (95 > 79 characters) Line 29:67: E251 unexpected spaces around keyword / parameter equals Line 29:69: E251 unexpected spaces around keyword / parameter equals Line 29:80: E501 line too long (98 > 79 characters) Line 33:1: W293 blank line contains whitespace Line 40:52: E231 missing whitespace after ',' Line 44:21: E128 continuation line under-indented for visual indent Line 49:22: E231 missing whitespace after ',' Line 52:1: W293 blank line contains whitespace Line 54:1: W293 blank line contains whitespace Line 55:1: E302 expected 2 blank lines, found 1 Line 55:62: E231 missing whitespace after ',' Line 59:21: E128 continuation line under-indented for visual indent Line 64:22: E231 missing whitespace after ',' Line 67:1: W293 blank line contains whitespace Line 77:44: E231 missing whitespace after ',' Line 77:52: E231 missing whitespace after ',' Line 77:58: E231 missing whitespace after ',' Line 78:80: E501 line too long (116 > 79 characters) Line 78:95: E251 unexpected spaces around keyword / parameter equals Line 78:97: E251 unexpected spaces around keyword / parameter equals Line 80:7: E231 missing whitespace after ',' Line 80:11: E231 missing whitespace after ',' Line 80:14: E231 missing whitespace after ',' Line 81:1: W293 blank line contains whitespace Line 83:19: E128 continuation line under-indented for visual indent Line 84:19: E128 continuation line under-indented for visual indent Line 85:19: E128 continuation line under-indented for visual indent Line 86:19: E128 continuation line under-indented for visual indent Line 87:1: W293 blank line contains whitespace Line 88:1: E302 expected 2 blank lines, found 1 Line 88:44: E231 missing whitespace after ',' Line 88:52: E231 missing whitespace after ',' Line 88:58: E231 missing whitespace after ',' Line 89:80: E501 line too long (145 > 79 characters) Line 89:125: E251 unexpected spaces around keyword / parameter equals Line 89:127: E251 unexpected spaces around keyword / parameter equals Line 89:138: E251 unexpected spaces around keyword / parameter equals Line 89:140: E251 unexpected spaces around keyword / parameter equals Line 91:7: E231 missing whitespace after ',' Line 91:11: E231 missing whitespace after ',' Line 91:14: E231 missing whitespace after ',' Line 92:1: W293 blank line contains whitespace Line 94:19: E128 continuation line under-indented for visual indent Line 95:19: E128 continuation line under-indented for visual indent Line 96:19: E128 continuation line under-indented for visual indent Line 97:19: E128 continuation line under-indented for visual indent Line 100:44: E231 missing whitespace after ',' Line 100:52: E231 missing whitespace after ',' Line 100:58: E231 missing whitespace after ',' Line 101:80: E501 line too long (90 > 79 characters) Line 103:7: E231 missing whitespace after ',' Line 103:11: E231 missing whitespace after ',' Line 103:14: E231 missing whitespace after ',' Line 104:1: W293 blank line contains whitespace Line 106:19: E128 continuation line under-indented for visual indent Line 107:19: E128 continuation line under-indented for visual indent Line 108:19: E128 continuation line under-indented for visual indent Line 109:19: E128 continuation line under-indented for visual indent Line 110:19: E128 continuation line under-indented for visual indent Line 113:44: E231 missing whitespace after ',' Line 113:52: E231 missing whitespace after ',' Line 113:58: E231 missing whitespace after ',' Line 114:80: E501 line too long (96 > 79 characters) Line 116:6: E231 missing whitespace after ',' Line 116:8: E231 missing whitespace after ',' Line 116:10: E231 missing whitespace after ',' Line 118:1: W293 blank line contains whitespace Line 122:1: W293 blank line contains whitespace Line 124:44: E231 missing whitespace after ',' Line 124:52: E231 missing whitespace after ',' Line 124:58: E231 missing whitespace after ',' Line 125:80: E501 line too long (111 > 79 characters) Line 127:6: E231 missing whitespace after ',' Line 127:8: E231 missing whitespace after ',' Line 127:10: E231 missing whitespace after ',' Line 131:19: E128 continuation line under-indented for visual indent Line 132:19: E128 continuation line under-indented for visual indent Line 133:19: E128 continuation line under-indented for visual indent Line 134:19: E128 continuation line under-indented for visual indent Line 137:44: E231 missing whitespace after ',' Line 137:52: E231 missing whitespace after ',' Line 137:58: E231 missing whitespace after ',' Line 138:80: E501 line too long (98 > 79 characters) Line 139:39: E127 continuation line over-indented for visual indent Line 141:6: E231 missing whitespace after ',' Line 141:8: E231 missing whitespace after ',' Line 141:10: E231 missing whitespace after ',' Line 145:19: E128 continuation line under-indented for visual indent Line 146:19: E128 continuation line under-indented for visual indent Line 147:19: E128 continuation line under-indented for visual indent Line 148:19: E128 continuation line under-indented for visual indent Line 149:19: E128 continuation line under-indented for visual indent Line 151:1: W293 blank line contains whitespace Line 152:44: E231 missing whitespace after ',' Line 152:52: E231 missing whitespace after ',' Line 152:58: E231 missing whitespace after ',' Line 153:80: E501 line too long (107 > 79 characters) Line 155:6: E231 missing whitespace after ',' Line 155:8: E231 missing whitespace after ',' Line 155:10: E231 missing whitespace after ',' Line 157:1: W293 blank line contains whitespace Line 159:19: E128 continuation line under-indented for visual indent Line 160:19: E128 continuation line under-indented for visual indent Line 161:19: E128 continuation line under-indented for visual indent Line 163:1: W293 blank line contains whitespace Line 164:44: E231 missing whitespace after ',' Line 164:52: E231 missing whitespace after ',' Line 164:58: E231 missing whitespace after ',' Line 165:80: E501 line too long (96 > 79 characters) Line 167:6: E231 missing whitespace after ',' Line 167:8: E231 missing whitespace after ',' Line 167:10: E231 missing whitespace after ',' Line 169:1: W293 blank line contains whitespace Line 171:19: E128 continuation line under-indented for visual indent Line 172:19: E128 continuation line under-indented for visual indent Line 173:19: E128 continuation line under-indented for visual indent

In the file Corrfunc/theory/DD.py:

Line 153:80: E501 line too long (80 > 79 characters) Line 156:80: E501 line too long (81 > 79 characters) Line 224:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/theory/DDrppi.py:

Line 163:80: E501 line too long (80 > 79 characters) Line 171:80: E501 line too long (80 > 79 characters) Line 174:80: E501 line too long (81 > 79 characters) Line 271:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/theory/DDsmu.py:

Line 179:80: E501 line too long (80 > 79 characters) Line 182:80: E501 line too long (81 > 79 characters) Line 277:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/theory/wp.py:

Line 288:80: E501 line too long (80 > 79 characters) Line 412:80: E501 line too long (80 > 79 characters) Line 420:80: E501 line too long (80 > 79 characters) Line 423:80: E501 line too long (81 > 79 characters) Line 500:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/theory/xi.py:

Line 139:80: E501 line too long (80 > 79 characters) Line 142:80: E501 line too long (81 > 79 characters) Line 210:80: E501 line too long (96 > 79 characters)

In the file Corrfunc/utils.py:

Line 18:13: E127 continuation line over-indented for visual indent Line 18:80: E501 line too long (99 > 79 characters)

Comment last updated at 2022-05-05 19:51:25 UTC

manodeep commented 2 years ago

@adematti Great - thanks for the PR. I took a quick look at some C files and can see a couple of improvements. Quickly leaving comments for now - will add a full code review later.

It might be better to rename the new function setup_bin_type to detect_bin_type
What happens when there is exactly 1 bin - the current setup will end up with a divide by zero
In the kernels, you are likely to get better performance by computing the reciprocal of rpstep and then multiplying with that reciprocal (in both the SIMD and the regular trailing loop sections)
Will you please check if it is possible to cast the linear histogram bin calculation [truncate((r - rmin)*inv_rstep)] as a fused multiply-add (FMA)? If so, that might save a bit more cpu cycles

manodeep commented 2 years ago

Hi,

As discussed with Lehman for DESI applications, this is a first attempt to implement a special handling of linear binning, which helps corrfunc run faster with a large number of bins. There is no speed gain at low number of bins ~ 10, but a speed gain > 3x for bins ~ 200. I added a flag bin_type to the Python wrapper, to choose between: 'auto', 'lin', 'custom'; with 'lin' for linear binning, 'auto' (default) to automatically detect linear binning (which happens here: https://github.com/adematti/Corrfunc/blob/c500d2c9137ff7f66d34eb4189bc19f9325d84ff/utils/utils.c#L85). The following pair counters have been updated: mocks: DDrppi_mocks, DDsmu_mocks theory: DD, DDrppi, DDsmu, xi, wp I did not update DDtheta_mocks (yet), as there are implementation choices to be made: a) should I consider the binning to be linear in theta or acos? (I guess the former, as the pair counts for the latter can be obained from DD if maximum separation < \pi)

I suspect linear in theta is the more typical choice. Is it possible to support both through some user-option?

b) should I use fast acos to compute the bin for a given pair? in this case, a pair can fall in the wrong bin due to numerical approximation To be discussed, also:

Hmmm - fast-acos option should be independent of the bin-type choice. If the user has specified fast_acos, then we should the approximations, otherwise, we should use the full-accuracy trig function

1. should I pass rstep (bin width), rmin to the kernels directly (currently taking the square root of sqr_rmin, sqr_rmax, which is slightly suboptimal)

Yeah - I saw that one. May be change the behaviour to pass rmin and rmax to the kernels, and then square those to get the sqr_rmin and sqr_rmax`?

2. should I print the chosen binning type; in this case, where? in each countpairs_DOUBLE function?

Yes, but only if options->verbose is set.

3. should I leave option 'auto'? it checks for linear binning, with absolute and relative tolerance of 1e-12 for both double and floats.
   Not sure how it behaves in practice for the latter case (floats), see remark ii) below.

I would change the tolerance to be appropriate for the precision of the numeric type. 1e-12 seems reasonable for double but too small for float (probably more like 1e-8). May be check what the default options for np.allclose are?

   I also have a couple of remarks:
   i) this line https://github.com/manodeep/Corrfunc/blob/596fe77078d59b296b34608927e301c427331919/Corrfunc/utils.py#L461
    seems unnecessary?
   (as array updates are done in place and calls to this functions do not retrieve the returned arrays)

This is for @lgarrison

   ii) I don't think setup_bins_float is used anywhere in the code, and setup_bins_double is only used here: https://github.com/manodeep/Corrfunc/blob/74c6fc29f9a0236eaebbc1830a1f59fd9a53bfc8/mocks/DDtheta_mocks/countpairs_theta_mocks_impl.c.src#L534

The source (.c.src files) are pre-processed by sed twice and DOUBLE becomes double and float -- i.e., setup_bins_DOUBLE translates to both setup_bins_float and setup_bins_double

   Is there any reason not to use a single version of this function, setup_bins?
   iii) Some time ago I had binning errors with Corrfunc after calling matplotlib. I figured out this was due to how the string separation for floats was handled (matplotlib changed the environment variables specifying this convention;
   export LC_NUMERIC=C solved this.). For this issue not to happen to other people, I guess the simplest solution would be to pass bins as an array rather than a file to be read from disk. Is there any reason not to do so?

Huh - okay! I can see that happening but that's poor behaviour on maplotlib's part.

The reason for specifying the bins from a file is historical - reading from a file lets us keep the exact same bins as the output from a previous Corrfunc run.

Many thanks for using Corrfunc and adding such a feature! Hopefully, I have answered all/most of the queries.

Cheers, Manodeep

Thanks! Best, Arnaud

manodeep commented 2 years ago

* Will you please check if it is possible to cast the linear histogram bin calculation `[truncate((r - rmin)*inv_rstep)]` as a fused multiply-add (FMA)? If so, that might save a bit more cpu cycles

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#avx512techs=AVX512F&expand=3441,3186&text=mm512_mask3_fmadd_round_p

If you rewrite the bin calculation from rpbin := (r - rmin)*inv_rstep to rpbin := r*inv_rstep - rmin*inv_rstep, then the calculation takes the required FMA form. The rounding parameter in the intrinsic is likely the _MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC but might require a separate truncate to integer.

adematti commented 2 years ago

Thanks for your answer! Having one bin (nbin = 2) is ok; zero bin (nbin < 2) fails after checks like https://github.com/manodeep/Corrfunc/blob/74c6fc29f9a0236eaebbc1830a1f59fd9a53bfc8/mocks/DDsmu_mocks/countpairs_s_mu_mocks_impl.c.src#L302 I can have detect_bin_type return an EXIT_FAILURE in case nbin <= 1 (though that will have no impact on the code). Concerning kernels and FMA, I will check that asap.

I suspect linear in theta is the more typical choice. Is it possible to support both through some user-option?

hum, like "cos_lin"? but that will only make sense for DDtheta...

Hmmm - fast-acos option should be independent of the bin-type choice. If the user has specified fast_acos, then we should the approximations, otherwise, we should use the full-accuracy trig function

My point is that using fast-acos to perform linear binning may yield different npairs than with the "custom" binning (fast-acos being an approximation). I don't think we want that. So linear binning should rely on true-acos, whatever is required for thetaavg. I guess that will degrade efficiency by a lot - though not tested yet.

By the way, "auto" is used by default, meaning linear binning whenever input bins are linear. My tests showed it did not slow down pair counts even with 10 bins, but these tests are not exhaustive...

May be check what the default options for np.allclose are?

This is rtol=1e-05, atol=1e-08. It may yield incorrect pair counts. Then I guess we should have "custom" as a default instead of "auto", as pair counts should be guaranteed correct with the default options.

The source (.c.src files) are pre-processed by sed twice and DOUBLE becomes double and float -- i.e., setup_bins_DOUBLE translates to both setup_bins_float and setup_bins_double

hm... setup_bins_float and setup_bins_double are defined in utils.c. There is no utils.c.src, hence no setup_bins_DOUBLE definition. All the calls are to setup_bins (except for DDtheta, where setup_bins_double in called). Maybe I missed something?

adematti commented 2 years ago

Hi, Sorry, I meant to push this earlier, but forgot... I think all your comments above have been addressed, but: 1) bin_type=r'auto' => r'' is not necessary, but have included it as it was in e.g. isa=r'fastest' 2) linear binning treatment for DDtheta still missing 3) I could not check implementation for AVX512 kernels as my processors do not support it I reverted the default binning type to 'custom', to avoid issues related to "binning is interpreted as linear but is actually not" with default options.

lgarrison commented 2 years ago

Hey @adematti, thank you so much for this contribution! This is not a small effort, and it looks great so far. And thanks @manodeep for reviewing this while I was away. I will try to catch up on the full set of comments above and do my own review, but probably not until tomorrow or Monday at the earliest.

manodeep commented 2 years ago

Thanks for your answer! Having one bin (nbin = 2) is ok; zero bin (nbin < 2) fails after checks like

Great! I always forget that the way the code is setup nbins in the code is not the usual nbins from the user-perspective

https://github.com/manodeep/Corrfunc/blob/74c6fc29f9a0236eaebbc1830a1f59fd9a53bfc8/mocks/DDsmu_mocks/countpairs_s_mu_mocks_impl.c.src#L302

I can have detect_bin_type return an EXIT_FAILURE in case nbin <= 1 (though that will have no impact on the code). Concerning kernels and FMA, I will check that asap.

Awesome - yup!

I suspect linear in theta is the more typical choice. Is it possible to support both through some user-option?

hum, like "cos_lin"? but that will only make sense for DDtheta...

Or better yet, we can design for the ability to allow cos_lin in the future but only implement theta_lin for now.

Hmmm - fast-acos option should be independent of the bin-type choice. If the user has specified fast_acos, then we should the approximations, otherwise, we should use the full-accuracy trig function

My point is that using fast-acos to perform linear binning may yield different npairs than with the "custom" binning (fast-acos being an approximation). I don't think we want that. So linear binning should rely on true-acos, whatever is required for thetaavg. I guess that will degrade efficiency by a lot - though not tested yet.

If I am understanding correctly, then you are referring to the impact of fast_acos during the detection of the bin type. If so, yes, I agree - the bin detection should run with the full precision acos. The impact of fast_acos is only in the kernels where the actual calculations are done.

The source (.c.src files) are pre-processed by sed twice and DOUBLE becomes double and float -- i.e., setup_bins_DOUBLE translates to both setup_bins_float and setup_bins_double

hm... setup_bins_float and setup_bins_double are defined in utils.c. There is no utils.c.src, hence no setup_bins_DOUBLE definition. All the calls are to setup_bins (except for DDtheta, where setup_bins_double in called). Maybe I missed something?

Most of the functions in utils.c.src are not precision dependent - hence we did not have the two versions. However, that can certainly change if the need arises.

adematti commented 2 years ago

Or better yet, we can design for the ability to allow cos_lin in the future but only implement theta_lin for now.

bin_type = 'lin' is now implemented for DDtheta_mocks

If I am understanding correctly, then you are referring to the impact of fast_acos during the detection of the bin type.

No, I am referring to the impact of using fast_acos to get theta, and then the bin index (theta - theta_min)/nbins + theta_min. With fast_acos=True instead of fast_acos=False, some fraction of pairs may shift bins (in addition to thetaavg being different if required). I have added a note in the docstrings to explain this.

lgarrison commented 2 years ago

@adematti Just enabled the CI checks. Have a look when the tests are finished running and see if there are any failures? I can't remember if the CI has AVX-512, but if not, I'll run the tests locally (Cori KNL also has AVX-512).

manodeep commented 2 years ago

Noting that this will need to be tested with something similar to make tests -DINTEGRATION_TESTS (i.e., tested for all combinations of bin_refinement_factors, isa, min_sep_opt etc. These tests should that the output is identical between bin_type = custom and bin_type = lin (for a different set of linear bins) - i.e., what @adematti has already done within the python code

manodeep commented 2 years ago

@lgarrison 2.4.0 is already overdue - is this okay to push back to milestone 2.5.0?

lgarrison commented 2 years ago

I was thinking this was useful enough that we'd want to do a release when it is ready, but also that it is progressing pretty rapidly, so we'd just end up doing two releases back-to-back if we target 2.5. But I don't feel strongly; we can instruct DESI to install from git if necessary (although they'll prefer versioned code eventually).

manodeep commented 2 years ago

@lgarrison I agree - this is a significant enough change to warrant a release. I have moved the milestone to 2.5.0 (including all remaining 2.4.0 milestoned issues). Let me get to releasing the v2.4.0 now

adematti commented 2 years ago

clang tests fail because the -mfma flag is required. gcc tests fail because of this line (which I left untouched): https://github.com/manodeep/Corrfunc/blob/596fe77078d59b296b34608927e301c427331919/theory/DD/countpairs_kernels.c.src#L35 I guess one could remove it? I very naively tried compiling on Cori KNL, after loading craype-mic-knl, but install did not pick the AVX512 flag, I guess there are other things to specify?

lgarrison commented 2 years ago

clang tests fail because the -mfma flag is required.

Hmm, would have hoped -march=native would have set that if the CPU supports it. The CI CPUs probably support it, as its quite old. But if it's a weird virtual environment they might not, in which case the correct behavior would be to protect the SSE kernels with an FMA check.

We should probably add a lscpu command to the GitHub CI so we can triage cases like this. Will PR this...

gcc tests fail because of this line (which I left untouched):

https://github.com/manodeep/Corrfunc/blob/596fe77078d59b296b34608927e301c427331919/theory/DD/countpairs_kernels.c.src#L35

I guess one could remove it?

Yeah, should remove that line now that those variables are not part of the function signature. And actually, sqr_rpmin below that can be removed completely, as its unused.

I very naively tried compiling on Cori KNL, after loading craype-mic-knl, but install did not pick the AVX512 flag, I guess there are other things to specify?

I guess so. I'll try it myself once I fix a quota issue at NERSC!

lgarrison commented 2 years ago

Hi @adematti, can you merge the latest master into your feature branch and push so the tests are rerun with the new CI config? That will also dump the CPU flags so we can debug the OSX failure, although my preliminary look at #259 suggests that the OSX CI CPUs indeed have SSE but no FMA, and since this branch introduces an FMA requirement, it breaks. So, we could require that both FMA and SSE are present; that shouldn't be too burdensome, as FMA is quite old (2011/2012).

lgarrison commented 2 years ago

Here's a better solution, I think: in sse_calls.h, can you just alias the call to a non-FMA version if FMA isn't available? i.e.:

#ifdef __FMA__
#define SSE_FMA_ADD_FLOATS(X,Y,Z)          _mm_fmadd_ps(X,Y,Z)
#else
#define SSE_FMA_ADD_FLOATS(X,Y,Z)          _mm_add_ps(_mm_mul_ps(X,Y),Z)
#endif

lgarrison commented 2 years ago

So I think (but am not 100% positive) that the __FMA__ flag only corresponds to SSE FMA, not AVX FMA. E.g. the Wikipedia page calls it out as an SSE extension specifically: https://en.wikipedia.org/wiki/FMA_instruction_set.

adematti commented 2 years ago

I see, thanks! I read FMA was introduced at the same time as AVX2, hence wrongly thought FMA flag was defined for AVX as well. I will remove changes to avx_calls.h (and solve errors below).

adematti commented 2 years ago

Removing the FMA flag for AVX raises again errors like:

'countpairs_avx_intrinsics_double' that is compiled without support for 'fma'
                m_rpbin = AVX_FMA_ADD_FLOATS(union_mDperp.m_Dperp,m_inv_rpstep,m_rpmin_invstep); // r2, then union_mDperp.m_Dperp aready clipped

(there are also some more bugs with AVX512, not unexpected as I could not try compiling on my computer)

lgarrison commented 2 years ago

Okay! You were right the first time, then; feel free to add the check back to AVX too.

For what it's worth, if you're trying to check if AVX-512 is compiling properly, you can probably hack it to compile on your computer by replacing -march=native with -march=skylake-avx512 or similar. You won't be able to run it, but it ought to compile.

manodeep commented 2 years ago

Apologies for the long silence -- FMA instructions should only be used in AVX2 and AVX512 kernels; the AVX and SSE kernels need to do the two separate ops for multiply followed by the add.

manodeep commented 2 years ago

For what it's worth, if you're trying to check if AVX-512 is compiling properly, you can probably hack it to compile on your computer by replacing -march=native with -march=skylake-avx512 or similar. You won't be able to run it, but it ought to compile.

Yup - let's make sure about the code compiling. The GH actions cpus have AVX512F - so pushing should automatically test those new kernels.

I do have access to skylake cpus - so will setup a comprehensive testing when I get a chance.

lgarrison commented 2 years ago

Dumb question, but is Corrfunc/tests.py actually ever run as part of the CI? I think we trigger the tests with make tests, and I'm not sure that ever calls it. @manodeep am I missing something? I think this is the main place that we're testing the linear binning.

manodeep commented 2 years ago

Dumb question, but is Corrfunc/tests.py actually ever run as part of the CI? I think we trigger the tests with make tests, and I'm not sure that ever calls it. @manodeep am I missing something? I think this is the main place that we're testing the linear binning.

I don't think that is ever used in the CI - only mentioned in the docs as a way of checking whether the python installation worked.

lgarrison commented 2 years ago

Got it, so we ought to start calling that as part of the CI or migrate the linear binning tests elsewhere. The former would repeat some work that make tests already does, but is probably the most straightforward (and if we have a tests.py, I think most users/devs would expect it to be called from the tests).

adematti commented 2 years ago

I can add - python Corrfunc/tests.py after https://github.com/manodeep/Corrfunc/blob/d51f43c91ca87718a5eac88ba6c752111f7257cc/.travis.yml#L134 Maybe one should have python tests removed from theory/tests/Makefile (same for mocks), and all tests based on the Python wrapper within Corrfunc?

lgarrison commented 2 years ago

Oops, I didn't see your comment before I opened the PR! And yeah, I think it would be ideal to consolidate the Python testing; there are basically two call_correlation_functions.py files right now, one in theory/python_bindings and one in Corrfunc/. I guess technically the former doesn't require that the package be installed, just built, while the latter probably requires it to be installed. And since make tests doesn't actually install anything, one can't call anything in Corrfunc/ from there... so not quite so easy to consolidate.

So #260 might be the easiest solution for now, but I'm open to ideas.

manodeep commented 2 years ago

I have merged the PR that adds the python CI tests into GH Actions and restarted the tests - hopefully the python tests will run now

lgarrison commented 2 years ago

Doesn't appear so; @adematti, can you merge master into linearbinning?

lgarrison commented 2 years ago

Looks like the linear binning is failing in some cases; it appears to only happen in the AVX-512 kernels, if I'm reading the logs right.

@adematti Since you can't run AVX-512 (except maybe on KNL?), if you can't find this bug just by reading the code, then I'll take a look. I may not have a chance until next Wednesday though, unfortunately.

adematti commented 2 years ago

Yes, indeed :/ Unfortunately I faced this issue: https://github.com/manodeep/Corrfunc/issues/183 when trying to compile on Cori KNL.

lgarrison commented 2 years ago

I discovered that the AVX-512 problem is that _mm512_maskz_fmadd_round_pd(MASK, X, Y, Z, _MM_FROUND_TO_ZERO|_MM_FROUND_NO_EXC) is not rounding or truncating the result; it's as if no rounding took place. Still trying to understand why.

lgarrison commented 2 years ago

@adematti Here is my attempt at a workaround: https://github.com/manodeep/Corrfunc/tree/linearbinning-rounding

Can you merge that into your branch?

It's possible switching the whole bin index computation to ints would be more efficient, but the int AVX512 macros are not set up in the way we would want them to be regarding type widths, I think. So this is the easiest solution for now.

We ought to assess the performance impact of the extra rounding, since it's kind of expensive (8 cycles). Hopefully the linear bin computation will more than make up for that.

adematti commented 2 years ago

done! sorry for the delay

lgarrison commented 2 years ago

I've added some commits to that branch, can you merge again? These should remove some of the unrelated test failures that are muddying the CI.

lgarrison commented 2 years ago

So, I think this is all working now, but not all of the GitHub Actions runners have AVX-512 support. So @manodeep, can you run the tests locally, including Corrfunc/tests.py? It all passes for me but we ought to double-check since it's not clear why the AVX-512 workaround for the rounding is even necessary.

manodeep commented 2 years ago

Ran the tests on an Skylake cpu with AVX512 - make tests passed. python -m Corrfunc.tests failed with a different error:

Running 2-D correlation function xi(rp,pi)

_countpairs_mocks.error: ArgumentError: In DDrppi_mocks> Could not parse the arguments. Input parameters are:
autocorr, cosmology, nthreads, pimax, binfile, RA1, DEC1, CZ1, weights1, RA2, DEC2, CZ2, weights2, is_comoving_dist, verbose, output_rpavg, fast_divide_and_NR_steps, xbin_refine_factor, ybin_refine_factor, zbin_refine_factor, max_cells_per_dim, copy_particles, enable_min_sep_opt, c_api_timer, isa, weight_type,
Traceback (most recent call last):
  File "/apps/skylake/software/Python/3.8.5-gni-2020.0/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/apps/skylake/software/Python/3.8.5-gni-2020.0/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 183, in <module>
    test_linear_binning_mocks(isa=isa)
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 79, in test_linear_binning_mocks
    test_bin_type(DDrppi_mocks, autocorr, cosmology, nthreads,
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 76, in test_bin_type
    assert allclose(*res)
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 74, in <genexpr>
    res = (pair_counter(*args, bin_type=bin_type,
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/mocks/DDrppi_mocks.py", line 401, in DDrppi_mocks
    raise RuntimeError(msg)
RuntimeError: RuntimeError occurred

Is that related to #256 ?

(I used git fetch origin pull/258/head:linearbinning-avx512test and then checked out the new branch to run the tests)

lgarrison commented 2 years ago

Thanks for checking this. Did you run pip install -e ./ before running python -m Corrfunc.tests? Otherwise, it may be picking up an old version of the Python extension.

manodeep commented 2 years ago

This might have been because I had not re-installed the package. That parsing error went away after I re-installed this version of Corrfunc (python -m pip install -e .). I see this test failure:


[~/research/JournalSubmissions/SpeedingUpCode @farnarkle1] python -m Corrfunc.tests
Done reading the data - time taken =        1.2 seconds
Beginning Theory Correlation functions calculations
Running 3-D correlation function DD(r) (isa=fallback)
Custom binning
Running with points in [xmin,xmax] = 0.003492,419.960876 with periodic wrapping = 420.000000
Running with points in [ymin,ymax] = 0.019300,419.992493 with periodic wrapping = 420.000000
Running with points in [zmin,zmax] = 0.016100,419.983887 with periodic wrapping = 420.000000
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 27,27,13.  Time taken =   0.012 sec
Using fallback kernel
0%.........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done. Time taken =  0.017 secs
Linear binning
Running with points in [xmin,xmax] = 0.003492,419.960876 with periodic wrapping = 420.000000
Running with points in [ymin,ymax] = 0.019300,419.992493 with periodic wrapping = 420.000000
Running with points in [zmin,zmax] = 0.016100,419.983887 with periodic wrapping = 420.000000
In gridlink_double> Running with [nmesh_x, nmesh_y, nmesh_z]  = 27,27,13.  Time taken =   0.009 sec
0%.........10%.........20%.........30%.........40%.........50%.........60%.........70%. done. Time taken =  0.014 secs
Traceback (most recent call last):
  File "/apps/skylake/software/Python/3.8.5-gni-2020.0/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/apps/skylake/software/Python/3.8.5-gni-2020.0/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 186, in <module>
    test_linear_binning_theory(isa=isa)
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 149, in test_linear_binning_theory
    test_bin_type(DD, autocorr, nthreads, binfile, x, y, z, weights1=w,
  File "/home/msinha/research/JournalSubmissions/SpeedingUpCode/Corrfunc/tests.py", line 146, in test_bin_type
    assert allclose(*res)
AssertionError

``

manodeep commented 2 years ago

Thanks for checking this. Did you run pip install -e ./ before running python -m Corrfunc.tests? Otherwise, it may be picking up an old version of the Python extension.

Hehehe - just saw this comment. Yup - that's what it was :)

lgarrison commented 2 years ago

Hmm, that's interesting, it's failing in the fallback kernel. Doesn't happen on the CI, or locally for me. I wonder if some floating-point optimization by the compiler has caused some pairs to shift by one bin. Could you add some print statements to near the allclose assertion that is failing to see if that is indeed the case?

manodeep commented 2 years ago


def allclose(a, *others):
    toret = True
    for b in others:
        for name in ['npairs','weightavg']:
            if not np.all(np.allclose(a[name], b[name])):
                print("Mis-match for {0}: (a, b) = {1} ".format(name, list(zip(a[name], b[name]))))
            toret &= np.allclose(a[name], b[name])

    return toret

Mis-match for npairs: (a, b) = [(21808, 22552), (24904, 26224), (28480, 31424), (34104, 35104), (41368, 41208), (43344, 43584), (50584, 50544), (52880, 52336), (57496, 58392), (57736, 59912), (63592, 63016), (63664, 61720), (65416, 64376), (70248, 69048), (75232, 75016), (76456, 76664), (78032, 77760), (82144, 82608), (84976, 85864), (86048, 85136), (91576, 92136), (86176, 85944), (91872, 91336), (97200, 98192), (101352, 102720)] 
Mis-match for weightavg: (a, b) = [(0.5609590045092565, 0.5571883705368976), (0.5446279905774939, 0.5533362350186771), (0.5520848249787106, 0.5609407405525946), (0.5542048901243994, 0.5596741404726092), (0.5475160879310123, 0.5643908987457708), (0.5375731064997236, 0.5520007372445179), (0.5469271713018543, 0.5538971501774297), (0.5446838832048764, 0.5564163252862282), (0.5521477263856891, 0.5606700136418482), (0.5471066480168612, 0.5537097729699632), (0.5448368669807974, 0.5552436242463595), (0.5478302517039543, 0.5559992871052971), (0.5512250019739253, 0.5599964832458033), (0.5500369547269897, 0.5509011144491291), (0.551650426987595, 0.5583687366500986), (0.5560012624078268, 0.5601045625262797), (0.554371074603747, 0.5580061017611274), (0.5532532112069565, 0.5546657346681807), (0.5544148749342706, 0.5581064282419461), (0.5546862509369678, 0.5560676632312374), (0.549754983582397, 0.5528945618562301), (0.5453400721288449, 0.5494928001336729), (0.551805714780378, 0.5542515789660692), (0.5506258670018143, 0.5544890328546984), (0.5472743388353181, 0.5517854344001935)]

Fairly big differences - so off by one index is possible. Best to debug on one pair-counter and then port the solution across to the others

lgarrison commented 2 years ago

Yeah, definitely more than a few pairs. I haven't been able to reproduce... can you share your compiler version?

manodeep commented 2 years ago

Here are all the relevant packages (modules):gcc/9.2.0 gcccore/9.2.0 binutils/2.33.1 gsl/2.5 python/3.8.5 numpy/1.19.2-python-3.8.5

lgarrison commented 2 years ago

Great, thanks. Unfortunately, I will probably not have time to investigate this until next week. But if you or @adematti feel inspired to do some debugging, please go ahead!

adematti commented 2 years ago

Hi, sorry I haven't had the time yet to investigate that. Using gcc/9.3.0, binutils/2.34, gsl/2.5 python/3.8.5 numpy/1.19.2 and python/3.8.5 returns correct results. I haven't tried (yet) with gcc/9.2.0 and binutils/2.33.1.

manodeep commented 2 years ago

I built with gcc/7, binutils/2.30 (with disabled AVX512 due to the GNU Assembler bug) and then the tests passed. Repeated (just for my sanity) with gcc/9 and binutils/2.33 - test failed for fallback.

manodeep commented 2 years ago

Disabling AVX512 by adding CFLAGS ?= -mno-avx512f into common.mk and re-compiling with gcc/9 + binutils/2.33 still results in the test failure.

(This is weird)

manodeep / Corrfunc

special handling of linear binning #258

Comment last updated at 2022-05-05 19:51:25 UTC