mreineck / ducc

Fork of https://gitlab.mpcdf.mpg.de/mtr/ducc to simplify external contributions
GNU General Public License v2.0

Interface suggestions for new Wigner-related calculations #26

Closed · zatkins2 closed this issue 4 months ago

zatkins2 commented 5 months ago

Hi Martin,

I'm excited (via @xzackli) to use your in-development coupling_matrix_spin0and2 methods! The ability to calculate coupling matrices for multiple spectra at roughly constant cost (independent of the number of spectra) will really enable some computations that are now prohibitive (since we currently calculate the same Wigner symbols over and over for each spectrum separately).

To make it even more useful, could I suggest an additional interface? As it is now, all possible "flavors" of coupling matrix (00, 02, 20, ++, --, in that order) are calculated in coupling_matrix_spin0and2. Sometimes not all of these flavors are necessary; e.g., we may want to skip calculating the last "--" flavor because it is negligible at our scales of interest. It would be nice to be able to tell the function which flavors we would like to compute (assuming the required spin-weighted spectra are passed), to avoid unnecessary computation. If there is no extra cost to compute all flavors, though, then feel free to ignore me!

A separate request (that is lower priority for me personally :)) would be the ability to pass l1min, l1max to wigner3j_int, or even possibly an explicit list of l1 values, rather than having the function calculate all possible values. But again, if this is not feasible or doesn't save any time, also feel free to ignore me!

Many thanks, very much looking forward to using this functionality!

Zach

mreineck commented 5 months ago

Hi Zach,

great to hear that this might be useful!

Concerning the selection of coupling matrix flavors: as far as I can see at the moment (I've been working on other stuff recently, so this is all a little rusty :), there is not much performance to be gained by leaving out individual parts, with a few exceptions:

That's for coupling_matrix_spin0and2. For the _pure variant things are more complex, but there, too, computing time can be saved only for a very small subset of components.

There is however another aspect to this: if there is concern that the matrices become very big and you want to save memory (as opposed to time) by omitting some flavors, that can certainly be done, but I need some time for the adjustment. (However, if that is a concern, we could always save a nice factor of 2 by returning just an upper triangular matrix and having the caller apply the (2*l+1) factors...)
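Caller-side, that step could be as simple as the following numpy sketch (which axis receives the (2*l+1) weights is an assumption here):

```python
import numpy as np

# Sketch only: mirror the upper triangle into a full symmetric matrix and
# apply the (2*l+1) weights; the weighted axis is an assumption.
def symmetrize_and_weight(upper, lmax):
    full = np.triu(upper) + np.triu(upper, 1).T
    return full * (2.0 * np.arange(lmax + 1) + 1.0)
```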

Specifying an l1 subset to wigner3j_int unfortunately does not save any time, since all possible coefficients must be computed anyway to get the proper normalization.

zatkins2 commented 5 months ago

Good point, yes, memory could be an issue. If it is the case that the runtime is ~constant for large numbers of spectra, then I think being able to specify the exact flavor you want would be useful, even if it means ~2 or 3x the total computation.

But I don't want to hold up getting what you have available in ducc; the current interface of your implementation would still be a massive improvement, so I am definitely eager to make use of it as soon as possible :)

(Another motivating example is that in practice it is often the case that the 02 spectra are equal to the 20 spectra)

mreineck commented 5 months ago

If you can already make use of the interface as it currently is, then I think it might be best to use it in this form for now. That will give us valuable information and allow us to come up with an even better interface in the next iteration.

Please let me know if I can help with integration in any way!

zatkins2 commented 5 months ago

Ah, I'm not sure I can; at least, if I install 0.33, I don't see the functions available. Would I need to download the current state of the repo and install from that, rather than from PyPI?

mreineck commented 5 months ago

Sorry, my bad: since these functions still have experimental interfaces, they are accessible as ducc0.misc.experimental.coupling_matrix_spin0and2 etc.
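A minimal usage sketch (the shape of the spectra array below is a guess for illustration; the docstring has the authoritative layout):

```python
import numpy as np
import ducc0

lmax, nthreads = 100, 4
# One set of the four required input spectra; this (4, lmax+1) layout is
# a guess for illustration only -- consult the docstring.
spec = np.ones((4, lmax + 1))
mats = ducc0.misc.experimental.coupling_matrix_spin0and2(spec, lmax, nthreads=nthreads)
```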

mreineck commented 5 months ago

But they should be part of the official 0.33 release.

zatkins2 commented 5 months ago

Got it, yep, I see them, thanks! I will play around and let you know my experience (but I do expect I will want to be able to select the flavors).

zatkins2 commented 5 months ago

I just played around a bit. It is really impressive! I observe constant scaling with nspec up to a point, after which it becomes linear. The point where it switches to linear grows with lmax, since larger lmax lets the (shared) 3j computations dominate for longer. As a quick benchmark, for lmax=10800 with nspec=1, I find it takes ~3 min on 10 cores, and with nspec=15 it takes ~10 min.

I do think it would be great if I could select the flavor. I know that for the immediate application I have in mind, I will only need the 00, 02, and ++ flavors (so a 40% speedup right away).

Another great option, like you mentioned, would be to not apply the 2l+1 factors. I will mostly be interested in building the symmetric coupling matrices, and could apply the 2l+1 factors later. Under the hood, I'm not sure if you're already doing this, but if not, this would allow you to compute only the upper triangle and then just copy the result to the lower triangle at the end (a 50% speedup).

Finally, if it could support single precision, that would be fantastic (another 50% speedup).

To be clear, I am making all these suggestions because this implementation is absolutely fantastic!

zatkins2 commented 5 months ago

And yes, besides the speedup, single precision (for example) will be important for SO for memory reasons. The combinatorics of the covariance matrix will require a huge number of couplings to be calculated.

mreineck commented 5 months ago

Just curious: did you use a binary wheel or did you build from source, using pip install --no-binary ducc0? If the former, you can get another nice speedup by compiling yourself.

> I do think it would be great if I could select the flavor. I know that for the immediate application I have in mind, I will only need the 00, 02, and ++ flavors (so a 40% speedup right away).

I will certainly think about this!

> Another great option, like you mentioned, would be to not apply the 2l+1 factors. I will mostly be interested in building the symmetric coupling matrices, and could apply the 2l+1 factors later. Under the hood, I'm not sure if you're already doing this,

I do :) So there will not be much speedup from this ...

I hesitated using the "half-matrix" format because it makes the structure of the output array more obscure; but halving memory is just too big of an advantage...

> Finally, if it could support single precision, that would be fantastic (another 50% speedup).

A large part of the calculations probably needs to stay in double precision to avoid error accumulation. Input and output can certainly be in single precision, however!

zatkins2 commented 5 months ago

> Just curious: did you use a binary wheel or did you build from source, using pip install --no-binary ducc0? If the former, you can get another nice speedup by compiling yourself.

Yep I did!

> I will certainly think about this!

Great, thanks! Would also help with memory.

> I do :) So there will not be much speedup from this ...

Ok, sure. It would still be nice to let the user decide about the 2l+1 factor, I think. Practically speaking anyway, there are many more matrices computed for the covariance (which don't have that factor) than for the spectra (which do).

> A large part of the calculations probably needs to stay in double precision to avoid error accumulation. Input and output can certainly be in single precision, however!

Awesome! No worries on the computation but yes for memory management it would be a great option.

xzackli commented 5 months ago

Thank you again @mreineck, I just want to reiterate Zach's comments that your work here has significant implications for the analysis of ACT and SO covariances, which grinds through a tremendous number of these coupling matrices.

mreineck commented 5 months ago

I have added a new method coupling_matrix_spin0and2_tri, which allows selecting the accuracy of the output and returns triangular matrices (see the docstring for memory layout).

You'll have to install this one directly from the repo, it's on branch mcm_work.

Concerning selectable flavors: to get good performance, I fear that I'll have to write a custom implementation for every combination of provided input spectra and required output matrices. It's not a big deal, however (I expect around 10-15 minutes of work for adding a new combination).

@zatkins2, did I understand correctly that in your first test case you want to provide three spectra instead of four, assuming that the 02 spectrum is the same as the 20 one, and that you only need the first, second and fourth matrix instead of the full output?

mreineck commented 5 months ago

One more thing I should perhaps add: given the limited memory bandwidth of a compute node, it may well be that there is an optimal number of spectra to compute simultaneously, instead of the idealized "the more, the better" expectation. This will depend on lmax, the number of threads and the underlying hardware (especially cache sizes). Getting the absolute maximum performance out of the algorithm will require some experimentation, but for large data sets this should be worth the effort.

zatkins2 commented 5 months ago

> I have added a new method coupling_matrix_spin0and2_tri, which allows selecting the accuracy of the output and returns triangular matrices (see the docstring for memory layout).

> You'll have to install this one directly from the repo, it's on branch mcm_work.

Very cool! I will try it out. As far as the triangular layout goes, do you have a (parallelized!) function that can turn it into the triangular half of a 2d array? That, I think, would be necessary. Alternatively, Sigurd has one here, which would work where iainfo and oainfo correspond to the triangular and rectangular layouts, but only if you were to change your triangular layout to match the one he adheres to here (it looks like they differ).

> Concerning selectable flavors: to get good performance, I fear that I'll have to write a custom implementation for every combination of provided input spectra and required output matrices. It's not a big deal, however (I expect around 10-15 minutes of work for adding a new combination).

> @zatkins2, did I understand correctly that in your first test case you want to provide three spectra instead of four, assuming that the 02 spectrum is the same as the 20 one, and that you only need the first, second and fourth matrix instead of the full output?

Understood, and that's correct. I think having the current version available, this proposed version available, and two more would cover every case that would actually happen:

Another random thought: if we knew that the spectra themselves were the same, could that help optimize things? E.g. in the current implementation, if all four spectra were the same, would there be some way of calculating the five matrices faster or using less memory? That situation will be common.

mreineck commented 5 months ago

I think the two memory layouts are actually the same, if m corresponds to l1 and l to l2 ... but I admit that this is not immediately obvious from the formulas. So you should be able to use pixell routines on the arrays.

Of the custom routines, I have implemented the proposed one, under the provisional name coupling_matrix_spin0and2_tri_zach1 :-) The others will probably be ready later today.

> Another random thought: if we knew that the spectra themselves were the same, could that help optimize things? E.g. in the current implementation, if all four spectra were the same, would there be some way of calculating the five matrices faster or using less memory? That situation will be common.

Yes, that will most likely help things. But again, this will be something that has to be hardwired into the sources in some way, if/else branches at runtime will most likely slow things down.

mreineck commented 5 months ago

The subtle point about Sigurd's formula `mstart = stride*(m*(2*lmax+1-m)//2)` is that it is actually equivalent to `mstart = stride*((m*(2*lmax+1-m))//2)` and not to `mstart = stride*(m*((2*lmax+1-m)//2))`, as I read it first. With that misunderstanding out of the way, I convinced myself that the storage schemes are in fact identical.
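In code, the scheme amounts to this (a plain serial sketch, assuming stride=1 and the pixell convention that element (l1, l2) sits at flat index mstart(l1) + l2):

```python
import numpy as np

# Serial reference only (not the parallelized helper discussed above):
# with m -> l1 and l -> l2, element (l1, l2) for l2 >= l1 lives at flat
# index (l1*(2*lmax+1-l1))//2 + l2 of the triangular array.
def tri_to_full(tri, lmax):
    full = np.empty((lmax + 1, lmax + 1), dtype=tri.dtype)
    for l1 in range(lmax + 1):
        start = (l1 * (2 * lmax + 1 - l1)) // 2
        row = tri[start + l1 : start + lmax + 1]
        full[l1, l1:] = row   # upper triangle
        full[l1:, l1] = row   # mirrored lower triangle
    return full
```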

zatkins2 commented 5 months ago

> I convinced myself that the storage schemes are in fact identical.

Great.

> Of the custom routines, I have implemented the proposed one, under the provisional name coupling_matrix_spin0and2_tri_zach1 :-) The others will probably be ready later today.

Amazing, thank you so much Martin!

> Yes, that will most likely help things. But again, this will be something that has to be hardwired into the sources in some way

Got it, I will leave it as food-for-your-thought.

mreineck commented 5 months ago

The _zach2 and _zach3 variants are now also in the code, but very much untested. Have fun!

zatkins2 commented 5 months ago

Ok! Will work off the mcm_work branch for now and let you know if any comments/issues. Thanks again.

mreineck commented 5 months ago

How about this fully generic interface for coupling_matrix_spin0and2_tri? Instead of the "full" version

```python
coupling_matrix_spin0and2_tri(spec, lmax, nthreads=nthreads, singleprec=True)
```

one could write

```python
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,1,2,3), mat_index=(0,1,2,3,4),
                              nthreads=nthreads, singleprec=True)
```

spec_index and mat_index describe where the individual spectra/matrices are located; if an index value is negative, the respective spectrum is not provided / the matrix is not requested.

The zach1, zach2 and zach3 versions would then look like this:

```python
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,1,1,2), mat_index=(0,1,-1,-1,2),
                              nthreads=nthreads, singleprec=True)
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,1,1,2), mat_index=(0,1,-1,2,3),
                              nthreads=nthreads, singleprec=True)
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,1,2,3), mat_index=(0,1,2,3,-1),
                              nthreads=nthreads, singleprec=True)
```

Using this syntax, it would also be easy to express that all spectra are equal (spec_index=(0,0,0,0)). I think I can mirror this on the C++ side using templating, which could make the sources significantly shorter. However, I'd still need to explicitly instantiate all the variants of the function we need, i.e. it won't be possible to access all imaginable combinations even after this change.
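Spelled out, that all-spectra-equal case would be:

```python
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,0,0,0), mat_index=(0,1,2,3,4),
                              nthreads=nthreads, singleprec=True)
```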

Do you think this would be a useful change?

mreineck commented 5 months ago

Sorry ... I was just impatient and pushed the change :-) I'm pretty sure this is a step in the right direction. There is now only a single function for all the triangular coupling matrices, called coupling_matrix_spin0and2_tri, which supports all the variants discussed above, plus pure spin 0, plus (for experimenting) a version that computes all matrices but assumes that all spectra are equal. Please have a look at the docstring and let me know if anything is unclear!

Out of curiosity: do you expect to use the spin0and2_pure function as well? I had expected that this would be required for accurate calculations.

mreineck commented 5 months ago

I pushed another change that accelerates coupling_matrix_spin0and2_tri when only a small number of spectral sets is provided, so don't be surprised if timings for 1-4 sets (pure spin0) and 1-2 sets (all other cases) look strange compared to other configurations.

Overall I'm now fairly happy with the current state. The only aspect I don't like very much is the really long compile time...

zatkins2 commented 5 months ago

Sorry for not replying sooner. Your consolidation looks nice to me!

Would the zach1 case not correspond to this (notice the difference in the mat_index)?

```python
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,1,1,2), mat_index=(0,1,-1,2,-1),
                              nthreads=nthreads, singleprec=True)
```

For clarity, I think there are 4 discussed cases, including your original one, that would look like this in the new interface. Lmk if you agree:

```python
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,1,1,2), mat_index=(0,1,-1,2,-1),
                              nthreads=nthreads, singleprec=True)
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,1,1,2), mat_index=(0,1,-1,2,3),
                              nthreads=nthreads, singleprec=True)
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,1,2,3), mat_index=(0,1,2,3,-1),
                              nthreads=nthreads, singleprec=True)
coupling_matrix_spin0and2_tri(spec, lmax, spec_index=(0,1,2,3), mat_index=(0,1,2,3,4),
                              nthreads=nthreads, singleprec=True)
```

mreineck commented 5 months ago

You are right, and this should now be fixed! (I'm now wondering if I got the original zach1 version right ...)

mreineck commented 5 months ago

Let me come back to the single vs. double precision for a moment ...

It seems that pspy does all coupling-matrix related computations in real(8) aka double, and NaMaster computes the 3j symbols in double and the rest in flouble, which is presumably float in most applications.

Do you think we are safe in all "normal" situations if we use the NaMaster approach? The only thing I'm a bit worried about is the accumulation in the loops over el3 (or l1 in NaMaster).

zatkins2 commented 5 months ago

Which approach did you go with initially? I would agree with you that I would be wary of accumulating large sums in a single-precision buffer.

On my end, I think the most important thing is that the output buffer is in single precision (if that's what was requested), and that results that have been fully accumulated in double precision are copied "on the fly" into the output buffer. By "on the fly" I mean that if you are computing nspec*4 couplings, you don't keep nspec*4 intermediate double-precision buffers, but rather one double-precision buffer that gets reused nspec*4 times, or something like that; otherwise there is no memory saving in requesting single precision, since there must still be enough memory to hold the intermediate nspec*4 double-precision buffers.
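Schematically, I mean something like this (all names invented, just to illustrate the buffering):

```python
import numpy as np

# Accumulate each output row in one reusable float64 scratch buffer and
# downcast once into the (possibly float32) output array.
def accumulate(contributions, out):
    # contributions: iterable over rows, each an iterable of float64 arrays
    scratch = np.zeros(out.shape[1], dtype=np.float64)
    for i, parts in enumerate(contributions):
        scratch[:] = 0.0
        for part in parts:
            scratch += part      # accumulation stays in double precision
        out[i] = scratch         # single downcast when the sum is final
    return out
```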

Also, a random question from earlier: in the tri method, does it no longer apply the 2l+1 factor to the output? That factor is not symmetric, whereas "unravelling" the triangular storage would most naturally assume symmetry.

mreineck commented 5 months ago

> Which approach did you go with initially? I would agree with you that I would be wary of accumulating large sums in a single-precision buffer.

I'm currently accumulating into double precision numbers and, once the accumulation is finished, I'm copying to the output array, which can be single or double precision. At any one time, only a handful (<100) accumulations take place per thread, so additional memory consumption is not a concern.

As a rough benchmark, I'm now running full spec0and2 calculations (single-precision output) with lmax=10800 and nspecs=30 in 30 min on my laptop (8 cores), and the process requires 36 GB.

> Also, random question from earlier: in the tri method, does it no longer apply the 2l+1 factor, since that is not symmetric?

That's correct.

zatkins2 commented 5 months ago

Sounds great.

With the 0.33.0 version of this interface, nspec=15 at the same lmax was calculated in 10 min on 10 cores on a Princeton cluster node (and, demonstrating the awesome utility of your implementation, nspec=1 was calculated in 3 min, so very sub-linear scaling!)

mreineck commented 5 months ago

I hope that the timings on the Princeton cluster will improve somewhat with the current version - not dramatically, but it should be noticeable. The most important change will be the reduced memory consumption (factor 1/4 when using the tri method and single precision matrices).

mreineck commented 5 months ago

I have now removed the older spin0 and spin0and2 methods that returned square matrices; maintaining them in parallel to the triangular ones would add quite a lot of work for practically no benefit. The _pure method still exists, and I'm happy to look into further optimizations there once I understand the typical usage scenarios.

mreineck commented 5 months ago

Since the original suggestions are now implemented and (I hope) working properly, here are a few follow-up questions:

@zatkins2, @xzackli any thoughts?

zatkins2 commented 5 months ago

Re the pure methods, I can't comment. I personally won't be using them, but they may be of use to others in ACT/SO working on dedicated B-mode analyses. I can ask some of my colleagues this week and report back to you (or have them comment). Or yes, Zack might have an opinion.

Can you link the calc_covariance method you are thinking of? I don't know it off the top of my head.

zatkins2 commented 5 months ago

(I just asked Susanna Azzoni, an SO BB group person, what the group does now; she says they use namaster. I will try to evangelize ducc to them, maybe they have a desire for faster code!)

mreineck commented 5 months ago

> Can you link the calc_covariance method you are thinking of? I don't know it off the top of my head.

I was thinking about the code here: https://github.com/simonsobs/pspy/blob/master/pspy/cov_fortran/cov_fortran.f90 This is called mainly from pspy/so_cov.py and some pspy tutorials. The code structure is very similar to that of the mode-coupling functions, so it should be straightforward to optimize it in the same way.

mreineck commented 5 months ago

> (I just asked Susanna Azzoni, an SO BB group person, what the group does now; she says they use namaster. I will try to evangelize ducc to them, maybe they have a desire for faster code!)

If I read it correctly, namaster recently switched to ducc for its SHTs, so getting the mode-coupling matrices from ducc as well may not be such a big change. Let's see :-)

zatkins2 commented 5 months ago

I see. I would say that's less urgent (I only use the low-level calls to coupling matrices, not these wrappers).

mreineck commented 4 months ago

I have now merged the mcm_work branch into ducc0. We can start a new branch when more adjustments are necessary.

mreineck commented 4 months ago

I released ducc0 0.34 today, which contains the new functionality.

mreineck commented 4 months ago

I'll close this for now, since there is nothing concrete to do until we have some solid user experience. Please reopen (or open a new issue) if you have any other suggestions!

mreineck commented 1 month ago

Just curious: has this been successfully used in the meantime? Any lessons learned that suggest changing/enhancing the interface?

xzackli commented 1 month ago

@zatkins2 is leading the charge here, I think he can say more!