litebird / litebird_sim

Simulation tools for LiteBIRD
GNU General Public License v3.0

"Classic" 4pi beam convolution and interpolation #12

Closed. mreineck closed this issue 3 years ago.

mreineck commented 4 years ago

Provide a C++ library (with a Python interface) that generates a total convolution data cube from a set of sky and beam a_lm and computes interpolated values for a given list of detector pointings.

Development of this code happens in the directory interpol_ng of https://gitlab.mpcdf.mpg.de/mtr/cxxbase.

Algorithmic details:

External dependencies:

mreineck commented 4 years ago

The algorithm is in place and passes basic tests. I'd be happy about any feedback!

ziotom78 commented 4 years ago

Hi @mreineck , very intriguing! A few comments/questions:

  1. Do you plan to release the package using the Python Package Index, so that it will be available through pip? Or do you think that it would be better to merge the source code in this repository, once it works?

  2. I fully agree about not supporting MPI directly in the C++ code, as we are surely going to heavily parallelize TOD generation among detectors using MPI. If the beam-convolution algorithm remains serial, there will be no real loss.

  3. If I understand correctly, the code calculates the data cube but does not perform the convolution on every time sample. Is this correct? If so, how complicated would it be to implement the latter?

mreineck commented 4 years ago

> Do you plan to release the package using the Python Package Index, so that it will be available through pip? Or do you think that it would be better to merge the source code in this repository, once it works?

In the longer term, this kind of public release is an option, but most likely not for the moment. The interface is nowhere near finished (actually, it critically depends on feedback I hope to get from other people), and before that has happened, I'd probably have to bump the major version number every day. I'm not sure what the best way to proceed is here. Having a fork of cxxbase inside litebird_sim would be fine with me, in any case.

> I fully agree about not supporting MPI directly in the C++ code, as we are surely going to heavily parallelize TOD generation among detectors using MPI. If the beam-convolution algorithm remains serial, there will be no real loss.

For LiteBIRD that's most likely true. For other experiments with higher resolution and potentially less regular beams, it may actually become necessary at some point to add MPI to the C++ code itself, since the data cube may no longer fit into the memory of a single node.

> If I understand correctly, the code calculates the data cube but does not perform the convolution on every time sample. Is this correct? If so, how complicated would it be to implement the latter?

It actually does both. The library provides a Python class, which is initialized by passing sky and beam a_lm and some additional parameters; from this information, the constructor computes the data cube. The class has an additional interpol method, which accepts pointings (currently theta, phi, psi tuples, may change in the future) and returns the interpolated values.
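For illustration, here is a minimal usage sketch of the interface as described; only the class name (PyInterpolator, mentioned later in this thread), the interpol method, and the (theta, phi, psi) pointing format come from the comments here, while the module name, argument names, and the nalm helper are assumptions:

```python
# Minimal sketch under assumed module/argument names; only PyInterpolator,
# interpol(), and the (theta, phi, psi) pointing format are taken from
# this thread.
import numpy as np
import interpol_ng  # hypothetical name for the Python bindings

def nalm(lmax, mmax):
    # number of a_lm coefficients in triangular storage for (lmax, mmax)
    return ((mmax + 1) * (mmax + 2)) // 2 + (mmax + 1) * (lmax - mmax)

lmax, kmax = 1000, 13                                   # sky l, beam m band limits
slm = np.zeros(nalm(lmax, lmax), dtype=np.complex128)   # sky a_lm (T only)
blm = np.zeros(nalm(lmax, kmax), dtype=np.complex128)   # beam a_lm (T only)
slm[0] = blm[0] = 1.0                                   # trivial test input

# The constructor computes the data cube from the sky and beam a_lm.
interp = interpol_ng.PyInterpolator(slm, blm, lmax=lmax, kmax=kmax, epsilon=1e-5)

# interpol() takes (theta, phi, psi) pointings and returns interpolated values.
ptg = np.array([[0.5, 1.0, 0.3],
                [1.2, 2.5, 0.0]])
values = interp.interpol(ptg)
```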

mreineck commented 4 years ago

> If I understand correctly, the code calculates the data cube but does not perform the convolution on every time sample. Is this correct?

Re-reading that comment, I realize that I may have misunderstood what you were saying ... Doing an explicit convolution for each individual sample of a time stream would be prohibitively expensive. It can be done for single samples though, and the code currently has support functions for it (rotate_alm), but they are meant strictly for testing purposes and will not be part of the public interface.

ziotom78 commented 4 years ago

No, I was just referring to the interpolation, not the full convolution. Thanks!

mreineck commented 4 years ago

I think the code now works correctly for unpolarized input (i.e. only slmT and blmT are supplied), but I'd be happy about independent confirmation. @tskisner, @keskitalo, I'd value your opinion very much, not only on correctness, but also on implementation details if you notice anything you don't like.

keskitalo commented 4 years ago

@mreineck The lack of MPI support could be an issue going forward. The data redistribution in libconviqt allows each MPI task to focus on a small number of isolatitude rings. Without redistribution, each process will have data that spans a significant range in latitude. I have two questions:

  1. Will the large latitude range make the convolution more costly?
  2. Will the large latitude range make the locally stored data cube too large?
mreineck commented 4 years ago

Hi Reijo, just a very quick answer for now, more details later :) The algorithm needs to perform an FFT in latitude direction at one point. Later redistribution depending on latitude is of course possible, but won't be necessary for LiteBIRD. Overall the memory consumption of the new algorithm will be lower, since an oversampling factor of only 1.2-1.5 will be required, instead of the customary factor of 2 that was used by LevelS/conviqt.
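To make the memory comparison concrete, here is a back-of-the-envelope estimate; the assumed cube layout ((2*kmax+1) psi planes of roughly ofactor*lmax by 2*ofactor*lmax float64 samples) is my reading, not something stated in this thread:

```python
# Rough memory estimate for the data cube; the layout below is an assumption.
lmax, kmax = 1000, 13
for ofactor in (2.0, 1.3):                        # LevelS/conviqt vs. new code
    ntheta = int(ofactor * (lmax + 1))
    nphi = int(ofactor * (2 * lmax + 1))
    nbytes = (2 * kmax + 1) * ntheta * nphi * 8   # float64 cube
    print(f"ofactor={ofactor}: ~{nbytes / 2**30:.2f} GiB")
```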

keskitalo commented 4 years ago

Thanks Martin. So it is not an issue that each process performs a (very) large number of short FFTs instead of just a handful of large ones? I would expect some latency and efficiency penalties to apply.

mreineck commented 4 years ago

The size of the FFTs does not change with the total number of MPI tasks.

For the data cube generation, the following steps have to be carried out:

For each k in [0;kmax]

A very rough CPU time estimate for this setup is 5*(kmax+1)*(lmax/2000)**3 CPU seconds, measured on a five-year-old laptop.
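Plugging LiteBIRD-like numbers into that cost model (the parameter values below are illustrative):

```python
# Rough setup-cost estimate from the formula quoted above.
kmax, lmax = 13, 1000
t_setup = 5 * (kmax + 1) * (lmax / 2000) ** 3   # CPU seconds
print(f"~{t_setup:.1f} CPU seconds")            # ~8.8 s on the reference laptop
```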

Each step benefits from shared-memory parallelization, and since the work for every k value is independent, it can be parallelized trivially over kmax+1 compute nodes; higher parallelization will not really help performance, and it won't be necessary for memory reasons either.

Once the above setup step is done, the data cube can be sliced into as many different theta regions as desired; this is a big Alltoallv communication.

The interpolation step will perform at about 500000 TOD samples per core per second, varying slightly with the desired accuracy.
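With the quoted rate, the total interpolation cost is easy to estimate (the sample count below is illustrative):

```python
# Back-of-the-envelope interpolation cost at the quoted throughput.
n_samples = 1_000_000_000   # e.g. ~1e9 samples for one detector
rate = 500_000              # interpolated samples per core per second
print(f"~{n_samples / rate / 3600:.2f} core-hours")   # ~0.56 core-hours
```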

So I'm pretty confident that the code can be MPI-parallelized if needed; but this is low priority, since LiteBIRD's resolution won't require lmax beyond 1000 or so, and its sampling frequency is pretty low. The best approach here (I think) is therefore to process one detector per node and compute TODs for many detectors in parallel instead.

mreineck commented 4 years ago

I added some basic documentation to the module today, and I think this should now be ready for independent testing (first for correctness, and afterwards for performance/scaling).

Concerning polarization and HWP, I convinced myself that it will be sufficient to store separate data cubes for T, E and B components - everything else can be quickly derived from those.

Currently these three data cubes would have to be generated by creating three independent PyInterpolator objects. As soon as I have confirmation that the code works correctly, I'll add a mode which generates all three cubes within one and the same object; this will avoid redundant computation of interpolation coefficients and improve performance.
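As a sketch of that interim approach (same interface assumptions as the earlier sketch; the slm_*/blm_* variables are illustrative placeholders for the component a_lm):

```python
# One PyInterpolator (and hence one data cube) per component, as described
# above; variable names and arguments are illustrative.
components = {"T": (slm_T, blm_T), "E": (slm_E, blm_E), "B": (slm_B, blm_B)}
cubes = {name: interpol_ng.PyInterpolator(slm, blm, lmax=lmax, kmax=kmax,
                                          epsilon=1e-5)
         for name, (slm, blm) in components.items()}
```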

mreineck commented 4 years ago

The software can now deal with multiple a_lm sets and can process them either separately or add the contributions together. The docstrings have been updated accordingly.
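A hypothetical sketch of that mode, continuing the assumptions above; the separate flag name is a guess, so check the updated docstrings for the real interface:

```python
# Multiple a_lm sets in one object; separate=True keeps per-set outputs,
# separate=False would add the contributions together (flag name assumed).
interp = interpol_ng.PyInterpolator([slm_T, slm_E, slm_B],
                                    [blm_T, blm_E, blm_B],
                                    separate=True,
                                    lmax=lmax, kmax=kmax, epsilon=1e-5)
```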

mreineck commented 4 years ago

I tuned the interpolation algorithm quite a bit, making use of vector instructions where available. On my laptop I now see a throughput of roughly 1.5 million interpolated samples per second per core, with beammmax=13 and a kernel support of 7x7.

mreineck commented 3 years ago

The algorithm has now been improved so that it uses NFFT interpolation in the psi direction as well. This makes the interpolation speed independent of beammmax. I'm discussing potential inclusion of the code into TOAST with the TOAST developers.

mreineck commented 3 years ago

See https://github.com/hpc4cmb/toast/issues/373

mreineck commented 3 years ago

Performance is now up to roughly 5 million samples per second per core. I can't wait to see this work on real data!

ziotom78 commented 3 years ago

Wow, this is quite impressive! I'll play with it in the next few days and see the best way to integrate it into litebird_sim (through TOAST3 or directly).

mreineck commented 3 years ago

Great! I think that for LiteBIRD purposes (if you plan direct integration), the "simple" interface should be totally sufficient (this is the Interpolator class). I worked on the documentation yesterday, but maybe the most intuitive way to learn about the interface is to look at python/demos/totalconvolve_demo.py in the ducc repository. Please let me know if any clarifications are needed in the docs!
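A rough sketch of that "simple" interface, based on my reading of the ducc documentation; argument names and array shapes are best verified against the demo script mentioned above:

```python
# Sketch of ducc's totalconvolve Interpolator; verify the exact signature
# against python/demos/totalconvolve_demo.py in the ducc repository.
import numpy as np
import ducc0

def nalm(lmax, mmax):
    # number of a_lm coefficients in triangular storage for (lmax, mmax)
    return ((mmax + 1) * (mmax + 2)) // 2 + (mmax + 1) * (lmax - mmax)

lmax, kmax = 256, 13
sky = np.zeros((1, nalm(lmax, lmax)), dtype=np.complex128)   # T-only sky a_lm
beam = np.zeros((1, nalm(lmax, kmax)), dtype=np.complex128)  # T-only beam a_lm
sky[0, 0] = beam[0, 0] = 1.0

interp = ducc0.totalconvolve.Interpolator(sky, beam, separate=False,
                                          lmax=lmax, kmax=kmax,
                                          epsilon=1e-4, nthreads=1)
ptg = np.array([[0.5, 1.0, 0.3]])   # one (theta, phi, psi) pointing
tod = interp.interpol(ptg)          # shape (1, npointings)
```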

mreineck commented 3 years ago

Merged into TOAST 3, so I think we can close this.