ratt-ru / codex-africanus

Radio Astronomy Algorithms Library
https://codex-africanus.readthedocs.io

Track possible multiple correlation support in nifty/ducc gridders #169

Open sjperkins opened 4 years ago

sjperkins commented 4 years ago

Description

  1. Nifty Gridder only supports gridding a single correlation
  2. MS data is (row, chan, corr)
  3. We index on corr to pass a single correlation to nifty, which produces a non-contiguous array.

Solution

  1. Make a contiguous copy of the correlation slice (sketched below).
  2. Investigate whether nifty might now be able to support gridding multiple correlations.
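
A minimal numpy sketch of the problem and the copy workaround (illustrative sizes):

import numpy as np

nrow, nchan, ncorr = 1000, 64, 4
data = np.empty((nrow, nchan, ncorr), np.complex64)   # MS DATA layout

vis = data[:, :, 0]                  # select a single correlation
print(vis.flags["C_CONTIGUOUS"])     # False: strided over the corr axis
vis = np.ascontiguousarray(vis)      # solution 1: make a contiguous copy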
mreineck commented 4 years ago

The ducc gridder still does not support multiple correlations, but it's on my todo list.

I assume that the most typical situation will be four correlations, i.e. four complex numbers for every entry in the MS instead of the current single one, correct? Will the weight array stay the same, or will it also grow by a factor of four?

sjperkins commented 4 years ago

Hi @mreineck. Thanks for keeping this on the horizon.

I would say one, two and four correlations are the typical cases. The WEIGHT and WEIGHT_SPECTRUM arrays always have a correlation dimension which is the same as that in the DATA, FLAG, CORRECTED_DATA etc. arrays.

See the CASA MSv2.0 spec for confirmation: https://casa.nrao.edu/Memos/229.html#SECTION00061000000000000000
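
For reference, a sketch of the column shapes involved (illustrative sizes; column names as in the MSv2 spec):

import numpy as np

nrow, nchan, ncorr = 1000, 64, 4
data = np.zeros((nrow, nchan, ncorr), np.complex64)          # DATA / CORRECTED_DATA
flag = np.zeros((nrow, nchan, ncorr), bool)                  # FLAG
weight = np.ones((nrow, ncorr), np.float32)                  # WEIGHT (no channel axis)
weight_spectrum = np.ones((nrow, nchan, ncorr), np.float32)  # WEIGHT_SPECTRUM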

From admittedly hazy memory, I think I just decided to pass non-contiguous arrays into the nifty gridder. I remember it sent many warning messages about non-contiguous arrays to the console, which I think you removed?

mreineck commented 4 years ago

OK, I'll see how that fits into the design. My hope is that I can use vector instructions when gridding/degridding multiple values that are at identical (u,v,w) locations. If that works, then a single pass doing four correlations might be noticeably faster than four separate passes doing one correlation at a time.
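
A hedged numba sketch of that idea (illustrative only, not ducc's actual code): with the correlation axis innermost and contiguous, the compiler can vectorise the per-cell accumulation over correlations:

import numba
import numpy as np

@numba.njit
def grid_onto_cells(grid, u0, v0, kernel, vis):
    # vis holds all correlations of one visibility at one (u, v, w);
    # the innermost loop runs over a contiguous corr axis, which is SIMD-friendly
    support = kernel.shape[0]
    for j in range(support):
        for i in range(support):
            w = kernel[j] * kernel[i]
            for c in range(vis.shape[0]):
                grid[v0 + j, u0 + i, c] += w * vis[c]

grid = np.zeros((64, 64, 4), np.complex128)
grid_onto_cells(grid, 10, 10, np.ones(6), np.ones(4, np.complex128))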

Passing non-contiguous arrays is probably not such a big deal. There will be slightly more cache misses, but the effect on overall performance is probably small.

sjperkins commented 4 years ago

I would say one, two and four correlations are the typical cases.

@landmanbester, I can't imagine another RA use case, but pinging you in case there's something exotic that I may have missed

sjperkins commented 4 years ago

Passing non-contiguous arrays is probably not such a big deal. There will be slightly more cache misses, but the effect on overall performance is probably small.

This is useful to know. I'm currently re-implementing a Discrete Wavelet Transform in numba (https://github.com/ratt-ru/pfb-clean/pull/10), and the thought of applying the 1D DWT to non-contiguous data made me pause: the original implementation copies rows which are non-contiguous before applying the filter. Do you have any thoughts you'd be willing to share on the matter in https://github.com/ratt-ru/pfb-clean/pull/10? I'm not asking for a code review -- I find this level of discussion useful.
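
For illustration (hypothetical names, not the actual pfb-clean code), the copy-before-filter pattern looks roughly like this:

import numpy as np

def filter_columns(image, kernel):
    # apply a 1D filter along axis 0; image[:, j] is non-contiguous in C order
    out = np.empty_like(image)
    buf = np.empty(image.shape[0], image.dtype)   # contiguous scratch buffer
    for j in range(image.shape[1]):
        buf[:] = image[:, j]                      # copy the strided column
        out[:, j] = np.convolve(buf, kernel, mode="same")
    return out

out = filter_columns(np.random.rand(256, 256), np.array([0.25, 0.5, 0.25]))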

mreineck commented 4 years ago

Whether unaligned accesses are a problem depends primarily on how many operations you do on an array element before going to the next. In the gridder, I read a value from the ms array and then grid it onto support**2 different locations (which are almost always in level-1 cache). So the penalty for the potential cache miss when accessing ms is rather small compared to the time required to actually grid it.

If this were a 1D gridder, things would already look worse, as the cost now scales only linearly with support. And if you use the value for only a single arithmetic operation (as is the case in FFTs, for example), cache misses become really painful. I don't know the details of Discrete Wavelet Transforms, but I'd naively expect that they are more on the FFT side of things...
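
(A rough illustration with made-up numbers: at support = 8, each ms value feeds support**2 = 64 multiply-accumulates in the 2D case, so even a ~100-cycle cache miss on the load is well amortised; a 1D gridder would amortise the same miss over only 8 operations, and an FFT butterfly over one or two.)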

landmanbester commented 4 years ago

I would say one, two and four correlations are the typical cases.

@landmanbester, I can't imagine another RA use case, but pinging you in case there's something exotic that I may have missed

I think that about covers it. The weights are per correlation so, strictly speaking, you have to grid the correlations before converting to Stokes components. I don't know if it makes much of a difference, but I think the diagonal components of the brightness matrix are always real-valued.
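
For concreteness, a sketch of the grid-then-convert step, assuming linear feeds with corr order (XX, XY, YX, YY) and the usual conversion convention (illustrative, not code from any of the libraries discussed):

import numpy as np

# four gridded correlation planes, shape (nv, nu, 4), order XX, XY, YX, YY
grids = np.zeros((64, 64, 4), np.complex128)

stokes_I = 0.5 * (grids[..., 0] + grids[..., 3])    # (XX + YY) / 2
stokes_Q = 0.5 * (grids[..., 0] - grids[..., 3])    # (XX - YY) / 2
stokes_U = 0.5 * (grids[..., 1] + grids[..., 2])    # (XY + YX) / 2
stokes_V = 0.5j * (grids[..., 2] - grids[..., 1])   # i * (YX - XY) / 2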

bennahugo commented 4 years ago

Unaligned accesses can create a bit of havoc when vectorizing code using special instruction sets. I saw some speedup when I did this vectorization by hand once (not the full 8-fold), but that was a long time ago. Compilers have likely become smarter in the meantime.

I don't know if numba exploits special instruction sets or not, though.


mreineck commented 4 years ago

Unaligned accesses can create a bit of havoc when vectorizing code using special instruction sets.

Yes. In my particular situation it wouldn't be a problem though, since during gridding I'd basically have a (potentially) unaligned load per visibility, and the remaining support**2 memory accesses would be aligned. I still don't expect anything near an 8-fold speedup, however...

bennahugo commented 4 years ago

Yeah, I don't expect it to make any difference in your case. Sorry, I forgot you wrote a w-stacking gridder. With w-projection it makes more of a difference.


sjperkins commented 4 years ago

Compilers have likely become smarter in the meantime. I don't know if numba exploits special instruction sets or not, though.

numba defers to LLVM for SIMD: http://numba.pydata.org/numba-doc/latest/user/faq.html?highlight=simd#does-numba-vectorize-array-computations-simd

I've seen numba generate vectorised assembly code. Something like the following should show you the ASM (a minimal runnable sketch; axpy is just a placeholder example):

import numba
import numpy as np

@numba.njit
def axpy(a, x, y):
    # a simple loop that LLVM can auto-vectorise
    for i in range(x.shape[0]):
        y[i] = a * x[i] + y[i]

axpy(2.0, np.ones(1024), np.ones(1024))      # call once to trigger compilation
for sig, asm in axpy.inspect_asm().items():  # dict keyed by compiled signature
    print(asm)
mreineck commented 4 years ago

After finding another way of vectorizing the inner loops of the gridding/degridding functions, I think I won't support simultaneous processing of multiple correlations after all ... The issue is that if we process, say, four correlations together, then the amount of data which is touched in the innermost loops also becomes four times larger, and this can make the difference between very few and very many L1 cache misses. Also it would mean reserving memory for 4 complex arrays of size nu*nv instead of just one, which is probably not the best thing to do for really large dirty images. So unless there are important reasons for switching, I propose to keep the current implementation.

(It's of course true that with separate processing we have to do the same kernel evaluations repeatedly, but I don't think this is an important factor any more.)
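
To put the memory point in rough numbers (illustrative image size, not a figure from this thread):

nu = nv = 16384               # a large dirty image
gib = nu * nv * 16 / 2**30    # one complex128 grid, in GiB
print(gib)                    # 4.0 GiB per correlation grid
print(4 * gib)                # 16.0 GiB if four correlations are gridded at once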

bennahugo commented 4 years ago

You don't need to grid all the data unless you are forming multiple Stokes parameters (which isn't the typical use case). Handling two correlations at once is, however, quite possible and avoids the large memory reservations: see, for instance, the pull request going in at the moment for my facet-based numba gridding and degridding scheme: https://github.com/ska-sa/codex-africanus/blob/681ca8b46dad4693ca0484608c863f9f9ba77f08/africanus/gridding/perleypolyhedron/policies/stokes_conversion_policies.py
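
A hedged sketch of that idea (hypothetical function, not the code in the linked PR): combine the two parallel-hand visibilities into a single Stokes-I visibility per sample before gridding, so only one nu*nv grid is needed:

def stokes_I_vis(vis_xx, vis_yy, w_xx, w_yy):
    # weighted combination of the parallel hands into one Stokes-I visibility
    return (w_xx * vis_xx + w_yy * vis_yy) / (w_xx + w_yy)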

The one exception is of course DFT-based RM synthesis (e.g. Brentjens et al.), where you will need to allocate nu*nv for three grids (IQU).

Cheers


mreineck commented 4 years ago

I see, thanks! But to be honest, I think that all this trickery could also be done outside the gridder proper, and if possible I'd like to keep the code as unaware as possible of the details of the data it is processing. Currently the library does not know (and does not care) whether it is dealing with Stokes parameters, correlations or something completely different, which I think is a good thing.