ratt-ru / QuartiCal

CubiCal, but with greater power.
MIT License

Nested parallelism doesn't work as expected for BDA data #42

Closed · JSKenyon closed this issue 3 years ago

JSKenyon commented 3 years ago

Nested parallelism seems to cause a slow-down for the BDA kernels specifically. This requires further investigation, as well as confirmation that the non-BDA case continues to work, since we cannot discount upstream regressions.

Potential causes:

JSKenyon commented 3 years ago

I am pretty much convinced that this is cache related. The problem is that the way we handle BDA data creates an implicit high-resolution time grid. This in itself is lightweight, as it is really just a row mapping. The issue stems from the fact that a single row may be mapped into several solution intervals. What this means is that a visibility at row N might be required for the same solution interval as the visibilities at row N + offset. The moment this offset is large enough, cache performance is going to take a hit. The nested parallelism makes matters worse, as we parallelise over solution intervals in time. This means there will be even more hopping around the input data.
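A toy NumPy sketch of the access pattern being described (not QuartiCal's actual kernels; `row_map`, the array shapes and the stride pattern are all made up for illustration):

```python
import numpy as np

n_row, n_chan, n_corr = 10000, 64, 4
vis = np.random.randn(n_row, n_chan, n_corr) + 0j  # fake visibilities

# In the BDA case, heavily averaged rows span many high-resolution time
# steps, so rows contributing to the same solution interval may be far
# apart in row number. Here we fake that by assigning rows to intervals
# with a large stride.
n_t_int = 100
row_map = np.arange(n_row) % n_t_int  # solution interval index per row

accum = np.zeros((n_t_int, n_chan, n_corr), dtype=vis.dtype)

# Working per solution interval (as a per-interval parallel kernel would)
# touches rows 0, 100, 200, ... for interval 0 - large strides through
# `vis`, which is what hurts the cache once the offsets grow large enough.
for t_int in range(n_t_int):
    rows = np.where(row_map == t_int)[0]
    for r in rows:
        accum[t_int] += vis[r]
```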

Possible solutions:

o-smirnov commented 3 years ago

Isn't this a luxury problem, really? If you can already calibrate a BDA'd MeerKAT MS in minutes, the pipeline bottleneck is clearly elsewhere...

JSKenyon commented 3 years ago

Of course, this is more for my own interest than a pressing need. I felt I should document that this exists, along with my findings. Maintaining these nested paths is something I feel relatively strongly about, as it is the only way I can see to cope with very large chunk sizes. This is more relevant for non-BDA data. But I likely won't push too hard on this for now - I just wanted to see if there was an easy win to be had.

JSKenyon commented 3 years ago

Just leaving myself a reminder - this is likely less to do with the peculiarities of the BDA case and more to do with the way in which parallelism is handled in Numba. It boils down to prange taking its argument and slicing it into nthread contiguous pieces. As an example, for prange(4) with two threads, thread one would be allocated [0,1] and thread two would be allocated [2,3]. This works well for small inputs, where the majority of the computation can fit in cache. However, when the inputs are large, the threads will likely be working on completely different memory regions and cache performance will degrade. In that case, returning to the example, it would be better to allocate [0,2] to thread one and [1,3] to thread two, or alternatively to have the threads greedily consume iterations from a queue. I am not sure if this behaviour is available in Numba at present.
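A quick illustration of the two assignment schemes (pure Python, with made-up helper names; only the index allocation is the point - the blocked scheme mirrors the prange slicing described above):

```python
def blocked_split(n_iter, n_threads):
    """Contiguous slices: [0,1] to thread 0 and [2,3] to thread 1 for (4, 2)."""
    chunk = -(-n_iter // n_threads)  # ceiling division
    return [list(range(t * chunk, min((t + 1) * chunk, n_iter)))
            for t in range(n_threads)]

def interleaved_split(n_iter, n_threads):
    """Strided slices: [0,2] to thread 0 and [1,3] to thread 1 for (4, 2)."""
    return [list(range(t, n_iter, n_threads)) for t in range(n_threads)]

print(blocked_split(4, 2))      # [[0, 1], [2, 3]]
print(interleaved_split(4, 2))  # [[0, 2], [1, 3]]

# With blocked splitting over solution intervals, thread 0 reads the rows
# feeding the first half of the intervals while thread 1 reads the rows
# feeding the second half, i.e. two widely separated regions of the input.
# Interleaving keeps the threads on neighbouring intervals and hence on
# neighbouring rows.
```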

JSKenyon commented 3 years ago

This has mostly been fixed. It can still be improved upon when I improve the generic internals.