This PR addresses a long-standing issue - what do we do when we have very large solution intervals? QuartiCal already has the ability to process solution intervals in parallel. However, the limiting case (which is not entirely unusual) of (inf, inf) solution intervals leaves us with no (easy) axis over which to do things in parallel.
I have attempted many different approaches (see branch name), all of which had a sharp edge such as performance degradation or increased memory footprint. The approach in this PR is the culmination of all those previous attempts and should allow for an arbitrary degree of internal parallelism with low to virtually non-existent memory and compute overheads.
For the sake of posterity, I will quickly detail the idea behind the approach (which is actually fairly simple). The main driver behind these changes was the realization that accumulation over a solution interval is completely arbitrary i.e. we are free to sum the elements contributing to a solution interval in any order and (more importantly) need not accumulate them into a single element immediately. QuartiCal now leverages this behavior to create an upsampled (slight misnomer) grid of solution intervals in time if and only if we are in a regime where we need to partition the problem into something we can process in parallel. This means that when we have a reasonable number of solution intervals, the new code effectively does nothing i.e. has virtually no impact on memory or performance. However, when we have very large solution intervals, they will be automatically partitioned into sub-intervals which can be processed in parallel. This has slight memory and compute overheads which scale with the inverse of the number of solution intervals i.e. the overheads are at their worst when we have (inf, inf) solution intervals. This is, however, largely irrelevant as (inf, inf) solution intervals already imply a memory reduction of order n_time*n_chan*n_ant; the potential n_thread increase is negligible next to this reduction. The additional compute is associated with accumulating the sub-intervals onto the so-called native intervals and is very cheap, as it happens at the resolution of the solution intervals i.e. the accumulation is only over a small array.
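To make the idea concrete, here is a minimal NumPy sketch (illustrative only, not QuartiCal code; all names are hypothetical). It shows why the accumulation is order-free: reducing the time axis directly onto a single native interval gives the same result as partitioning time into sub-intervals, reducing each independently (the part that can run in parallel), and then cheaply accumulating the small sub-interval grid onto the native interval.

```python
import numpy as np

# Hypothetical stand-in for per-visibility quantities contributing to a
# single (inf, inf)-style solution interval along the time axis.
rng = np.random.default_rng(42)
n_time, n_chan = 1000, 64
data = rng.normal(size=(n_time, n_chan))

# Direct accumulation onto one "native" interval spanning all times.
native = data.sum(axis=0)

# Partition the time axis into sub-intervals (e.g. one per thread),
# accumulate each independently, then reduce onto the native interval.
n_sub = 4
sub_totals = [chunk.sum(axis=0) for chunk in np.array_split(data, n_sub)]
upsampled = np.stack(sub_totals)    # shape (n_sub, n_chan) - small
recombined = upsampled.sum(axis=0)  # cheap final accumulation

assert np.allclose(native, recombined)
```

The final reduction operates on an array of shape (n_sub, n_chan) rather than the full data, which is why the extra compute is negligible and the extra memory scales only with the number of sub-intervals.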
There are still further improvements which can be made should the need arise, but I am relatively confident that this should suffice for most reasonable use cases.