BDA - Inconsistent intervals in data prior to averaging.

JSKenyon commented 4 years ago

In order to calibrate BDA data (particularly in the context of Jones Chains) it will be necessary to create an implicit mapping from the samples we have back to a more regular grid. This is not the averager's responsibility, but what the averager does when BDAing the data will be important.

I am currently looking at using the greatest common divisor (GCD) of the BDA interval column to define the implicit high resolution grid. For a single MS with uniform - either identical or all integer multiples of some value - intervals (prior to averaging), this should work reliably. However, I know that it is possible (if not common) to have non-uniform interval values in the pre-BDA data. This is problematic as it means that the post-BDA intervals won't necessarily have a sensible GCD - it might not exist/be too small to be useful.

My question is then simply what is sensible in this situation and what does the averager do?

o-smirnov commented 4 years ago

Yeah correlators do funny stuff to dump rates sometimes. If we assume a regular time grid, we will be bitten on our backsides by a non-compliant MS very quickly.

What does QuartiCal currently do if a full-res MS has an irregular time grid?

o-smirnov commented 4 years ago

By the way, another avenue to consider -- we can always have xova add extra columns or metadata that aids in mapping back onto the original pre-BDA grid.

However, given that the MSv2 spec says nothing about regular time grids whatsoever, first prize would still be to deal with that as gracefully/sensibly as possible.

JSKenyon commented 4 years ago

Probably runs through without complaint but likely doesn't do the right thing. In handling the BDA case this may resolve itself, but it is - broadly speaking - a difficult problem. At some point I believe there has to be an implicit regular grid - this could technically be at a higher resolution than the data (hence looking at a GCD approach). However, without this implicit grid, it becomes almost impossible (or at the very least impossibly complicated) to correctly map multiple gains against each other.

JSKenyon commented 4 years ago

Let's consider a few examples:

The MS has only a single unique interval value. Any BDA averaging interval will be some multiple of the original and a GCD approach is perfectly appropriate.
The MS has several unique interval values, but they are all integer multiples of each other/some number of a sensible size. A good example would be [1, 1.5]. Their GCD is 0.5 which, while smaller than 1, is not going to blow up a mapping.
The MS has several unique interval values, but they are integer multiples of a number of small size. An example would be [1, 1.01]. Their GCD would be 0.01, which would lead to a hundred-fold increase in the size of the mapping. Depending on the severity, this is the worst case scenario - think about the limit as the GCD tends to zero. We would need an infinitely large map.

JSKenyon commented 4 years ago

By the way, another avenue to consider -- we can always have xova add extra columns or metadata that aids in mapping back onto the original pre-BDA grid.

This is likely unnecessary (bar perhaps storing the original unique interval values, but these can be worked out). I am perfectly happy with the first two examples above. I am just wondering whether the last one is perhaps asking a bit much. Infinitely flexible code usually suffers somewhere. What could work in the third case is a resampling onto a new grid, but I feel I need to stress vehemently how much of a bad idea I think that is. This would essentially mean QuartiCal would be doing averaging on top of averaging and would require cake-unbaking of nightmarish proportions to produce a visibility output.

sjperkins commented 4 years ago

My question is then simply what is sensible in this situation and what does the averager do?

In the case of standard averaging, dt = (TE - (IE / 2)) - (TS - (IS/2)) is used to decide how many seconds worth of information is present in the bin. (TS = time_start, TE = time_end, IS = interval_start, IE = interval_end). This is explained further in.

https://codex-africanus.readthedocs.io/en/latest/averaging-api.html

In the case of BDA, the same dt is used to calculate the baseline speed (and hence the amount of decorrelation in the bin).

https://github.com/ska-sa/codex-africanus/blob/7418c6f2ef4872dacd5238fb76e861348c7b3be7/africanus/averaging/bda_mapping.py#L167-L180

The two approaches are in some ways very similar. The end result is the same: If the row matches the criterion for inclusion in the bin, the INTERVAL will be added to the bin's INTERVAL sum, which will simply be exported to the output row.

sjperkins commented 4 years ago

I wonder if this may be an appropriate data structure to help with BDA mappings: https://en.wikipedia.org/wiki/Interval_tree

sjperkins commented 4 years ago

Pure Python implementation here: https://pypi.org/project/intervaltree/

sjperkins commented 4 years ago

A numba implementation is plausible:

http://numba.pydata.org/numba-doc/latest/extending/interval-example.html

sjperkins commented 4 years ago

Another thought occurred: Are we not regridding data from a set of intervals onto another set of intervals?

JSKenyon commented 4 years ago

Pretty much - in the BDA case we need to map irregular samples in time onto a regular gain grid.

sjperkins commented 4 years ago

I wonder if scipy's griddata would work?

JSKenyon commented 4 years ago

That would be an explicit regridding which is what we want to avoid. By creating the mappings, the operation becomes implicit and there is no need to make the data larger. For reference, the solution intervals may be smaller than the BDA averaging interval. Additionally, the regridding would have to be onto the GCD or some preselected convenient interval as otherwise it would vary per gain. Whilst I think griddata could work, I suspect it wouldn't be the best solution.

o-smirnov commented 4 years ago

No no, we're definitely not regridding! We're always aggregating whole intervals together. It's a discrete operation, if that makes sense. So it's all in the indexing...

OK, to summarize what I think we've agreed on:

Before we even consider BDA

We won't assume any regularity in the time grid (until we get to Planck time... but we do want to 64 bits of fewer in our indices...)
We may assume that the time grid, irregular though it may be, is mostly shared across baselines.
We can therefore speak of "raw" timeslot numbers (i=0 ... N-1), and corresponding timestamps t[0]...t[N-1] and intervals dt[0]...dt[N-1] (corresponding to TIME and INTERVAL columns).
Not every baseline has to be present in every timeslot, and some may be flagged (absent or flagged makes no difference)
QuartiCal shall define solution intervals in terms of time span, and construct a (many-to-1) mapping from timeslot #i to interval #j.

After BDA

Each baseline is averaged down to a set of aggregated timeslots, potentially different per baseline. Aggregated timeslot #j corresponds to a consecutive chunk of M raw timeslots, i.e. i0[j]:i1[j]+1. Its nominal timestamp (TIME) is given by

(t[i0[j]] - dt[i0[j]]/2 + t[i1[j]] + dt[i1[j]/2]) / 2,

and its interval (INTERVAL) is given by the sum of the raw intervals. (By contrast, TIME_CENTROID and EXPOSURE will be the effective timestamp and exposure time, i.e. they will be affected by weights and flagging in the averaging).

QuartiCal has access to the mapping from j to i0[j], i1[j] per baseline, and also to the original raw time and interval arrays. We can debate how to pass this information. For regularly sampled MSs, @JSKenyon's GCD approach will work. In the future, xova can provide a BDA subtable with this mapping (and/or extra main columns), etc.
Given such a mapping, QuartiCal can treat each aggregated sample as M identical samples given on the original raw time grid.

Is this a fair summary?

JSKenyon commented 4 years ago

Summary seems fair. The only point I might argue is this:

QuartiCal has access to the mapping from j to i0[j], i1[j] per baseline, and also to the original raw time and interval arrays. We can debate how to pass this information. For regularly sampled MSs, @JSKenyon's GCD approach will work. In the future, xova can provide a BDA subtable with this mapping (and/or extra main columns), etc.

Unless I have misunderstood, I do not have access to the raw time and interval arrays - only the BDA time and interval arrays.

o-smirnov commented 4 years ago

Well let me rephrase that as a question. Do you need access to these arrays? Because it can be arranged.

JSKenyon commented 4 years ago

Not really - I think I actually just need access to whatever the constituent intervals were. Let's assume that we add a BDA subtable which contains all the unique interval values present in the MS prior to BDA. Then we need only have a (row, num_unique_intervals) column in the main table that stores the number of constituent elements. This should be quite small, as 8-bit uints would do (I cannot imagine averaging more than 255 intervals together) and should be sufficient to infer the required mapping.

JSKenyon commented 4 years ago

All of this subject to further experimentation/implementation.

ratt-ru / xova

BDA - Inconsistent intervals in data prior to averaging. #16

Before we even consider BDA

After BDA