ratt-ru / QuartiCal

CubiCal, but with greater power.
MIT License
8 stars 4 forks source link

Use units to specify gain timeslots and frequencies #33

Closed bennahugo closed 3 years ago

bennahugo commented 4 years ago

Hi this is just a request to change to using units (as is done for the tile size for the time dimension) on the gains. I think this also makes more sense for non-regularly sampled data?

JSKenyon commented 4 years ago

Yeah - this is on my to-do list. It will also be important for irregular BDA data (when the high resolution data was already irregular).

JSKenyon commented 3 years ago

I am working on this but it actually requires a somewhat frustrating number of changes. Where previously we could get away with knowing only array dimensions when determining solution intervals, we now have to interact with the data. This also complicates a fair amount of support code (stat computation for example) as we cannot simply compute solution interval mappings on the fly. These unique time to bin mappings will need to be passed around.

JSKenyon commented 3 years ago

I think I have a working implementation of this, but it has exposed a problem. When specifying solution interval in time, it is possible that one chunk of input data may have many more gain solutions associated with it than another. Zarr does not handle this and QuartiCal will crash when attempting to write out the gains. I will need to think about solving this one. The easiest approach would be to simply rechunk the gain arrays before writing them. It may also be possible to consider other formats

o-smirnov commented 3 years ago

As they say in the military, no plan survives contact with ~the enemy~ real data...

JSKenyon commented 3 years ago

Some mucking around has revealed that it is possible to write the gains out as netCDF provided one uses engine="h5netcdf", invalid_netcdf=True. This is because netCDF doesn't support complex numbers by default. Similarly, it needs to be loaded with engine="h5netcdf". This doesn't preserve the chunking information like Zarr does, but that is not a deal breaker as we could write that as an attribute and restore it on load. I am not comitting to this just yet, but it is definitely an option.

o-smirnov commented 3 years ago

Why must the gains have a specific chunking stored? I can imagine a gains DB being passed around, and loaded in a different chunking scenario...

JSKenyon commented 3 years ago

Absolutely - hence it not being a deal breaker. It was a just a nice-to-have as it makes it a little clearer how the associated data was carved up. But you can do a xarray.open_dataset("foo.nc", chunks={"time_int": (20, 30)}) without issue when using netCDF.

sjperkins commented 3 years ago

One can attach metadata to xarray datasets. It may be possible to store the chunks as netcdf dataset metadata, although there may be a chicken and egg situation here.

JSKenyon commented 3 years ago

I have opted to go with rechunking the time axis of the gains before writing them. This is not set in stone, and needs to be monitored to confirm that it doesn't muck around with ordering. That said, my basic tests seem almost unaltered so it should be ok. May be more problematic in the distributed case but we can cross that bridge when we come to it.

I have pushed my changes to a branch. Everything seems in order for specifying time in seconds. Need to do frequency before making a PR but that might have to wait till after next week.

JSKenyon commented 3 years ago

I should note that this completely breaks a whole host of tests which I will need to fix. Will likely wait until I have the frequency stuff working too so that I don't end up having to fix the tests multiple times.

JSKenyon commented 3 years ago

Ok, there is now a PR lurking for this. Will merge into master soon, just giving myself the remainer of the day to think of things which may be wrong. All the tests are passing, though they had to be heavily altered.