Writing data with fewer correlations than are present in the MS

ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS

https://dask-ms.readthedocs.io

Writing data with fewer correlations than are present in the MS #93

Open JSKenyon opened 4 years ago

JSKenyon commented 4 years ago

dask-ms version: 0.2.3
Python version: 3.6.9
Operating System: Pop!_OS 18.04 LTS

This is a feature request, not a bug.

xarray offers the following awesome functionality:

updated_xds = xds.sel(corr=[0, 3])

which in this example instantly makes a new xds using only some of the correlations. This is great because it is applied to the entire xds, which means all the fields remain consistent. The only drawback comes when attempting to write MS columns, as a field with two correlations on the xds cannot be written to a column containing four correlations. It would be really cool if dask-ms could support this.

My instinct is that this might become simple if the corr dimension is given a coordinate. That way there is a paper trail showing the correlations which have been selected out, and consequently a way to determine how to store them.
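A minimal sketch of the coordinate idea, assuming a toy in-memory dataset rather than one produced by dask-ms. Labelling corr with integer positions is an assumption made for illustration; dask-ms may or may not attach such a coordinate.

```python
import numpy as np
import xarray as xr

# Toy stand-in for one dataset: 5 rows, 8 channels, 4 correlations.
data = np.arange(5 * 8 * 4, dtype=np.float64).reshape(5, 8, 4)
xds = xr.Dataset(
    {"DATA": (("row", "chan", "corr"), data)},
    # Assumed coordinate: the position of each correlation in the MS column.
    coords={"corr": np.arange(4)},
)

updated_xds = xds.sel(corr=[0, 3])

# The coordinate survives the selection and records which correlations were kept.
selected = updated_xds.corr.values.tolist()
print(selected)
```

With the coordinate in place, a writer could in principle inspect `updated_xds.corr` to work out where each remaining correlation belongs in the four-correlation column.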

sjperkins commented 4 years ago

A temporary workaround is to use xarray.Dataset.reindex.

JSKenyon commented 3 years ago

Just a brief follow-up here. xarray.Dataset.reindex does work (and is awesome) but there is a slight problem. reindex uses a fill value for the added elements. This can lead to on-disk data being overwritten (e.g. writing two correlations, reindexed to four, back to a column will overwrite values with the fill-value). There are other approaches:
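The fill-value problem can be reproduced on a toy dataset. This is a sketch under the assumption that corr carries an integer coordinate; the array names are illustrative and nothing here touches an actual MS.

```python
import numpy as np
import xarray as xr

# Pretend on-disk column: every value is 1.0.
on_disk = np.ones((2, 3, 4))
xds = xr.Dataset(
    {"DATA": (("row", "chan", "corr"), on_disk.copy())},
    coords={"corr": np.arange(4)},  # assumed coordinate on corr
)

# Work on two correlations only, then pad back to four for writing.
subset = xds.sel(corr=[0, 3]) * 2.0
padded = subset.reindex(corr=np.arange(4))

# Correlations 1 and 2 now hold the fill value (NaN by default),
# not the 1.0 that is on disk.
clobbered = np.isnan(padded.DATA.values[..., [1, 2]]).all()
print(clobbered)
```

Writing `padded.DATA` back to the column would therefore replace the untouched correlations with the fill value, which is the overwrite described above.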

1. Slice the data rather than selecting it. This increases code complexity as the slices then need to be used everywhere.
2. Re-read the data and combine. This will lead to more data being in memory, perhaps unnecessarily.
3. Support writing of slices back to disk. This will make writes slower but is likely the "easiest" from a user perspective. I do not know how tractable this is for other formats.
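For comparison, the re-read-and-combine approach can be sketched with plain NumPy arrays standing in for the column contents. The names below are illustrative, and a real implementation would read and write through dask-ms.

```python
import numpy as np

corr_idx = [0, 3]  # correlations that were selected and modified

# Stand-in for the full column, freshly re-read from disk.
on_disk = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
# Stand-in for the modified two-correlation data.
updated = np.full((2, 3, 2), -1.0)

# Combine: overwrite only the selected correlations, keep the rest as read.
merged = on_disk.copy()
merged[..., corr_idx] = updated
```

The full four-correlation array has to be held in memory alongside the modified subset, which is the cost this approach carries.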

JSKenyon commented 3 years ago

Just checking in here - this is becoming increasingly necessary. I might have to start considering option two above, which is really inelegant. @sjperkins How much work do you think would be involved in making the writes aware of the correlation axis and using putcolslice appropriately?

sjperkins commented 2 years ago

Following on from our online conversation

Currently dask-ms supports ranged writes on the row dimension only.
In principle, supporting this would require ranged writes on corr, for instance. chan might also be useful.
Under the hood, in CASA Tables, the actual operation would be performed using a putcolslice(column, data[rr:rr + rl], blc, trc, startrow=rs, nrow=rl) command.
Thus, instead of merely slicing out rows for writing, dask-ms would have to support slicing out arbitrary dimensions for writing (particularly corr).
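As a rough illustration of what slicing on corr would involve, the sketch below derives blc, trc and a stride for a set of selected correlations and applies them to an in-memory array. Both helpers are hypothetical. The sketch assumes that blc and trc are inclusive corners given in (chan, corr) order and that a stride can be passed alongside them (python-casacore's putcolslice accepts an `inc` argument); it emulates the write with NumPy rather than calling casacore.

```python
import numpy as np

def corr_slice_bounds(nchan, corr_idx):
    """Hypothetical helper: (blc, trc, inc) covering all channels and the
    given evenly spaced correlation indices. Corners are inclusive."""
    corr_idx = list(corr_idx)
    if not corr_idx:
        raise ValueError("no correlations selected")
    steps = {b - a for a, b in zip(corr_idx, corr_idx[1:])}
    if len(steps) > 1 or any(s <= 0 for s in steps):
        raise ValueError("correlations must be increasing and evenly spaced")
    step = steps.pop() if steps else 1
    return [0, corr_idx[0]], [nchan - 1, corr_idx[-1]], [1, step]

def emulate_putcolslice(column, data, blc, trc, inc, startrow, nrow):
    """In-memory stand-in for a sliced column write (not the casacore call)."""
    cell = tuple(slice(b, t + 1, i) for b, t, i in zip(blc, trc, inc))
    column[(slice(startrow, startrow + nrow),) + cell] = data

column = np.zeros((5, 8, 4))  # stands in for the on-disk column
data = np.ones((2, 8, 2))     # two rows, two correlations

blc, trc, inc = corr_slice_bounds(8, [0, 3])
emulate_putcolslice(column, data, blc, trc, inc, startrow=1, nrow=2)
```

Only rows 1 and 2, and correlations 0 and 3, are touched; everything else in `column` keeps its original value.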

A way forward may be possible by passing in blc, trc ranges into the blockwise call, rather than individual values, as is presently done here https://github.com/ska-sa/dask-ms/blob/55c987e5f00c24a82c16363f681a966e45590e06/daskms/writes.py#L629-L635
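One way to picture "ranges rather than individual values" is to compute an inclusive (blc, trc) pair for every (chan, corr) block from the chunk sizes. The helpers below are hypothetical and do not reproduce the linked dask-ms code; they only show the bookkeeping such ranges would need.

```python
from itertools import accumulate, product

def chunk_ranges(chunks):
    """Turn chunk sizes along one dimension into inclusive (start, end) pairs."""
    ends = list(accumulate(chunks))
    starts = [0] + ends[:-1]
    return [(s, e - 1) for s, e in zip(starts, ends)]

def block_corners(chan_chunks, corr_chunks):
    """Hypothetical helper: one (blc, trc) pair per (chan, corr) block."""
    return [
        ([c0, p0], [c1, p1])
        for (c0, c1), (p0, p1) in product(
            chunk_ranges(chan_chunks), chunk_ranges(corr_chunks)
        )
    ]

# Eight channels in two chunks, four correlations in two chunks.
corners = block_corners((4, 4), (2, 2))
for blc, trc in corners:
    print(blc, trc)
```

Each pair could then accompany the matching data block into the write function, alongside the row range that is already passed.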