Closed JSKenyon closed 1 year ago
Ok, I think that this is ready for another pair of eyes, if only to sanity check what I have done so far. I think that it is pretty simple. One thing to note is that I elected not to use the current CLI infrastructure. I did make an attempt but ran into issues with nested subparsers.
Currently the CLI is very basic and only provides the option to `stat` or `rebase` a fragment. The first of these simply reports the parents of the target fragment. The second allows a user to modify the parent in place. This is useful if you want to exclude bad/irrelevant parents.
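As a rough illustration of the two subcommands described above, here is a minimal, self-contained sketch. It is not the code in this PR: the `fragments` program name, the argument names, the in-memory `PARENTS` mapping standing in for fragment metadata, and the cycle check in `rebase` are all assumptions made for the example.

```python
import argparse

# Hypothetical stand-in for fragment metadata: fragment name -> parent name.
# A parent of None marks the root dataset.
PARENTS = {"root": None, "frag_a": "root", "frag_b": "frag_a", "frag_c": "frag_b"}


def stat(parents, name):
    """Return the chain of ancestors of `name`, nearest parent first."""
    chain = []
    current = parents[name]
    while current is not None:
        chain.append(current)
        current = parents[current]
    return chain


def rebase(parents, name, new_parent):
    """Point `name` at `new_parent` in place, refusing to create a cycle."""
    if new_parent == name or name in stat(parents, new_parent):
        raise ValueError(f"rebasing {name} onto {new_parent} would create a cycle")
    parents[name] = new_parent


def build_parser():
    parser = argparse.ArgumentParser(prog="fragments")  # hypothetical name
    sub = parser.add_subparsers(dest="command", required=True)
    p_stat = sub.add_parser("stat", help="report the parents of a fragment")
    p_stat.add_argument("fragment")
    p_rebase = sub.add_parser("rebase", help="change the parent of a fragment")
    p_rebase.add_argument("fragment")
    p_rebase.add_argument("parent")
    return parser


def main(argv, parents):
    args = build_parser().parse_args(argv)
    if args.command == "rebase":
        rebase(parents, args.fragment, args.parent)
    # Both commands finish by returning the (possibly updated) ancestry.
    return stat(parents, args.fragment)
```

In this sketch, calling `main(["rebase", "frag_c", "frag_a"], parents)` drops `frag_b` from the ancestry of `frag_c`, which is the "exclude a bad parent" use case mentioned above.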
The CLI could optionally be extended with the following, more complicated functionality:
- `merge`: Write the contents of a fragment and its parents back to the root. This shouldn't be too difficult but can possibly wait until we rework `__dask_ms_metadata__`, i.e. it will be much easier if all the info to reproduce the appropriate `xarray.Dataset` objects is present in the metadata. This is important functionality as applications which don't use `dask-ms` directly won't be able to utilize the fragments directly.
- `composite`: Produce a new fragment which cherry-picks data variables from multiple fragments. This is less urgent but may become important if users need to mix and match state from various fragments, e.g. retaining the newest version of CORRECTED_DATA while rolling back the flags to an earlier fragment. This may be easier than `merge` as fragments are always `zarr`, so the ordering/grouping is implicit in the way the data is stored. The difficulty here will be the selection mechanism/CLI interface.

This PR doesn't depend on #284 but that PR is likely also required for this functionality to be exploited as it is sometimes necessary to rechunk data being written to a fragment due to zarr chunk size limits.
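To make the proposed `composite` behaviour more concrete, the sketch below cherry-picks variables from two in-memory datasets using plain `xarray`. The `composite` helper, the `selection` mapping, and the fragment labels are hypothetical, and nothing here reads or writes real fragments; it only shows one possible shape for the selection mechanism.

```python
import numpy as np
import xarray as xr


def composite(fragments, selection):
    """Return a dataset taking each variable in `selection` from its named fragment.

    `fragments` maps a label to an xarray.Dataset and `selection` maps a
    data variable name to the label of the fragment it should come from.
    Both arguments are assumptions made for this example.
    """
    picked = [fragments[label][[name]] for name, label in selection.items()]
    return xr.merge(picked)


shape, dims = (4, 2), ("row", "chan")
older = xr.Dataset({
    "CORRECTED_DATA": (dims, np.zeros(shape, dtype=np.complex64)),
    "FLAG": (dims, np.zeros(shape, dtype=bool)),
})
newest = xr.Dataset({
    "CORRECTED_DATA": (dims, np.ones(shape, dtype=np.complex64)),
    "FLAG": (dims, np.ones(shape, dtype=bool)),
})

# Keep the newest CORRECTED_DATA while rolling FLAG back to the older fragment.
result = composite(
    {"older": older, "newest": newest},
    {"CORRECTED_DATA": "newest", "FLAG": "older"},
)
```

A mapping from variable name to fragment keeps the selection explicit, which might translate to repeated `VARIABLE=fragment` arguments on a command line, though that interface is only a guess.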
Could you also please rebase this PR on master?
I have rebased to `master`. I hope I did it correctly - I haven't had much practice with rebase.
- [x] Tests added / passed

  If the pep8 tests fail, the quickest way to correct this is to run `autopep8` and then `flake8` and `pycodestyle` to fix the remaining issues.

- [x] Fully documented, including `HISTORY.rst` for all changes and one of the `docs/*-api.rst` files for new API

  To build the docs locally:
This PR is a WIP which investigates reading data from multiple sources (with potentially different backends) and utilising `xarray` functionality to merge the resulting datasets dynamically. Practically, this makes it possible to read the static contents of a measurement set, e.g. DATA and UVW, from one location (e.g. a read-only s3 bucket) and the mutable contents such as FLAG from another location. This may make it possible to implement a basic versioning system in which we create proxy datasets which hold some (mutable) data, but which point back at some parent object from which the remaining data can be retrieved.
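The idea of a proxy dataset overriding part of its parent can be sketched with plain `xarray`, independent of how this PR implements it. The two in-memory datasets below are stand-ins for the read-only and mutable stores; their shapes and contents are made up for the example.

```python
import numpy as np
import xarray as xr

# Stand-in for the static, read-only source (e.g. an s3 bucket).
parent = xr.Dataset({
    "DATA": (("row", "chan"), np.ones((4, 2), dtype=np.complex64)),
    "UVW": (("row", "uvw"), np.zeros((4, 3))),
    "FLAG": (("row", "chan"), np.zeros((4, 2), dtype=bool)),
})

# Stand-in for the proxy, which holds only the mutable data.
child = xr.Dataset({
    "FLAG": (("row", "chan"), np.ones((4, 2), dtype=bool)),
})

# compat="override" skips the equality check on conflicting variables and
# keeps the version from the first dataset listed, so the child's FLAG wins
# while DATA and UVW fall through to the parent.
merged = xr.merge([child, parent], compat="override")
```

Listing the child first is what gives it priority here; a longer chain of proxies could be handled the same way by ordering the datasets from newest to oldest.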