ratt-ru / dask-ms

Implementation of a dask/xarray dataset backed by a CASA MS
https://dask-ms.readthedocs.io

Dataset versioning #235

Open sjperkins opened 2 years ago

sjperkins commented 2 years ago

Support Dataset Versioning. Briefly:

  1. Datasets should provide a unified view over multiple datasets on disk.
  2. This will allow users to iteratively reduce their data.

sjperkins commented 2 years ago

Three approaches thus far

1. Store Link to parent dataset in metadata

Each dataset can store a link to the parent dataset in metadata

2. Store links to all previous datasets in metadata

Slight modification to 1. I prefer this as it provides a less brittle way to specify dataset concatenation.

3. Store all datasets in a single directory.

Each dataset is stored in a sub-directory of a parent directory. The advantage of this approach is that there's less conceptual fragmentation compared to the first two approaches where url links may or may not exist.
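
To make the difference between approaches 1 and 2 concrete, here is a minimal sketch using plain dictionaries as a stand-in for on-disk metadata. The `parent` and `history` keys and the `v0`/`v1`/`v2` names are hypothetical and are not part of dask-ms.

```python
# Hypothetical stand-in for per-dataset metadata (not a dask-ms structure).
# Approach 1: each dataset only knows its immediate parent.
linked = {
    "v2": {"parent": "v1"},
    "v1": {"parent": "v0"},
    "v0": {"parent": None},
}

# Approach 2: the newest dataset lists every ancestor itself.
full = {"v2": {"history": ["v1", "v0"]}}


def chain_from_parent_links(metadata, name):
    """Follow parent links one hop at a time (approach 1)."""
    chain = []
    while name is not None:
        chain.append(name)
        # Raises KeyError if an intermediate dataset's metadata is missing.
        name = metadata[name]["parent"]
    return chain


def chain_from_history(metadata, name):
    """Read the whole chain from a single dataset's metadata (approach 2)."""
    return [name] + list(metadata[name]["history"])


chain1 = chain_from_parent_links(linked, "v2")
chain2 = chain_from_history(full, "v2")
```

Both calls yield the same chain, but approach 2 needs only the newest dataset's metadata to be readable, which is one way to read "less brittle".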

sjperkins commented 2 years ago

Something to watch out for: if we're creating multiple Measurement Sets, we may end up creating empty required columns (TIME, ANTENNA1, ANTENNA2, UVW) when we only want to store a new version of, e.g., CORRECTED_DATA and FLAG.

/cc @o-smirnov @JSKenyon @landmanbester

o-smirnov commented 2 years ago

I vote #2!

Something to watch out for: if we're creating multiple Measurement Sets, we may end up creating empty required columns (TIME, ANTENNA1, ANTENNA2, UVW) when we only want to store a new version of, e.g., CORRECTED_DATA and FLAG.

Well that's exactly the idea, isn't it... new columns live in the "new" version, and all other columns live in the "old" version. So the "new" version is not self-contained without the old dataset also available, but that's by design.

There's another use case for this that feels very related, and that is SSD caching. We've garnered new appreciation recently for how slow MS reads are (see https://github.com/ratt-ru/systems/issues/82), and I'll add SSD scratch filesystems to a few nodes to address this. The problem then is managing the scratch space, since it is necessarily much smaller. I think dask-ms-based tools can be made to be much more cache-friendly by implementing the following logic:

  • A "fast new" version of the MS is created on SSD space, with all columns empty, linking back to the "old slow" MS on HDD. Some columns (e.g. DATA, CORRECTED_DATA) are marked as "caching". There should be a command-line tool for this.
  • When dask-ms reads the fast MS, non-caching columns are pulled straight from the slow version, while caching columns are first copied from the slow version to the fast version.
  • Likewise for writes -- caching columns are written to the fast version only, non-caching columns go straight back to the slow version.
  • A command-line tool can look for "caching columns older than X days" or "caching columns not accessed for Y days", write them back to the slow version, and delete them from the fast version. This can be put in a cron job.

I think this can provide a very smooth user experience for dask-ms-based tools. If you have access to fast disk, you create your "fast MS" linking back to your original slow MS, and as long as you're actively using it, it stays on fast disk -- and if you stop molesting the data for a while, it makes its way back to slow storage eventually without you or the sysadmin worrying too much about it.

(This will also be very useful in the cloud context -- I think things like S3 storage come in hierarchies of access speed...)

Thoughts?

sjperkins commented 2 years ago

I vote #2!

Yep that's my vote too

Something to watch out for: if we're creating multiple Measurement Sets, we may end up creating empty required columns (TIME, ANTENNA1, ANTENNA2, UVW) when we only want to store a new version of, e.g., CORRECTED_DATA and FLAG.

Well that's exactly the idea, isn't it... new columns live in the "new" version, and all other columns live in the "old" version. So the "new" version is not self-contained without the old dataset also available, but that's by design.

I'm more concerned by the nature of the CASA Measurement Set, in the sense that if you create an MS, say with pyrap.tables.default_ms, all the other required columns are also created, filled with zeros if the columns have a Tiled Storage Manager, or empty otherwise. Thus, creating a secondary MS holding new CORRECTED_DATA and FLAG columns will also create empty/zeroed TIME, ANTENNA1, ANTENNA2, UVW columns which will, annoyingly, override the correct values in the previous MS.

To work around this, it should be possible to just create the new versions as plain CASA tables (which won't have required columns). Or perhaps modify the MS descriptor to just require the new columns... This is all workable but I also don't want things to get too complex.
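
A small sketch of the override problem, with plain dictionaries standing in for the two datasets. The `overlay` helper and its `owned` argument are hypothetical and only illustrate the idea of a delta declaring which columns it actually provides.

```python
def overlay(base, delta, owned):
    """Merge two column mappings: columns named in `owned` come from the
    delta, everything else comes from the base dataset."""
    merged = dict(base)
    for name in owned:
        merged[name] = delta[name]
    return merged


base = {"TIME": [1.0, 2.0], "ANTENNA1": [0, 1], "FLAG": [False, False]}

# A delta created as a full MS also carries zero-filled required columns.
delta = {"TIME": [0.0, 0.0], "ANTENNA1": [0, 0], "FLAG": [True, False]}

# Naive "newest wins" merge: the zeroed TIME clobbers the real values.
naive = {**base, **delta}

# Restricting the delta to the columns it owns keeps TIME intact.
view = overlay(base, delta, owned=["FLAG"])
```

Creating the delta as a plain table, as suggested above, would have the same effect as `owned` here because the zeroed columns would never exist in the first place.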

sjperkins commented 2 years ago

The zarr and parquet formats don't have these issues because there is no distinction between a plain dataset and an MS dataset.

sjperkins commented 2 years ago

(This will also be very useful in the cloud context -- I think things like S3 storage come in hierarchies of access speed...)

Yep, these are all exciting suggestions. In fact, one thought that comes to mind is datasets composed of CASA tables for the original and zarr/parquet tables for the deltas (fast versions).

There's another use case for this that feels very related, and that is SSD caching. We've garnered new appreciation recently for how slow MS reads are (see ratt-ru/systems#82), and I'll add SSD scratch filesystems to a few nodes to address this. The problem then is managing the scratch space, since it is necessarily much smaller. I think dask-ms-based tools can be made to be much more cache-friendly by implementing the following logic:

  • A "fast new" version of the MS is created on SSD space, with all columns empty, linking back to the "old slow" MS on HDD. Some columns (e.g. DATA, CORRECTED_DATA) are marked as "caching". There should be a command-line tool for this.
  • When dask-ms reads the fast MS, non-caching columns are pulled straight from the slow version, while caching columns are first copied from the slow version to the fast version.
  • Likewise for writes -- caching columns are written to the fast version only, non-caching columns go straight back to the slow version.
  • A command-line tool can look for "caching columns older than X days" or "caching columns not accessed for Y days", write them back to the slow version, and delete them from the fast version. This can be put in a cron job.
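
The caching logic in the list above could be sketched roughly as follows. This is an in-memory toy under stated assumptions: the `CachedColumns` class, its method names and the `now` parameter are all hypothetical, and dictionaries stand in for the slow (HDD) and fast (SSD) datasets.

```python
import time


class CachedColumns:
    """Toy model of a fast dataset that links back to a slow one."""

    def __init__(self, slow, caching):
        self.slow = slow              # column name -> data on slow storage
        self.fast = {}                # caching columns currently on fast storage
        self.caching = set(caching)   # columns marked as "caching"
        self.last_access = {}         # column name -> last access timestamp

    def _touch(self, name, now):
        self.last_access[name] = time.time() if now is None else now

    def read(self, name, now=None):
        if name not in self.caching:
            return self.slow[name]    # non-caching: straight from slow storage
        if name not in self.fast:
            self.fast[name] = list(self.slow[name])  # copy on first read
        self._touch(name, now)
        return self.fast[name]

    def write(self, name, data, now=None):
        if name in self.caching:
            self.fast[name] = list(data)  # caching: fast version only
            self._touch(name, now)
        else:
            self.slow[name] = list(data)  # non-caching: straight to slow

    def evict(self, max_idle, now=None):
        """Write back and drop caching columns idle for more than max_idle."""
        now = time.time() if now is None else now
        evicted = []
        for name in sorted(self.fast):
            if now - self.last_access[name] > max_idle:
                self.slow[name] = self.fast.pop(name)
                del self.last_access[name]
                evicted.append(name)
        return evicted


slow = {"DATA": [1, 2], "TIME": [10.0, 20.0]}
cache = CachedColumns(slow, caching=["DATA"])
cache.read("DATA", now=0.0)
cache.write("DATA", [3, 4], now=1.0)
```

Calling `cache.evict(max_idle=...)` periodically plays the role of the cron job: idle caching columns are written back to the slow dataset and removed from the fast one.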

I think this can provide a very smooth user experience for dask-ms-based tools. If you have access to fast disk, you create your "fast MS" linking back to your original slow MS, and as long as you're actively using it, it stays on fast disk -- and if you stop molesting the data for a while, it makes its way back to slow storage eventually without you or the sysadmin worrying too much about it.

Yep, this is all sensible stuff. I would point out that this is the kind of functionality that a datalake implements -- I'm somewhat wary of reinventing the wheel here, even though it may be fun to roll our own stuff.

sjperkins commented 1 year ago

One strategy that occurred to me was to introduce a new MS indexing column (PROVENANCE_ID or repurposing an existing column) storing an integer for each row. Each entry would index a list of previous on-disk datasets, stored in a separate PROVENANCE subtable or in table/column metadata.
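
As a rough illustration of the per-row index, assuming plain Python lists in place of real columns. The dataset names, the `resolve_column` helper and the data are all made up for the example.

```python
# Hypothetical provenance list: position in the list is the PROVENANCE_ID.
provenance = ["original", "calibrated"]

# Stand-ins for the on-disk datasets, each holding a full-length FLAG column.
datasets = {
    "original": {"FLAG": [False, False, False, False]},
    "calibrated": {"FLAG": [True, True, True, True]},
}

# One integer per row, indexing into `provenance`.
provenance_id = [0, 1, 1, 0]


def resolve_column(name, provenance_id, provenance, datasets):
    """Gather each row of a column from the dataset its PROVENANCE_ID names."""
    return [
        datasets[provenance[pid]][name][row]
        for row, pid in enumerate(provenance_id)
    ]


flags = resolve_column("FLAG", provenance_id, provenance, datasets)
```

Rows 1 and 2 are read from the "calibrated" dataset and rows 0 and 3 from the "original" one.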

/cc @JSKenyon @landmanbester @bennahugo

sjperkins commented 1 year ago

I also think that averaging should result in discarding previous provenance information, because a one-to-one mapping no longer exists.

sjperkins commented 1 year ago

Posting @o-smirnov's chat discussion here:

From a user's PoV, the simplest thing for flagversions would be:

  1. QuartiCal writes to the FLAG column (of the current delta-set)
  2. A command-line tool is used to say "I would like to save this version of FLAG content under the name 'foo'"
  3. The next time QuartiCal writes to FLAG, the current content of FLAG is tucked away under the label "foo" somehow, and new content is written in. So copy-on-write semantics

Then the restore operation is as simple as saying "restore FLAG:foo"
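
A minimal sketch of the copy-on-write semantics described in the steps above, using an in-memory list for the FLAG column. The `VersionedColumn` class and its `tag`, `write` and `restore` methods are hypothetical names, not an existing dask-ms or QuartiCal interface.

```python
class VersionedColumn:
    """Toy column with labelled versions and copy-on-write saving."""

    def __init__(self, data):
        self.current = list(data)
        self.saved = {}            # label -> preserved content
        self.pending_label = None  # label requested but not yet materialised

    def tag(self, label):
        """Step 2: ask for the current content to be kept under `label`."""
        self.pending_label = label

    def write(self, data):
        """Step 3: tuck the tagged content away only when it is overwritten."""
        if self.pending_label is not None:
            self.saved[self.pending_label] = self.current
            self.pending_label = None
        self.current = list(data)

    def restore(self, label):
        if label == self.pending_label:
            return  # tagged content has not been overwritten yet
        self.current = list(self.saved[label])


flag = VersionedColumn([False, False])
flag.write([True, False])   # first calibration run writes FLAG
flag.tag("foo")             # user saves this version as 'foo'
flag.write([True, True])    # next run triggers the copy-on-write
flag.restore("foo")         # "restore FLAG:foo"
```

Nothing is copied at tagging time; the old content is only preserved when the next write would otherwise destroy it.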

sjperkins commented 1 year ago

Here are my current thoughts for a row-granularity data provenance mechanism:

Provenance Sub-table

id  name     url
0   archive  s3://ratt-public-data/ESO137/archive.zarr
1   1gc      s3://ratt-public-data/ESO137/1gc.zarr
2   3gc      s3://ratt-public-data/ESO137/1gc.zarr
3   imaging  s3://ratt-public-data/ESO137/imaging.zarr

PROVENANCE_ID column

Add provenance keyword to xds_from_ and xds_to_ methods

sjperkins commented 1 year ago

I'm leaning towards ditching the PROVENANCE_ID column and making it a requirement that provenance chains must have a one-to-one mapping in terms of data sizes. Would this be too restrictive for data processing purposes?

sjperkins commented 1 year ago

I'm leaning towards ditching the PROVENANCE_ID column and making it a requirement that provenance chains must have a one-to-one mapping in terms of data sizes. Would this be too restrictive for data processing purposes?

If a new data size is produced (by averaging) for example, a completely new provenance chain, and hence, sub-table would be created.

o-smirnov commented 1 year ago

I'm leaning towards ditching the PROVENANCE_ID column and making it a requirement that provenance chains must have a one-to-one mapping in terms of data sizes. Would this be too restrictive for data processing purposes?

Yes. A very common procedure is to split out calibrators and targets (different FIELD_IDs). It would be very nice if chains supported this, making this split-out effectively a no-op (in terms of storage used).

If a new data size is produced (by averaging) for example, a completely new provenance chain, and hence, sub-table would be created.

I think averaging makes a completely new dataset anyway, no, so there's no chain to speak of?

o-smirnov commented 1 year ago

  • One of my concerns with this approach is that creating dask arrays from many partial sources can end up being messy and interfere with chunking mechanisms.

Yes I see that concern. My current feeling is that a per-row PROVENANCE_ID is certainly the most generic way, but is also overkill for most applications.

I think what would be sufficient would be for a delta-column to represent a fixed (row-based) subset of the parent column. Can we do it without full-on row-level granularity somehow? Let's keep on thinking...
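
One possible reading of "a fixed row-based subset", sketched with plain lists. The `SubsetView` class and `select_rows` helper are hypothetical; the point is only that the row selection is stored once per delta rather than as an ID on every row.

```python
def select_rows(field_id, wanted):
    """Row numbers of the parent whose FIELD_ID equals `wanted`."""
    return [row for row, fid in enumerate(field_id) if fid == wanted]


class SubsetView:
    """Delta column covering a fixed subset of a parent column's rows."""

    def __init__(self, parent, rows):
        self.parent = parent     # parent column, never modified
        self.rows = list(rows)   # fixed row subset, stored once
        self.overrides = {}      # parent row -> new value

    def read(self):
        # Fall back to the parent for rows that were never rewritten.
        return [self.overrides.get(r, self.parent[r]) for r in self.rows]

    def write(self, values):
        if len(values) != len(self.rows):
            raise ValueError("values must match the subset length")
        self.overrides = dict(zip(self.rows, values))


parent_field_id = [0, 1, 1, 0, 2]
parent_data = [10, 11, 12, 13, 14]

# "Splitting out" FIELD_ID 1 copies no data until something is written.
target = SubsetView(parent_data, select_rows(parent_field_id, 1))
before = target.read()
target.write([0, 0])
after = target.read()
```

Until `write` is called the split-out costs only the stored row list, which is close to the "no-op in terms of storage" described for the calibrator/target split.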

sjperkins commented 1 year ago

Yes. A very common procedure is to split out calibrators and targets (different FIELD_IDs). It would be very nice if chains supported this, making this split-out effectively a no-op (in terms of storage used).

Yes, we'd need data subsets you describe below to support the above case, because the split data would be derived from completely different rows.

I think what would be sufficient would be for a delta-column to represent a fixed (row-based) subset of the parent column. Can we do it without full-on row-level granularity somehow? Let's keep on thinking...

Will do. In fact, I'm doing some reading on the Wikipedia article on Data Lineage, which seems to fall squarely into what we're planning.

sjperkins commented 1 year ago

https://discourse.pangeo.io/t/tracking-provenance-in-xarray/1510

sjperkins commented 1 year ago

o-smirnov commented 1 year ago

If performance in pathological edge cases is the only issue, then I wouldn't worry about it too much...

sjperkins commented 1 year ago

TileDB supports dataset versioning and time travelling natively.

https://docs.tiledb.com/main/background/key-concepts-and-data-format

sjperkins commented 1 year ago

If performance in pathological edge cases is the only issue, then I wouldn't worry about it too much...

Agreed that this is unlikely