opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0

Adding a `dc.save()` feature #467

Closed. omad closed this issue 3 years ago.

omad commented 6 years ago

Use Case

When experimenting with data loading from a Data Cube, users need to be able to save `xarray.Dataset` objects back into an Index for use in future analyses.

Starting Point

@petewa has an initial implementation available in the csiro/execution-engine branch with an API that looks like:


    ds = DatacubeSave(dc)
    ds.save(nbar, 'my_bucket', 's3aio', 'dcsave_mydata', 'eo',
            chunking={'time': 1, 'x': 3, 'y': 3})
    ds.save(nbar, '/home/ubuntu/data/output', 's3aio_test', 'dcsave_mydata', 'eo',
            chunking={'time': 1, 'x': 3, 'y': 3})
    ds.save(nbar, '/home/ubuntu/data/output', 'NetCDF CF', 'dcsave_mydata', 'eo',
            chunking={'time': 1, 'x': 4, 'y': 4})

This is a good starting point for implementing a simple save function.
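To make the shape of such a save function concrete, here is a minimal, self-contained sketch of the write-then-index flow it implies: serialise the data to a storage unit, then record its location and metadata in an index so a future load can find it. Everything here (`InMemoryIndex`, the JSON "writer", the `save` signature) is an illustrative stand-in, not the actual datacube API.

```python
import json
import tempfile
from pathlib import Path

class InMemoryIndex:
    """Stand-in for a Data Cube index: maps product name -> dataset records."""
    def __init__(self):
        self.records = {}

    def add(self, product, record):
        self.records.setdefault(product, []).append(record)

def save(data, output_dir, driver, product, metadata_type, index, chunking=None):
    """Write `data` (here just a dict of band -> values) and index the result."""
    out = Path(output_dir) / f"{product}.{driver}.json"
    out.write_text(json.dumps(data))  # placeholder for a NetCDF CF / s3aio writer
    index.add(product, {
        "path": str(out),
        "metadata_type": metadata_type,
        "chunking": chunking or {},
    })
    return out

index = InMemoryIndex()
with tempfile.TemporaryDirectory() as tmp:
    save({"red": [1, 2, 3]}, tmp, "netcdf", "dcsave_mydata", "eo",
         index, chunking={"time": 1, "x": 3, "y": 3})
print(index.records["dcsave_mydata"][0]["metadata_type"])  # the record is queryable
```

The point of the separation is that swapping the writer (local NetCDF vs. s3aio) should not change what gets recorded in the index.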

In discussion with @Kirill888, there are a few additions and changes we would propose:

Potential Problems

Kirill888 commented 6 years ago

Another potential area of concern is dealing with "lazy datasets", i.e. dask arrays. Some kind of tiling might be required, essentially the same thing ingest does. Once this feature is available, people will want to apply it to the entire DB, so we will need to make that easy, and put the actual "workflow" into the GridWorkflow class. That doesn't mean we have to address it right away, but I can guarantee it will be requested next.
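The tiling being described can be sketched in a few lines: split each dimension into chunks of the requested size and yield one set of slices per tile, so each tile of a lazy dataset can be computed and written independently. This is an illustrative stand-in (pure stdlib, no dask), not datacube's ingest or GridWorkflow implementation; `tile_slices` is a hypothetical helper.

```python
from itertools import product

def tile_slices(shape, chunking):
    """Yield dicts of dim -> slice covering `shape` in `chunking`-sized tiles.

    shape:    dict of dim name -> total length, e.g. {'time': 2, 'x': 7}
    chunking: dict of dim name -> tile length,  e.g. {'time': 1, 'x': 3}
    Dims missing from `chunking` become a single full-length tile.
    """
    dims = list(shape)
    per_dim = []
    for dim in dims:
        step = chunking.get(dim, shape[dim])
        per_dim.append([slice(i, min(i + step, shape[dim]))
                        for i in range(0, shape[dim], step)])
    for combo in product(*per_dim):
        yield dict(zip(dims, combo))

tiles = list(tile_slices({"time": 2, "x": 7, "y": 4},
                         {"time": 1, "x": 3, "y": 3}))
print(len(tiles))  # 2 * ceil(7/3) * ceil(4/3) = 2 * 3 * 2 = 12 tiles
```

A driver could then loop over `tiles`, compute each slice of the dask-backed array, and write it as one storage unit, which keeps peak memory bounded by the tile size.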

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.