octue / octue-sdk-python

The Python SDK for @Octue services and digital twins.
https://octue.com

Should we disallow cloud files in local datasets (and vice versa)? #412

Open cortadocodes opened 2 years ago

cortadocodes commented 2 years ago

We're arriving at a clearer distinction of what local and cloud datasets are:

The files in both types of dataset can have local/cloud duality but the following restrictions apply:

Should we enforce these restrictions or just advise them?

Originally posted by @cortadocodes in https://github.com/octue/octue-sdk-python/issues/364#issuecomment-1082099381
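
To make the "enforce or advise" choice concrete, here is a minimal sketch of both behaviours behind a single flag. Everything in it is hypothetical and not part of the octue SDK: `MixedStorageError`, `is_cloud_path` and `check_file_location` are illustrative names, and it assumes a cloud path can be recognised by a `gs://` prefix.

```python
import warnings


class MixedStorageError(ValueError):
    """Hypothetical error for a file whose location doesn't match its dataset's."""


def is_cloud_path(path):
    # Assumption: cloud paths are identified by a "gs://" prefix.
    return path.startswith("gs://")


def check_file_location(dataset_path, file_path, enforce=True):
    """Return True if the file and dataset are both local or both cloud.

    With enforce=True a mismatch raises; with enforce=False it only warns
    and returns False.
    """
    if is_cloud_path(dataset_path) == is_cloud_path(file_path):
        return True

    message = f"File {file_path!r} does not match the location of dataset {dataset_path!r}."

    if enforce:
        raise MixedStorageError(message)

    warnings.warn(message, UserWarning)
    return False
```

Enforcing fails fast and keeps datasets uniform, while advising keeps mixed usage working at the cost of surprises later on.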

thclark commented 2 years ago

Just dropping by with an interesting case where I added cloud datafiles to a local dataset. It pretty much worked like a charm, although I felt that add() should really have created a new instance of Datafile for the dataset, because things like exists_in_cloud were still set on the datafile after it was added.

None of this was intuitive, and it took a lot of debugging to work out that I could do it, so a more explicit pattern might be helpful.

import logging

from octue.resources import Dataset

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Complete data lakes
all_elevation_maps = Dataset(path="gs://lake-elevation-maps", recursive=True)
all_mast_timeseries = Dataset(path="gs://lake-mast-timeseries", recursive=True)
all_wind_maps = Dataset(path="gs://lake-wind-maps", recursive=True)

# Fixture files
fixture_elevation_files = [all_elevation_maps.files.one(id__contains="5229e870")]

fixture_mast_timeseries_files = [
    all_mast_timeseries.files.one(id__contains="e6afc3ea"),  # m1
    all_mast_timeseries.files.one(id__contains="739c4fdc"),  # m2
    all_mast_timeseries.files.one(id__contains="2a37e57e"),  # m3
    all_mast_timeseries.files.one(id__contains="1dbed715"),  # m4
    all_mast_timeseries.files.one(id__contains="0b216b8c"),  # lidar
]

fixture_wind_map_files = [
    all_wind_maps.files.one(id__contains="c9823f65"),  # 149m
    all_wind_maps.files.one(id__contains="2a77b636"),
    all_wind_maps.files.one(id__contains="47e16290"),
]

# Create fixture datasets
sets = {
    "tests/data/hills_of_gold/elevation_maps": fixture_elevation_files,
    "tests/data/hills_of_gold/mast_timeseries": fixture_mast_timeseries_files,
    "tests/data/hills_of_gold/wind_speed_maps": fixture_wind_map_files,
}
for path, files in sets.items():
    ds = Dataset(path=path)
    ds.update_metadata()
    for file in files:
        ds.add(file)
    for file in ds:
        file.update_local_metadata()
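
The suggestion above, that add() should give the dataset a new Datafile instance rather than reuse the cloud one, could look roughly like the sketch below. It uses hypothetical stand-in classes (`SketchDatafile`, `SketchDataset`), not the real octue `Datafile` and `Dataset`, and it assumes that resetting `exists_in_cloud` and rewriting the path is the desired behaviour for a local dataset.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SketchDatafile:
    """Hypothetical stand-in for a datafile; not the octue class."""

    path: str
    exists_in_cloud: bool = False


@dataclass
class SketchDataset:
    """Hypothetical stand-in for a local dataset; not the octue class."""

    path: str
    files: list = field(default_factory=list)

    def add(self, datafile):
        # Copy the datafile so the dataset owns a new instance and the
        # caller's original object is left untouched.
        local_copy = copy.copy(datafile)

        # Assumption: the copy lives under the dataset's path and should not
        # carry cloud-specific state over from the original.
        filename = datafile.path.rsplit("/", 1)[-1]
        local_copy.path = f"{self.path}/{filename}"
        local_copy.exists_in_cloud = False

        self.files.append(local_copy)
        return local_copy
```

With this pattern the original cloud datafile keeps `exists_in_cloud` set, while the instance held by the local dataset starts with it cleared.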