opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0
505 stars 176 forks

Dataset.measurements is a list of dicts not a dict #954

Closed kieranricardo closed 1 year ago

kieranricardo commented 4 years ago

Hi there! While trying to index some data, I got an error where datacube calls keys on a list. The issue looks to be that Dataset.metadata_doc['measurements'] stores a list of dicts in my case, which gets passed through to Dataset.measurements and causes an error when check_dataset_consistent tries to check the measurements.

The fix seems fairly straightforward, just changing the code in Dataset.measurements to:

    @property
    def measurements(self) -> Dict[str, Any]:
        # It's an optional field in documents.
        # Dictionary of key -> measurement descriptor
        if not hasattr(self.metadata, 'measurements'):
            return {}
        return self.metadata.measurements[0]

Worked for me. My example only has one measurement, so there might be extra work needed to handle multiple measurements. Or maybe my product document is invalid and this syntax error could be caught earlier on. I'm pretty keen to open a PR if a fix is appropriate!
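A more general sketch, in case multiple measurements need handling (normalise_measurements is just an illustrative helper, not part of the datacube API):

```python
from typing import Any, Dict, List, Union


def normalise_measurements(raw: Union[Dict[str, Any], List[Dict[str, Any]]]) -> Dict[str, Any]:
    """Coerce a measurements section into a dict keyed by measurement name.

    Handles three shapes (illustrative only):
      - already a dict:            {'prock': {...}}
      - list of single-key dicts:  [{'prock': {...}}]
      - list of named dicts:       [{'name': 'prock', ...}]
    """
    if isinstance(raw, dict):
        return raw
    out: Dict[str, Any] = {}
    for entry in raw:
        if 'name' in entry:
            # Product-style entry: lift 'name' out, keep the rest as descriptor
            out[entry['name']] = {k: v for k, v in entry.items() if k != 'name'}
        elif len(entry) == 1:
            # Single-key mapping: the key is the measurement name
            ((name, descriptor),) = entry.items()
            out[name] = descriptor
        else:
            raise ValueError(f"Can't infer measurement name from: {entry!r}")
    return out
```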

Expected behaviour

Dataset.measurements stores a dictionary and datacube dataset add ... works.

Actual behaviour

Traceback (most recent call last):
  File "/Users/kieranricardo/anaconda3/envs/odc/bin/datacube", line 10, in <module>
    sys.exit(cli())
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/ui/click.py", line 197, in new_func
    return f(parsed_config, *args, **kwargs)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/ui/click.py", line 229, in with_index
    return f(index, *args, **kwargs)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/scripts/dataset.py", line 178, in index_cmd
    run_it(pp)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/scripts/dataset.py", line 173, in run_it
    dry_run=dry_run)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/scripts/dataset.py", line 184, in index_datasets
    for dataset in dss:
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/scripts/dataset.py", line 49, in dataset_stream
    dataset, err = ds_resolve(ds, uri)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/index/hl.py", line 277, in __call__
    is_consistent, reason = check_dataset_consistent(dataset)
  File "/Users/kieranricardo/anaconda3/envs/odc/lib/python3.6/site-packages/datacube/index/hl.py", line 103, in check_dataset_consistent
    if not product_measurements.issubset(dataset.measurements.keys()):
AttributeError: 'list' object has no attribute 'keys'

Steps to reproduce the behaviour

Using my product and dataset files locally I can reproduce this error by:

Environment information

Extra info:

Edit: Updated with more accurate info

Kirill888 commented 4 years ago

@kieranricardo can you please provide a sample of your dataset and product yamls?

What is most likely happening is that your dataset yaml has measurements defined as a list, while datacube expects a dictionary mapping band name to a dict describing band files. It is a bit confusing, since the product definition needs to have measurements defined as a list 🤦.

Unfortunately dataset add doesn't validate the yaml document thoroughly enough to report this error at index time, so you get a run-time error instead.

https://datacube-core.readthedocs.io/en/latest/ops/dataset_documents.html https://datacube-core.readthedocs.io/en/latest/ops/product.html
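The kind of early check I mean would be something along these lines (a sketch only; check_measurements_shape is not an existing datacube function):

```python
from collections.abc import Mapping


def check_measurements_shape(doc: dict) -> None:
    """Fail fast with a clear message if a dataset document uses the
    product-style list form for `measurements` instead of a mapping."""
    measurements = doc.get('measurements')
    if measurements is None:
        return  # measurements is an optional field
    if not isinstance(measurements, Mapping):
        raise ValueError(
            "Dataset document 'measurements' must be a mapping of "
            "band name -> descriptor (the list form is only valid in "
            "product definitions)")
```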

kieranricardo commented 4 years ago

@Kirill888 thanks for the speedy reply! Ah yep, I'm using lists in both my product and dataset yamls 🤦 I was following the first example here: https://datacube-core.readthedocs.io/en/latest/ops/indexing.html

Here's my product yml for reference:

name: prock
description: Outer Darwin Harbour Marine Survey 2015 p-rock (probability of rock) grid
metadata_type: eo3
license: Creative

metadata:
    format:
        name: GeoTIFF

measurements:
    - name: prock
      dtype: uint8
      nodata: NaN
      units: 'probability'

And my dataset yaml:

# UUID of the dataset
id: f884df9b-4458-47fd-a9d2-1a52a2db8a1a
$schema: 'https://schemas.opendatacube.org/dataset'

# Product name
product:
  name: prock

format:
  name: GeoTIFF

crs: "epsg:32752"
grids:
    default:
       shape: [5216, 8827]
       transform: [6.64741850989687, 0.0, 661456.1121158252, 0.0, -6.647669545339101, 8659755.72833947, 0.0, 0.0, 1.0]

measurements:
   - prock:
      grid: "default"
      path: "prock6.tif"

   - dummy:
       grid: "default"
       path: "prock6.tif"

# Timestamp is the only compulsory field here
properties:
  # ODC specific "extensions"
  odc:processing_datetime: 2020-02-02T08:10:00.000Z

# Lineage only references UUIDs of direct source datasets
# Mapping name:str -> [UUID]
lineage: {}  # set to empty object if no lineage is defined
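Edit: for anyone landing here later, the mapping form that datacube expects in the dataset document would presumably look like this (sketch, same files as above):

```yaml
measurements:
  prock:
    grid: "default"
    path: "prock6.tif"
  dummy:
    grid: "default"
    path: "prock6.tif"
```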
kieranricardo commented 4 years ago

@Kirill888 is there a nice way to delete product definitions? So far I've resorted to manually deleting rows from postgres.

Kirill888 commented 4 years ago

@kieranricardo changing data in place is a bit of a sore point in datacube, and not really well supported. The DB layer basically assumes append-only operations, or "approximately append only".

You CAN modify products and dataset documents in place, but you need to supply extra command-line flags to allow "unsafe changes". There is no delete functionality for anything; there is dataset archive, but that's not what you want in this case.

Some of those limitations come from the lineage-tracking functionality: deleting a dataset that is referenced by a derived dataset should not be allowed. We should allow deletion of datasets that are not referenced by anyone, but that is currently not implemented.

kieranricardo commented 4 years ago

Thanks, the "unsafe changes" flag is what I was looking for! Although it would be nice to be able to safely update a product/dataset in place if it isn't being referenced. Would you be open to PRs implementing this?

Kirill888 commented 4 years ago

I can't say I fully comprehend where the boundary between safe and unsafe changes lies for product definitions and dataset documents. I also suspect that the boundary depends on context that cannot be captured by the database itself. For example, if you are still in the "bootstrapping stages" and haven't started using the database, any change that maintains database consistency rules should be OK. If, however, this is a large installation with a long history of use, then the situation is very different.

I believe the current definition of "unsafe", as captured by the implementation (I can't really point to any documents on that), applies more to the second case and errs on the side of caution. So you probably should not worry about "unsafe" changes too much.

Having said that, PRs are welcome. In particular, tooling for "undo" operations, which are so handy in early development stages but are missing: things like "delete dataset", "delete datasets that match certain criteria", "delete product and all its datasets". Those are relatively straightforward on the SQL side, but there might be complications due to the db abstraction layer in datacube-core.

It's probably easiest to start at the "SQL layer", i.e. assuming that the db structure is fixed (it kinda is) and going from there. For this kind of work I recommend doing it here: https://github.com/opendatacube/odc-tools/tree/master/libs/index/odc/index rather than in the datacube-core repo itself.

Kirill888 commented 4 years ago

@kieranricardo by the way, you should be using just datetime: ... and not odc:processing_datetime: ... to specify the timestamp. The latter is for "dataset generation time", but what you really need to supply to datacube is "what time were the pixels captured at", and that goes into the datetime key; or, if it's a time range, into dtr:start_datetime: and dtr:end_datetime:.
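In yaml terms, something like this (the dates here are placeholders):

```yaml
properties:
  # acquisition time of the pixels (what datacube searches on)
  datetime: 2015-06-01T00:00:00.000Z
  # or, for a time range:
  # dtr:start_datetime: 2015-06-01T00:00:00.000Z
  # dtr:end_datetime: 2015-06-30T23:59:59.999Z
  # generation time of this dataset (optional)
  odc:processing_datetime: 2020-02-02T08:10:00.000Z
```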

kieranricardo commented 4 years ago

thanks for your help @Kirill888. I'll make an issue (if there isn't one already) and a PR for some delete tooling in https://github.com/opendatacube/odc-tools/tree/master/libs/index/odc/index.

One other question: I'm having trouble specifying lineage in eo3 format. There are examples in the docs for eo lineage but I couldn't find any for eo3. I'm trying to add the dataset document I shared to the lineage of another dataset like so:

lineage: {"parent": ["f884df9b-4458-47fd-a9d2-1a52a2db8a1a"]}

But I get:

ERROR Inconsistent lineage dataset f884df9b-4458-47fd-a9d2-1a52a2db8a1a
> $schema: missing!='https://schemas.opendatacube.org/dataset', crs: missing!='epsg:32752', extent: missing!={'lat': {'end': -12.116431349076615, 'begin': -12.433299809465927}, 'lon': {'end': 131.02506209614575, 'begin': 130.48367254613058}}, format: missing!={'name': 'GeoTIFF'}, grid_spatial: missing!={'projection': {'geo_ref_points': {'ll': {'x': 661456.1121158252, 'y': 8625081.483990982}, 'lr': {'x': 720132.8753026848, 'y': 8625081.483990982}, 'ul': {'x': 661456.1121158252, 'y': 8659755.72833947}, 'ur': {'x': 720132.8753026848, 'y': 8659755.72833947}}, 'spatial_reference': 'epsg:32752'}}, grids: missing!={'default': {'shape': [5216, 8827], 'transform': [6.64741850989687, 0.0, 661456.1121158252, 0.0, -6.647669545339101, 8659755.72833947, 0.0, 0.0, 1.0]}}, measurements: missing!={'prock': {'grid': 'default', 'path': 'prock6.tif'}}, product: missing!={'name': 'prock'}, properties: missing!={'odc:processing_datetime': '2020-02-02T08:10:00.000Z'}

Do you know what's going on here?

Kirill888 commented 4 years ago

@kieranricardo you're doing the right thing with respect to lineage. The code should be smarter when dealing with EO3, though. You need to use datacube dataset add --no-verify-lineage ... when indexing EO3; essentially, since EO3 doesn't include the lineage document in the derived document yaml, there is nothing to verify.

We should update the docs, or better, auto-skip the verification step in code when the source dataset is EO3.

EO3 is a kind of bolt-on intermediate step, and a very recent addition, so...

kieranricardo commented 4 years ago

better auto skip verification step when source dataset is EO3 in code

This would be nice! I'll make an issue for it.

kieranricardo commented 4 years ago

filed https://github.com/opendatacube/datacube-core/issues/956

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

omad commented 1 year ago

We think this has been resolved.