opendatacube / odc-stac

Load STAC items into xarray Datasets.
Apache License 2.0

Per-band dtype specification not working #155

Closed cordmaur closed 5 months ago

cordmaur commented 5 months ago

In the documentation of the load function, it says that the dtype can be specified per band.

    :param dtype:
       Force output dtype, can be specified per band

The signature is dtype: Union[DTypeLike, Dict[str, DTypeLike], None] = None, so I'm passing a dict like:

{'AOT': 'uint16', 'B01': 'uint16', ...., 'SCL': 'uint8'}

However, I'm getting an error: ValueError: entry not a 2- or 3- tuple

Investigating the code, the error seems to come from the call _resolve_chunk_shape(len(tss), gbox, chunks, dtype). normalize_chunks expects a single dtype and tries to cast whatever it receives to an np.dtype. When it receives the per-band dict, that cast raises the error.
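The failure can be reproduced outside odc-stac. The following is a minimal sketch, assuming the per-band dict reaches np.dtype unchanged, as the description above suggests. NumPy treats a dict as a structured-dtype field specification, so a plain band -> dtype-name mapping is rejected. The band names and the try_cast helper are illustrative only.

```python
import numpy as np

# Per-band spec in the shape described in the report (band names are illustrative).
per_band = {"AOT": "uint16", "B01": "uint16", "SCL": "uint8"}


def try_cast(spec):
    """Return (dtype, None) on success or (None, error) if NumPy rejects the spec."""
    try:
        return np.dtype(spec), None
    except (TypeError, ValueError) as exc:
        return None, exc


single, single_err = try_cast("uint16")  # a single dtype-like is accepted
mapped, mapped_err = try_cast(per_band)  # a band -> dtype mapping is not

print(single, single_err)
print(mapped, type(mapped_err).__name__, mapped_err)
```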

I could work around this issue by hard-coding a single dtype in the call to _resolve_chunk_shape (fig below), but I don't know what this "chunk dtype" is meant to represent, so I can't propose a final fix in a PR.

image

My odc.stac is version 0.3.9.

Kirill888 commented 5 months ago

@cordmaur thanks for the report, can you please provide a full error trace and the calling code?

Kirill888 commented 5 months ago

Alright, so the problem is here:

https://github.com/opendatacube/odc-stac/blob/69bdf64a36f95d346742166e9861b852c9b23e63/odc/stac/_stac_load.py#L414-L422

_resolve_chunk_shape should be called with the largest dtype across all bands, not with the dtype from the user configuration, which might be per-band or not present at all. One can also pass dtype=None, since the value is only used by Dask to resolve "auto" chunks.

There is currently a major refactor under way, so I'm not sure whether this will be addressed before that merges.
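To illustrate what "largest dtype across all bands" could mean in practice, here is a rough sketch, not the actual odc-stac fix. The helper name largest_dtype is hypothetical, and it assumes "largest" means the greatest item size in bytes, which is what matters when estimating chunk sizes.

```python
import numpy as np


def largest_dtype(dtype):
    """Reduce a dtype spec to one dtype usable for chunk-size estimation.

    `dtype` may be None, a single dtype-like, or a band -> dtype-like mapping.
    Hypothetical helper, not part of odc-stac.
    """
    if dtype is None:
        return None
    if isinstance(dtype, dict):
        if not dtype:
            return None
        # Compare per-band dtypes by the number of bytes per element.
        return max((np.dtype(dt) for dt in dtype.values()), key=lambda dt: dt.itemsize)
    return np.dtype(dtype)


print(largest_dtype({"AOT": "uint16", "B01": "uint16", "SCL": "uint8"}))
print(largest_dtype("float32"))
print(largest_dtype(None))
```

The single value returned here is the kind of argument that could be handed to the chunk-resolution step in place of the raw user-supplied dict.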

Kirill888 commented 5 months ago

In the meantime you can use stac_cfg= to patch data type information missing from the STAC source, something like:

sentinel-2-l2a:  #< or whatever collection you are loading
    assets:
      "*": {data_type: uint16, nodata: 0}
      SCL: {data_type: uint8, nodata: 0}
      visual: {data_type: uint8, nodata: 0}

but as a Python dict, not a YAML string.

This is a mechanism to patch missing raster extension metadata: https://github.com/stac-extensions/raster

cordmaur commented 5 months ago

@Kirill888, thank you for the clarification.

The suggested snippet solved the problem:

stac_cfg = {
    "sentinel-2-l2a": {
        "assets": {
            "*": {"data_type": "uint16", "nodata": 0},
            "SCL": {"data_type": "uint8", "nodata": 0},
            "visual": {"data_type": "uint8", "nodata": 0},
        }
    }
}
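For anyone adapting this configuration, the sketch below shows how the entries are intended to combine, assuming "*" supplies defaults for every asset and a named entry such as SCL overrides them. The asset_config function is a hypothetical illustration of that reading, not odc-stac's own lookup code.

```python
stac_cfg = {
    "sentinel-2-l2a": {
        "assets": {
            "*": {"data_type": "uint16", "nodata": 0},
            "SCL": {"data_type": "uint8", "nodata": 0},
            "visual": {"data_type": "uint8", "nodata": 0},
        }
    }
}


def asset_config(cfg, collection, asset):
    """Merge the "*" defaults with the asset-specific entry (hypothetical helper)."""
    assets = cfg.get(collection, {}).get("assets", {})
    merged = dict(assets.get("*", {}))    # start from the wildcard defaults
    merged.update(assets.get(asset, {}))  # named asset settings take precedence
    return merged


for band in ("B04", "SCL", "visual"):
    print(band, asset_config(stac_cfg, "sentinel-2-l2a", band))
```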

I'm closing the issue.