opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0

Dask load returning all nan values #1479

Closed JackLidge closed 2 months ago

JackLidge commented 1 year ago

Expected behaviour

I'm trying to load in a long timeseries with dask to compute the mean values for the area over time. Here's the code for loading the data and computing the mean without using dask:

import datacube

dc = datacube.Datacube()
query = {
    'x': [-7.43862660215137, -4.377843103769999],
    'y': [54.02144007175762, 56.83505068441397],
    'time': ['2017-01-01', '2017-02-01'],
    'output_crs': 'EPSG:32630',
    'resolution': [100, 100],
    'group_by': 'solar_day',
    # 'dask_chunks': {'time': 1}
}
rainfall = dc.load(product='cmorph_rainfall', **query)
rainfall_list = rainfall.rainfall.mean(dim=['x', 'y']).values

and rainfall_list returns:

array([1.2073916e-03, 0.0000000e+00, 1.3340176e+00, 0.0000000e+00,
       6.5343648e-01, 1.4796243e+01, 5.5638492e-01, 3.7439067e+00,
       7.6677237e+00, 3.8516700e+00, 5.1759934e-01, 1.3966684e-01,
       5.4357775e-02, 8.6581379e-01, 4.9658670e+00, 1.5595733e+00,
       1.6349301e-01, 1.8843752e-01, 1.0391021e-04, 8.2616024e-02,
       1.3581085e-01, 1.2322235e-01, 1.0261511e-04, 2.4435787e+00,
       3.7526662e+00, 0.0000000e+00, 4.2349246e-01, 3.7233862e-01,
       9.1776066e-02, 3.3019252e+00, 1.0951433e+01, 1.4380445e+00],
      dtype=float32)

Actual behaviour

When I load the data using dask and then run compute, it returns all NaNs, both for rainfall_list as a whole and when selecting an individual timestep.

dc = datacube.Datacube()
query = {
    'x': [-7.43862660215137, -4.377843103769999],
    'y': [54.02144007175762, 56.83505068441397],
    'time': ['2017-01-01', '2017-02-01'],
    'output_crs': 'EPSG:32630',
    'resolution': [100, 100],
    'group_by': 'solar_day',
    'dask_chunks': {'time': 1}
}
rainfall = dc.load(product='cmorph_rainfall', **query)
rainfall_list = rainfall.rainfall.mean(dim=['x', 'y']).values

this returns:

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan], dtype=float32)
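For what it's worth, a result like this is what you'd get if every pixel in every timestep read back as the nodata value (NaN for this product), so the mean itself is probably not the problem. A minimal numpy sketch with hypothetical stand-in data, not the real product:

```python
import warnings
import numpy as np

# If the nodata value is NaN and every pixel reads back as nodata, the
# per-timestep spatial mean is NaN -- there is nothing valid to average.
cube = np.full((3, 4, 4), np.nan, dtype="float32")  # (time, y, x)

with warnings.catch_warnings():
    # nanmean warns on all-NaN slices; suppress for the demonstration.
    warnings.simplefilter("ignore", RuntimeWarning)
    means = np.nanmean(cube, axis=(1, 2))

# means is an all-NaN array of shape (3,), mirroring the output above.
```

So an all-NaN mean points at the per-chunk reads failing, not at the reduction.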

Steps to reproduce the behaviour

I've loaded a Sentinel-2 image for the same area and time period using dask and that loads correctly, so either I am doing something wrong when using dask to calculate a 2D mean, or the images themselves are registered oddly.

The product yaml for cmorph_rainfall is:

name: cmorph_rainfall
description: NOAA Climate Data Record (CDR) of CPC Morphing Technique (CMORPH) High Resolution Global Precipitation Estimates, Version 1. From https://www.ncei.noaa.gov/access/metadata>
metadata_type: eo3

metadata:
  product:
    name: cmorph_rainfall

measurements:
  - name: rainfall
    dtype: float32
    nodata: NaN
    units: "degree"

and one of the dataset yamls looks as follows:

---
# Dataset
$schema: https://schemas.opendatacube.org/dataset
id: 1b668dbc-89f0-4b8f-bf0c-e3a16fd43dcb

label: cmorph_rainfall_2022-01-01
product:
  name: cmorph_rainfall

crs: epsg:4326
geometry:
  type: Polygon
  coordinates: [[[-180.0, 60.0], [180.0, 60.0], [180.0, -60.0], [-180.0, -60.0], [
        -180.0, 60.0]]]
grids:
  default:
    shape: [480, 1440]
    transform: [0.25, 0.0, -180.0, 0.0, 0.25, -60.0, 0.0, 0.0, 1.0]

properties:
  datetime: 2022-01-01 00:00:00Z
  odc:processing_datetime: 2023-03-28 14:09:45.581911Z
  odc:product_family: cmorph_rainfall

measurements:
  rainfall:
    path: cmorph_0.25_daily_20220101.tif

accessories: {}

lineage: {}
...
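
As a sanity check on the grid definition, the shape and affine transform above imply a footprint spanning the full -180..180 longitude range. A quick sketch, reading the transform as [a, b, c, d, e, f] mapping pixel (col, row) to (x, y):

```python
# Grid values copied from the dataset yaml above.
a, b, c, d, e, f = 0.25, 0.0, -180.0, 0.0, 0.25, -60.0
rows, cols = 480, 1440

left = c               # x of the first column edge
right = c + a * cols   # x of the last column edge
bottom = f             # y of the first row edge
top = f + e * rows     # y of the last row edge

print(left, right, bottom, top)  # -180.0 180.0 -60.0 60.0
```

That exact -180..180 span in native coordinates is the situation the comments below identify as problematic.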

I used eodatasets3 to generate the yamls (both data and yamls are in the same s3 bucket), with the following script being used to create the yamls.

from pathlib import Path

from eodatasets3 import DatasetPrepare

with DatasetPrepare(metadata_path=Path(local_yamlpath)) as p:
    p.product_family = 'cmorph_rainfall'
    p.datetime = start_date.strftime('%Y%m%dT%H%M%S')
    p.processed_now()

    p.note_measurement('rainfall', Path(local_tifpath))
    p.done()

Environment information

datacube version: 1.8.12

Any pointers on this greatly appreciated!


Kirill888 commented 1 year ago

probably due to this issue: #957

SpacemanPaul commented 1 year ago

I agree with Kirill. Datasets that span -180 to 180 longitude in native coordinates are problematic. You can hack around it by manually changing to something like -179.99999 to 180. It's a bit ugly, but it works.

JackLidge commented 1 year ago

I'll give it a go with that hack!