opendatacube / datacube-ows

Open Data Cube Open Web Services

XML error results are returned with 200 status code #1000

Closed whatnick closed 6 months ago

whatnick commented 6 months ago

Description

Flaky layers sometimes fail in certain pods and succeed in others when running under Kubernetes; this is particularly the case for on-the-fly time composites that load multiple datasets over time. The failing pods intermittently return an XML error with a 200 status code, which causes clients such as TerriaJS to disable the layer completely instead of retrying the request as they would on a 5xx response.
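The client-side symptom can be expressed as a minimal checker (helper names are hypothetical, not part of datacube-ows): an XML ServiceExceptionReport body combined with an HTTP 200 status is exactly what trips TerriaJS, because clients treat 200 as authoritative and stop retrying.

```python
# Sketch: classify a WMS GetMap response. An XML error document served
# with a 200 status is the misbehaviour described in this issue; a 5xx
# status would instead signal a retryable failure.

def is_misreported_error(status_code: int, content_type: str, body: bytes) -> bool:
    """True when an XML error document is served with a success status."""
    looks_like_xml_error = (
        b"ServiceExceptionReport" in body
        or content_type.startswith(("text/xml", "application/xml"))
    )
    return status_code == 200 and looks_like_xml_error

def should_retry(status_code: int) -> bool:
    """Clients typically retry only on server-side (5xx) failures."""
    return 500 <= status_code < 600
```

A healthy tile (`image/png`, 200) and a correctly reported error (XML, 500) both pass this check; only the XML-with-200 combination is flagged.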

Steps to Reproduce

Trial a 45-day Landsat 8/9 Composite layer hosted on CSIRO ADIAS + Terria at this link

The layer will disable itself automatically due to the OWS behaviour described above.

Context (Environment)

datacube-ows version (datacube-ows --version): 0.1.34

ows_config.py file (link, sample code)

{
    "title": "Landsat 8/9 Collection 2 Level-2 SR Latest Mosaic",
    "name": "landsat_8_9_c2l2_sr_latest",
    "abstract": """Latest mosaic of Landsat 8/9 Surface Reflectance""",
    "multi_product": True,
    "product_names": ["landsat9_c2l2_sr", "landsat8_c2l2_sr"],
    "bands": bands_landsat8_c2l2_sr,
    "resource_limits": reslim_standard,
    "native_crs": "EPSG:3577",
    "native_resolution": [25, -25],
    "image_processing": {
        "extent_mask_func": "datacube_ows.ogc_utils.mask_by_val",
        "always_fetch_bands": [],
        "manual_merge": False,
    },
    "flags": [
        {"band": "qa_pixel"},
        {"band": "qa_radsat"},
    ],
    "time_axis": {"time_interval": 1},
    "styling": {
        "default_style": "true_colour",
        "styles": styles_landsat8_c2l2_sr,
    },
    "mosaic_date_func": {
        "function": "ows_easi.ows_util_tools.rolling_window_ndays",
        "pass_layer_cfg": True,
        "kwargs": {"ndays": 45},
    },
},

The standard rolling_window_ndays actually errors on this layer, so a custom function is used to set the start/end days without a geometry lookup:

def rolling_window_ndays(
    available_dates: Sequence[datetime.datetime],
    layer_cfg: "datacube_ows.ows_configuration.OWSNamedLayer",
    ndays: int = 6,
) -> Tuple[datetime.datetime, datetime.datetime]:
    # Take the last ndays entries; using days[0] (rather than days[-ndays])
    # avoids an IndexError when fewer than ndays dates are available.
    days = available_dates[-ndays:]
    start = days[0]
    end = days[-1]
    _LOG.info(f"Selected start: {start} and end: {end}")
    return (start, end)

SpacemanPaul commented 6 months ago

Agree error messages shouldn't be returned with 200. I can't see any obvious issues with the internally generated error responses so I suspect it must be some kind of uncaught exception breaking out. Could you please attach the XML error message associated with a bad response? It may help reproduce and track down the issue.

I can see the error messages from Terria, but I can't see anything unusual in the browser dev tools network traffic log - everything shows up as either OK + image or DNS error.

whatnick commented 6 months ago

Here is a sample error:

<ServiceExceptionReport xmlns="http://www.opengis.net/ogc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.3.0" xsi:schemaLocation="http://www.opengis.net/ogc http://schemas.opengis.net/wms/1.3.0/exceptions_1_3_0.xsd">
<ServiceException> Unexpected server error: The specified key does not exist. </ServiceException>
<ServiceException>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/ogc.py, line 152 in ogc_svc_impl> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/utils.py, line 29 in log_wrapper> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/wms.py, line 29 in handle_wms> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/utils.py, line 29 in log_wrapper> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/data.py, line 508 in get_map> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/utils.py, line 29 in log_wrapper> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/data.py, line 253 in data> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/utils.py, line 29 in log_wrapper> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/data.py, line 341 in read_data> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube/api/core.py, line 801 in load_data> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube/api/core.py, line 729 in _xr_load> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube/api/core.py, line 952 in _fuse_measurement> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube/storage/_load.py, line 89 in reproject_and_fuse> ]]>
<![CDATA[ <FrameSummary file /usr/lib/python3.10/contextlib.py, line 135 in __enter__> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube/storage/_rio.py, line 146 in open> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube/storage/_rio.py, line 117 in open> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/rasterio/env.py, line 451 in wrapper> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/rasterio/__init__.py, line 304 in open> ]]>
<![CDATA[ <FrameSummary file rasterio/_base.pyx, line 310 in rasterio._base.DatasetBase.__init__> ]]>
</ServiceException>
</ServiceExceptionReport>

It can be reproduced at https://ows.adias.aquawatchaus.space/?transparent=true&format=image%2Fpng&exceptions=XML&styles=ndti&version=1.3.0&service=WMS&request=GetMap&layers=landsat_8_9_c2l2_sr_latest&bbox=16163068.253070232%2C-2113330.958028555%2C16202204.01155224%2C-2074195.199546542&width=256&height=256&crs=EPSG%3A3857
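When triaging these intermittently failing pods, the ServiceExceptionReport bodies can be collected and parsed mechanically. A minimal sketch using only the stdlib and the WMS 1.3.0 exception namespace seen in the response above:

```python
import xml.etree.ElementTree as ET

# Namespace from the ServiceExceptionReport's xmlns attribute.
OGC_NS = "{http://www.opengis.net/ogc}"

def service_exceptions(xml_text: str) -> list[str]:
    """Extract the text of each ServiceException from a WMS 1.3.0 error report."""
    root = ET.fromstring(xml_text)
    return [
        (elem.text or "").strip()
        for elem in root.iter(f"{OGC_NS}ServiceException")
    ]
```

Applied to the report above, the first extracted message would be the "Unexpected server error: The specified key does not exist." line; the CDATA frame summaries come back as the second entry's text.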

SpacemanPaul commented 6 months ago

The example above is coming back with the appropriate response code though - 500. Do you want to change the description for this issue?

SpacemanPaul commented 6 months ago

Have you got an Explorer instance up for this data? Is there a reason you have the rolling window width set to 45 days? That's multiple Landsat revisits, especially with both LS8 and LS9 in play.

whatnick commented 6 months ago

Here is the relevant Explorer: https://explorer.adias.aquawatchaus.space/products

Using the stock rolling_window_ndays produces an error:

<ServiceExceptionReport xmlns="http://www.opengis.net/ogc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.3.0" xsi:schemaLocation="http://www.opengis.net/ogc http://schemas.opengis.net/wms/1.3.0/exceptions_1_3_0.xsd">
<ServiceException> Unexpected server error: list index out of range </ServiceException>
<ServiceException>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/ogc.py, line 152 in ogc_svc_impl> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/utils.py, line 29 in log_wrapper> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/wms.py, line 29 in handle_wms> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/utils.py, line 29 in log_wrapper> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/data.py, line 446 in get_map> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/utils.py, line 29 in log_wrapper> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/data.py, line 147 in __init__> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/ogc_utils.py, line 378 in __call__> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/ogc_utils.py, line 473 in rolling_window_ndays> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/ows_configuration.py, line 922 in search_times> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/ows_configuration.py, line 358 in search_times> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/ogc_utils.py, line 119 in local_solar_date_range> ]]>
<![CDATA[ <FrameSummary file /usr/local/lib/python3.10/dist-packages/datacube_ows/ogc_utils.py, line 185 in tz_for_geometry> ]]>
</ServiceException>
</ServiceExceptionReport>
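Note that this stock traceback bottoms out in tz_for_geometry, so the failure appears to be in the geometry-based timezone lookup rather than the date slicing itself. Still, a window function that clamps to the available dates avoids one class of IndexError; a sketch of that approach (not the stock implementation, and with the layer_cfg argument omitted for brevity):

```python
import datetime
from typing import Sequence, Tuple

def rolling_window_clamped(
    available_dates: Sequence[datetime.datetime],
    ndays: int = 6,
) -> Tuple[datetime.datetime, datetime.datetime]:
    """Select the last `ndays` dates, clamping when fewer are available."""
    if not available_dates:
        raise ValueError("layer has no available dates")
    window = available_dates[-ndays:]  # slicing never raises, even if len < ndays
    return window[0], window[-1]
```

With ten daily dates and ndays=6 this returns the fifth and tenth dates; with only two dates it degrades gracefully to the full range instead of raising.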

SpacemanPaul commented 6 months ago

I think it might be an indexing issue - can I see the style definition for ndti as well?

SpacemanPaul commented 6 months ago

I think you might have a dataset somewhere with a bad extent geometry. Any chance you could work out what dataset(s) cover that bounding box and rolling window and take a look?

whatnick commented 6 months ago

NDTI is a normalized difference index, similar to NDVI, with red/green bands as inputs. The bad extent is the more likely culprit in this case. It would be good to patch OWS to be more resilient to this, but that is a separate issue from the XML response code; I will keep investigating the index.

SpacemanPaul commented 6 months ago

Multi-date mosaics have never been used by DEA or DEAfrica, though, so they are fairly poorly tested; I'm not ruling out an OWS issue.

If we can identify the problem dataset/datasets and what exactly is causing the problem it might be possible to make OWS more resilient, but I'm just making educated guesses about what the problem is at this stage. (Note that "being more resilient to indexing issues like this" might involve always returning blank tiles for that dataset or raising an error at the datacube-ows-update stage instead of at runtime.)

whatnick commented 6 months ago

It looks like the read_data method is currently untested. We could pass skip_broken_datasets=True in the Datacube.load_data call to skip datasets that are in the index but not readable from source. Examination of the broken dataset revealed that it has the wrong S3 permissions at the USGS end, preventing it from being read.
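Conceptually, skip_broken_datasets=True amounts to catching per-dataset read failures and fusing whatever remains, rather than letting one unreadable S3 object poison the whole tile. A simplified, hypothetical sketch of that behaviour (not the actual datacube implementation):

```python
from typing import Callable, Iterable, List

def load_skipping_broken(
    readers: Iterable[Callable[[], List[int]]],
) -> List[List[int]]:
    """Call each dataset reader, skipping (and logging) any that raise."""
    loaded = []
    for read in readers:
        try:
            loaded.append(read())
        except Exception as exc:  # e.g. rasterio's "The specified key does not exist."
            print(f"skipping broken dataset: {exc}")
    return loaded
```

The trade-off is that a tile rendered this way silently omits the broken dataset's contribution, which may be preferable to an intermittent error response for a mosaic layer.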

I will also note that the read_data failure path is currently untested; we may throw some broken datasets into the test data to reproduce the error.

SpacemanPaul commented 6 months ago

read_data is tested by the WMS end-to-end integration tests, but doesn't have thorough coverage, no.