opendatacube / odc-stac

Load STAC items into xarray Datasets.
Apache License 2.0
140 stars 20 forks source link

.load() only loading overlap between scenes with HLS data? #177

Closed ithaca-oyler closed 1 month ago

ithaca-oyler commented 1 month ago

I'm currently using HLS data (both HLSL30.v2.0 and HLSS30.v2.0 collections) for a short time frame (5-16-2024 through 5-18-2024) and I noticed that only areas that have an overlap of both collections still have imagery in the final output. I'm at a bit of a loss as to why this may be occurring, but I suspect it may have to do with the way the lazy xarray dataset is built. I thought setting the nodata value might help (as mentioned in https://github.com/opendatacube/odc-stac/issues/162), but the output did not change. The HLS config contains information about band names, band values, cloud masking information, etc.

I am currently using odc-geo==0.4.6 and odc-stac==0.3.9.

This image shows what we expect to see: image

And this image shows what is loaded in with data = stac.stac_load(item_collection, geopolygon=roi, chunks={"x": "auto", "y": "auto","time":"auto"}, output_crs=crs, dtype='int16', resolution=30, stac_cfg=hls_cfg )

image

I know there are a lot of moving parts in this issue, but any insights as to why something like this might be happening will be very helpful :) Thank you!

Kirill888 commented 1 month ago

@ithaca-oyler please provide replication code, what stac query you are running, how are you calling stac load etc. This certainly looks like broken nodata marker, do the data files have expected nodata field set? By default we trust nodata marker in the file over all other places. Do you get same problem when running without Dask?

ithaca-oyler commented 1 month ago

Hi @Kirill888 -- attached is a zip file that contains a sample notebook, the config file, and an example of what one of the metadata files look like. According to the metadata, the data files expect to have nodata set to -9999 (listed as FILLVALUE in the metadata). The output images shown in my previous comment are the outputs from odc.algo xr_quantile at the end of the notebook. I have tried several different combinations of setting the nodata value in the config file, as a stac.load() call, and in xr_quantile() with no good results.

replication-code.zip

Kirill888 commented 1 month ago

@ithaca-oyler the code does way too much and requires complex environment to run. Please provide a simpler example using single band only. My guess is that data bands work probably fine, but Fmask band is not, need to specify nodata=255 for it. Please debug more on your end.

ithaca-oyler commented 1 month ago

@Kirill888 Thank you for your patience while I put together a simpler example for you. I'm taking out the cloud masking step, but still making the change you suggested with specifying nodata=255 in the HLS config I was using. I hope to have an updated notebook made in the next day or so. In the meantime, I have done some additional debugging and I think you're on the right track with part of the issue being in the Fmask band. Fixing some of the HLS config helped and brought back some of the pieces of data that are between the HLS tiles.

Here's what specifying nodata=255 and removing the cloud masking step resulted in: image

I also decided to experiment with the chunks setting in stac.load() and I was surprised to find changing up this setting produced different results. For the image above, the setting was chunks={"x": "auto", "y": "auto","time":"auto}. Here, I changed it to chunks={"x": 512, "y": 512,"time":"auto"}

image

And here I changed it to chunks={"x": 1024, "y": 1024,"time":"auto"} image

Do you have any thoughts as to why the output might be changing with the chunk sizes?

ithaca-oyler commented 1 month ago

Hi @Kirill888 -- a few updates. I've done a bit more debugging and I was able to determine that there were several issues contributing to my results, namely cloud masking, the no data flags I had set in my HLS config file, and needing to select the correct groupby argument in .load(). Below is the result of nodata=0 in my config file and data = stac.load(item_collection, geopolygon=roi, chunks={"x": "auto", "y": "auto","time":"auto"}, output_crs=crs, resolution=30, stac_cfg=cfg, nodata=np.nan, dtype='float64', groupby='solar_day') for the same subset of 3 days shown in previous comments (no cloud masking): image

The fragmentation shown in my original question occurs with a combination of the nodata value in my HLS config not being set, the xr_quantile() nodata value being set to 0, and groupby='time' or groupby not being set at all. Lots of moving parts and places where things could go wrong.

The end results output from xr_quantile() still aren't quite looking how I expect, but I think I've ruled out .load() as the source of my problems given how the rgb time slices look. I'll hop over to the odc-algo repo to post my question about xr_quantile() there.

Thanks a bunch!