gjoseph92 / stackstac

Turn a STAC catalog into a dask-based xarray
https://stackstac.readthedocs.io
MIT License
238 stars 49 forks source link

Incorrect bounds and shape when stacking 3DEP items from Planetary Computer #147

Closed aazuspan closed 2 years ago

aazuspan commented 2 years ago

Hi @gjoseph92, I'm using stackstac.stack with two adjacent COGs from the Planetary Computer 3dep-seamless collection, and the output bounds and shape are incorrect. If I load them manually with xarray and concat them, the bounds are correct.

Reproducing

import planetary_computer
from pystac_client import Client
import stackstac
import xarray as xr

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
collection = "3dep-seamless"
# This bbox spans two items that are adjacent in x dimension
bbox = [-124.210979, 43.336502, -123.657496, 43.798309]

search = catalog.search(
    collections=[collection],
    bbox=bbox,
    # 3DEP contains both 10m and 30m items
    filter={"eq": [{"property": "gsd"}, 30]},
)

signed = planetary_computer.sign(search)

# Stack the two adjacent items with stackstac (doesn't work as expected)
da_stackstac = stackstac.stack(signed, assets=["data"])

# Stack the two items manually by opening and concatenating with xarray (works as expected)
da_list = [xr.open_rasterio(item.assets["data"].href, chunks=1024) for item in signed]
da_xarray = xr.concat(da_list, dim="time")

Issues

The bounds for the two arrays are much different and the stackstac version doesn't cover the the search bbox:

# Bounds don't cover the bbox: (-125.00167, 43.965540000000004, -123.96554, 44.001670000000004)
print(stackstac.array_bounds(da_stackstac))

# Bounds do cover the bbox: (-125.00152777777778, 42.99819444444364, -122.99819444444364, 44.001527777777774)
print(stackstac.array_bounds(da_xarray))

The array shapes (ignoring the time and band dims) are also much different:

# (3613, 103613)
print(da_stackstac.shape[2:])

# (3612, 7224)
print(da_xarray.shape[2:])

Checking the proj:shape property for the items indicates their shape should be 3612 x 3612, so it looks like the stackstac version is off by 1 in the y dimension and 100k in the x dimension.

Any thoughts on what's going on here? Thanks!

For reference, I'm running stackstac=0.4.1 and xarray=2022.3.0.

EDIT: This looked like it might be related to #132, but the proj:transform of the items seem to be in the correct order:

>> signed[0].properties

{'gsd': 30,
 'datetime': '2013-01-01T00:00:00Z',
 'proj:epsg': 5498,
 'proj:shape': [3612, 3612],
 'end_datetime': '2013-11-01T00:00:00Z',
 'proj:transform': [1e-05,
  0.0,
  -125.0016666667,
  0.0,
  -1e-05,
  44.00166666666,
  0.0,
  0.0,
  1.0],
 'start_datetime': '1999-02-01T00:00:00Z',
 'threedep:region': 'n40w130'}
gjoseph92 commented 2 years ago

Thanks for reporting @aazuspan. I haven't looked into this yet, but I think there's enough information here to figure it out.

FYI, you might want to look at the thread linked in https://github.com/gjoseph92/stackstac/issues/112. It's reasonable to expect that stackstac's bounds may differ slightly from what you get if you open the GeoTIFFs directly, because we're working with STAC metadata, which could be inaccurate (like in #132). However, this is such a significant mismatch, I'd guess something else is going on here.

I probably won't be able to look into this for a couple weeks. Would you or anyone else (@TomAugspurger @scottyhq?) be interested in taking a look in the meantime?

aazuspan commented 2 years ago

Gotcha, thanks for the reference @gjoseph92. So stackstac array bounds and shapes are estimates based on STAC metadata and won't necessarily match file bounds exactly--makes sense. The difference in y is negligible then, but the huge x error probably indicates a) incorrect item metadata or b) correct metadata somehow being misinterpreted.

I'll take a look and see if I can track down a problem, but any insights from folks who are more familiar with stackstac or STAC are definitely welcome.

TomAugspurger commented 2 years ago

are estimates based on STAC metadata and won't necessarily match file bounds exactly

I think they should match exactly. But it's possible there's a mismatch. Are you able look at the gdalinfo of the asset and compare it to the STAC metadata?

aazuspan commented 2 years ago

Looks like there's a pixel size discrepancy causing the issue: 0.00001 according to STAC vs. 0.000277 according to GDAL.

STAC metatdata ```python {'type': 'Feature', 'stac_version': '1.0.0', 'id': 'n44w125-1', 'properties': {'gsd': 30, 'datetime': '2013-01-01T00:00:00Z', 'proj:epsg': 5498, 'proj:shape': [3612, 3612], 'end_datetime': '2013-11-01T00:00:00Z', 'proj:transform': [1e-05, 0.0, -125.0016666667, 0.0, -1e-05, 44.00166666666, 0.0, 0.0, 1.0], 'start_datetime': '1999-02-01T00:00:00Z', 'threedep:region': 'n40w130'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[-123.99833333, 42.99833333], [-123.99833333, 44.00166667], [-125.00166667, 44.00166667], [-125.00166667, 42.99833333], [-123.99833333, 42.99833333]]]}, 'links': [{'rel': 'collection', 'href': 'https://planetarycomputer.microsoft.com/api/stac/v1/collections/3dep-seamless', 'type': 'application/json'}, {'rel': 'parent', 'href': 'https://planetarycomputer.microsoft.com/api/stac/v1/collections/3dep-seamless', 'type': 'application/json'}, {'rel': , 'href': 'https://planetarycomputer.microsoft.com/api/stac/v1', 'type': , 'title': 'Microsoft Planetary Computer STAC API'}, {'rel': 'self', 'href': 'https://planetarycomputer.microsoft.com/api/stac/v1/collections/3dep-seamless/items/n44w125-1', 'type': 'application/geo+json'}, {'rel': 'via', 'href': 'https://ai4edataeuwest.blob.core.windows.net/3dep/Elevation/1/TIFF/n44w125/USGS_1_n44w125.xml'}, {'rel': 'preview', 'href': 'https://planetarycomputer.microsoft.com/api/data/v1/item/map?collection=3dep-seamless&item=n44w125-1', 'type': 'text/html', 'title': 'Map of item'}], 'assets': {'data': {'href': 'https://ai4edataeuwest.blob.core.windows.net/3dep/Elevation/1/TIFF/n44w125/USGS_1_n44w125.tif?st=2022-04-29T23%3A13%3A27Z&se=2022-04-30T23%3A58%3A27Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2022-04-30T19%3A43%3A28Z&ske=2022-05-07T19%3A43%3A28Z&sks=b&skv=2020-06-12&sig=AgVxx/WNq/thlRHe9wO3ixVxXCWnk5ZxXtMwynWPaVY%3D', 'type': 'image/tiff; application=geotiff; profile=cloud-optimized', 'title': 'USGS 1 arc-second n44w125 1 x 1 degree', 'description': 'This tile of the 3D Elevation Program (3DEP) seamless products is 1 arc-second resolution. 3DEP data serve as the elevation layer of The National Map, and provide basic elevation information for Earth science studies and mapping applications in the United States. Scientists and resource managers use 3DEP data for global change research, hydrologic modeling, resource monitoring, mapping and visualization, and many other applications. 3DEP data compose an elevation dataset that consists of seamless layers and a high resolution layer. Each of these layers consists of the best available raster elevation data of the conterminous United States, Alaska, Hawaii, territorial islands, Mexico and Canada. 3DEP data are updated continually as new data become available. Seamless 3DEP data are derived from diverse source data that are processed to a common coordinate system and unit of vertical measure. These data are distributed in geographic coordinates in units of decimal degrees, and in conformance with the North American Datum of 1983 (NAD 83). All elevation values are in meters and, over the conterminous United States, are referenced to the North American Vertical Datum of 1988 (NAVD 88). The vertical reference will vary in other areas. Seamless 3DEP data are available nationally (except for Alaska) at resolutions of 1 arc-second (approximately 30 meters) and 1/3 arc-second (approximately 10 meters). In most of Alaska, only lower resolution source data are available. As a result, most seamless 3DEP data for Alaska are at 2 arc-second (approximately 60 meters) grid spacing. Part of Alaska is available at the 1 and 1/3 arc-second resolutions from interferometric synthetic aperture radar (ifsar) collections starting in 2010. Plans are in place for collection of statewide ifsar in Alaska. All 3DEP products are public domain.', 'roles': ['data']}, 'gpkg': {'href': 'https://ai4edataeuwest.blob.core.windows.net/3dep/Elevation/1/TIFF/n44w125/n44w125.gpkg?st=2022-04-29T23%3A13%3A27Z&se=2022-04-30T23%3A58%3A27Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2022-04-30T19%3A43%3A28Z&ske=2022-05-07T19%3A43%3A28Z&sks=b&skv=2020-06-12&sig=AgVxx/WNq/thlRHe9wO3ixVxXCWnk5ZxXtMwynWPaVY%3D', 'type': 'application/geopackage+sqlite3', 'roles': ['metadata']}, 'metadata': {'href': 'https://ai4edataeuwest.blob.core.windows.net/3dep/Elevation/1/TIFF/n44w125/USGS_1_n44w125.xml?st=2022-04-29T23%3A13%3A27Z&se=2022-04-30T23%3A58%3A27Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2022-04-30T19%3A43%3A28Z&ske=2022-05-07T19%3A43%3A28Z&sks=b&skv=2020-06-12&sig=AgVxx/WNq/thlRHe9wO3ixVxXCWnk5ZxXtMwynWPaVY%3D', 'type': 'application/xml', 'roles': ['metadata']}, 'thumbnail': {'href': 'https://ai4edataeuwest.blob.core.windows.net/3dep/Elevation/1/TIFF/n44w125/USGS_1_n44w125.jpg?st=2022-04-29T23%3A13%3A27Z&se=2022-04-30T23%3A58%3A27Z&sp=rl&sv=2020-06-12&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2022-04-30T19%3A43%3A28Z&ske=2022-05-07T19%3A43%3A28Z&sks=b&skv=2020-06-12&sig=AgVxx/WNq/thlRHe9wO3ixVxXCWnk5ZxXtMwynWPaVY%3D', 'type': 'image/jpeg', 'roles': ['thumbnail']}, 'tilejson': {'href': 'https://planetarycomputer.microsoft.com/api/data/v1/item/tilejson.json?collection=3dep-seamless&item=n44w125-1&assets=data&colormap_name=terrain&rescale=-1000,4000', 'type': 'application/json', 'title': 'TileJSON with default rendering', 'roles': ['tiles']}, 'rendered_preview': {'href': 'https://planetarycomputer.microsoft.com/api/data/v1/item/preview.png?collection=3dep-seamless&item=n44w125-1&assets=data&colormap_name=terrain&rescale=-1000,4000', 'type': 'image/png', 'title': 'Rendered preview', 'rel': 'preview', 'roles': ['overview']}}, 'bbox': [-125.0016666667, 42.99833333333, -123.9983333334, 44.00166666666], 'stac_extensions': ['https://stac-extensions.github.io/projection/v1.0.0/schema.json'], 'collection': '3dep-seamless'} ```
gdalinfo ``` Driver: GTiff/GeoTIFF Files: USGS_1_n44w125.tif Size is 3612, 3612 Coordinate System is: GEOGCRS["NAD83", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101004, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], CS[ellipsoidal,2], AXIS["geodetic latitude (Lat)",north, ORDER[1], ANGLEUNIT["degree",0.0174532925199433]], AXIS["geodetic longitude (Lon)",east, ORDER[2], ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4269]] Data axis to CRS axis mapping: 2,1 Origin = (-125.001666666666665,44.001666666666665) Pixel Size = (0.000277777777778,-0.000277777777778) Metadata: AREA_OR_POINT=Area Image Structure Metadata: COMPRESSION=LZW INTERLEAVE=BAND PREDICTOR=3 Corner Coordinates: Upper Left (-125.0016667, 44.0016667) (125d 0' 6.00"W, 44d 0' 6.00"N) Lower Left (-125.0016667, 42.9983333) (125d 0' 6.00"W, 42d59'54.00"N) Upper Right (-123.9983333, 44.0016667) (123d59'54.00"W, 44d 0' 6.00"N) Lower Right (-123.9983333, 42.9983333) (123d59'54.00"W, 42d59'54.00"N) Center (-124.5000000, 43.5000000) (124d30' 0.00"W, 43d30' 0.00"N) Band 1 Block=256x256 Type=Float32, ColorInterp=Gray Description = Layer_1 Min=-5.715 Max=517.677 Minimum=-5.715, Maximum=517.677, Mean=27.307, StdDev=64.316 NoData Value=-999999 Overviews: 1806x1806, 903x903, 452x452, 226x226, 113x113 Metadata: LAYER_TYPE=athematic STATISTICS_MAXIMUM=517.67651367188 STATISTICS_MEAN=27.306801648412 STATISTICS_MINIMUM=-5.7150769233704 STATISTICS_STDDEV=64.31571761273 STATISTICS_VALID_PERCENT=99.83 ```

Manually editing the proj:transform with the correct pixel size fixes the x bounds and shape being wildly off. The off-by-one discrepancy in the shape arises from geom_utils.snapped_bounds rounding min coords down and max coords up, which can be disabled with snap_bounds=False. Looks like that's as-designed, @gjoseph92?

@TomAugspurger, want me to open an issue on Planetary Computer for the 3dep STAC transform issue? It seems to affect 10m items in that collection as well.

TomAugspurger commented 2 years ago

Thanks. I opened https://github.com/stactools-packages/threedep/issues/9, which is what we used to generate the STAC metadata. We'll look into it and see what's up.

gjoseph92 commented 2 years ago

The off-by-one discrepancy in the shape arises from geom_utils.snapped_bounds rounding min coords down and max coords up, which can be disabled with snap_bounds=False. Looks like that's as-designed, @gjoseph92?

Sorry I missed this—yes, that's exactly the point of snapped_bounds. I feel like snap_bounds=True is usually what you want, curious if you agree or if that's a bad default.

So sounds like the takeaway is that STAC metadata was off, leading to unexpected results? Nothing to do on the stackstac end?

@aazuspan I'd be curious to confirm—even though the stackstac array was a wildly different shape than you expected, was the data still reasonable? I assume that if you specified resolution= 0.000277 (and maybe bounds too?), you'd get what you wanted?

aazuspan commented 2 years ago

Thanks for following up! Yeah, STAC metadata plus my confusion about snap_bounds was the problem, not stackstac. It looks like the STAC for those items hasn't been updated yet, but manually setting the correct proj:transform before stacking confirmed that it should work once that's up. I wasn't able to get usable data without setting the proj:transform. Specifying resolution and bounds gave the correct array shape, but the data was ~95% nans. I can post reproducible examples of what I was doing if you're interested.

I do think defaulting to snap_bounds=True can lead to surprising behavior with the array shape and bounds not matching STAC/GDAL. Is there a downside to leaving fractional offsets? My preference would be to return the data generally "as-is", without snapping by default, unless there was a reason that might cause problems.

gjoseph92 commented 2 years ago

Specifying resolution and bounds gave the correct array shape, but the data was ~95% nans

Ah. I wouldn't be surprised if stackstac thought that certain chunks were entirely out of bounds for a certain asset, so it skipped opening and reading them entirely. I bet if you increased the spatial chunksize a ton (like maybe even to -1), you'd get the data you expect, but what's the point of that?

Is there a downside to leaving fractional offsets? My preference would be to return the data generally "as-is"

Especially for users newer to geospatial, not dealing with fractional offsets seemed more approachable and intuitive to me. And in many cases, data "is-is" goes out the window when you're stacking—you usually are warping most assets to a common grid and bounding box anyway, at least for sensors like landsat/sentinel/etc. So wherever there's been a tradeoff between usability and preservation of original data, I've gone for usability, since stackstac is less a tool for retrieving original data than for combining lots of pieces of original data together to get them into a format that's useful to you for downstream analysis.

aazuspan commented 2 years ago

I bet if you increased the spatial chunksize a ton (like maybe even to -1), you'd get the data you expect

Yep, chunksize=-1 plus specifying bounds and resolution did the trick. Good to know there's a workaround for incorrect STAC metadata, without having to manually edit it.

I've gone for usability, since stackstac is less a tool for retrieving original data than for combining lots of pieces of original data together

Totally reasonable. I was thinking in the context of nice, evenly-spaced 3dep tiles, but snapping definitely makes more sense when you have irregular, overlapping scenes like Sentinel.

Thanks for your help, and for building stackstac! I'm ready to close this, unless you'd rather keep it open in case someone else runs into it before the STAC is fixed.

gjoseph92 commented 2 years ago

Looks like https://github.com/stactools-packages/threedep/issues/9 is closed, but @TomAugspurger are we still waiting for the data to be reprocessed? If that will happen in the next week or so, let's leave it open, otherwise we can close.

TomAugspurger commented 2 years ago

Yeah, the items need to be regenerated with the patched stactools and re-ingested. No ETA on when that'll happen right now.

gjoseph92 commented 2 years ago

I've opened and pinned https://github.com/gjoseph92/stackstac/issues/152 for these sorts of issues, so hopefully others running into the same thing, or similar things in the future, will find that. So I'm closing this specific one now.