Open gjoseph92 opened 8 months ago
@gjoseph92 what is preventing you from merging? Would a deep code review help? I tried going through it quickly, but i's too far in my head and there has been a looot of changes from my initial code but in a good way though.
@Berhinj simply haven't gotten around to taking another look at it. I wanted to give it a few days before taking another pass at self-review. If you could try out this branch on your use case and confirm it helps, that would be very helpful to get a real-world test!
@gjoseph92 just tried it and compared to the results I was getting from my PR, the multicoordinnates aspect is working good!
Issue though, I notice that the "earthsearch:boa_offset_applied" boolean array from sentinel 2 became float... Same for the "s2:processing_baseline" coords which went from characters to float
I notice that the "earthsearch:boa_offset_applied" boolean array from sentinel 2 became float
Good point. I should special-case booleans.
@gjoseph92 just tried it and compared to the results I was getting from my PR, the multicoordinnates aspect is working good!
Same for the "s2:processing_baseline" coords which went from characters to float
I can't reproduce this. I'm getting array('05.09', dtype=object)
, which is still a string. I'm just running through the basic example from the docs for this.
@Berhinj booleans should be handled correctly now; I'm getting earthsearch:boa_offset_applied
as a bool now. Do you have any other feedback here?
@gjoseph92 no, that's perfect, thanks
any chance this will be merged soon? :)
This refactors the STAC metadata -> xarray coordinates logic to support multi-dimensional coordinates. For example, imagine a field like
rescaling
(orhref
, for that matter), that's different for every asset of every item. Previously, we'd drop this field. Now, we can store it as a 2D array indexing the dimensionstime, band
.This builds off of @Berhinj's work in https://github.com/gjoseph92/stackstac/pull/222, which provided a lot of inspiration for this design.
This is a significant rewrite of the metadata logic. The basic idea now is, for each field in STAC metadata:
time, band
dimensions)type
field will be an MxN array ofimage/tiff; application=geotiff
over and over; this is collapsed down into a 0D scalar.Unfortunately, it does come at a slight performance cost:
stackstac.stack
on 10,000 Landast-8 items is 13.6% slower, going from 13.2s to 15s. (Of course, there's no performance difference with actually computing the results, which is where the bulk of time is spent anyway.)There are a number of benefits, however:
roles
field, which contains variable-length lists of tags per band like["data"]
,["cloud", "cloud-shadow"]
, is now stored. Same with complex things likeclassification:bitfields
. More interestingly, we could now easily retaingeometry
,bbox
, etc. xref https://github.com/gjoseph92/stackstac/issues/6raster:bands
subfields are now in coordinates, and supporting extensions likeeo:bands
,raster:bands
, etc. is now generalized and easy to extend in the futurerescale
logic could be simplified into a one-liner after the stack is built (stack * stack.coords['raster:bands_scale'] + stack.coords['raster:bands_offset']
), rather than plumbing scale/offset factors through many layers. (TBD if this makes a nasty Dask graph though.)There's maybe even a future where the logic to generate the Dask array takes these coordinates as input, rather than raw STAC dicts or pystac items? Perhaps if we can someday store the geometries as a GeoSeries, or something like that.
Closes #216