Closed gjoseph92 closed 2 years ago
Thinking about this a bit more, we may want to stick with the current approach (or a form of it).
Currently, we make one initial array of (time, band)
—this is the "asset table". Each asset (file) is its own chunk. We map a function over this opening each file.
Then we make an array of (y, x)
containing the slice
needed to do a windowed read to cover that spatial area. This is currently low-level, but could easily become blockwise using a BlockwiseDep like this.
When you broadcast those two arrays together, you get the (time, band, y, x)
final array.
The nice things are:
Reader
(rio Dataset) for each file is a separate dask key. This is reasonable—Datasets are not lightweight things to move around.Reading from the dataset depends on the dataset's key (as opposed to implicitly opening the dataset). This means Dask will prefer to schedule reads from a given Dataset onto workers that already have that dataset open.
Without this, I'd worry that we'd basically end up re-opening every dataset on every worker.
If we used from_array
with a giant array-like object, we'd kind of be saying that opening any particular file, at any particular coordinates, had equal cost. Opening two different files at the same coordinates appear to be equivalent to opening the same file at two different coordinates—clearly not true.
Addressed in https://github.com/gjoseph92/stackstac/pull/116
When https://github.com/dask/dask/pull/7417 gets in, it may both break the current dask-array-creation hacks, and open the door for a much, much simpler approach: just use
da.from_array
.We didn't use
da.from_array
originally and went through all the current rigamarole becausefrom_array
generates a low-level graph, which can be enormous (and slow) for the large datasets we load. But oncefrom_array
uses Blockwise, it will be far simpler and more efficient.We'll just switch to having an array-like class that wraps the asset table and other parameters, and whose
__getitem__
basically callsfetch_raster_window
. However, it's likely worth thinking about just combining all this into the Reader protocol in some way.This will also make it easier to do https://github.com/gjoseph92/stackstac/issues/62 (and may even inadvertently solve it).