gjoseph92 / stackstac

Turn a STAC catalog into a dask-based xarray
https://stackstac.readthedocs.io
MIT License

Revamp Dask Array creation logic to be fully blockwise #105

Closed · gjoseph92 closed 2 years ago

gjoseph92 commented 2 years ago

When https://github.com/dask/dask/pull/7417 gets in, it may both break the current dask-array-creation hacks and open the door for a much, much simpler approach: just use da.from_array.

We didn't use da.from_array originally, and went through all the current rigmarole, because from_array generates a low-level graph, which can be enormous (and slow) for the large datasets we load. But once from_array uses Blockwise, it will be far simpler and more efficient.

We'll just switch to having an array-like class that wraps the asset table and other parameters, and whose __getitem__ basically calls fetch_raster_window. However, it's likely worth thinking about just combining all this into the Reader protocol in some way.
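A minimal sketch of what that could look like, assuming from_array becomes blockwise. `AssetArray`, its constructor parameters, and the zero-filled stub in `__getitem__` (standing in for the real fetch_raster_window call) are all hypothetical:

```python
import numpy as np
import dask.array as da


class AssetArray:
    """Array-like view over a (time, band) asset table; dask slices it lazily."""

    def __init__(self, asset_table, shape, dtype):
        self.asset_table = asset_table  # (time, band) table of asset entries
        self.shape = shape              # full (time, band, y, x) shape
        self.dtype = np.dtype(dtype)
        self.ndim = len(shape)

    def __getitem__(self, index):
        # With a blockwise from_array, dask calls this once per chunk.
        # Real code would call fetch_raster_window(self.asset_table, index)
        # to do the windowed read; this stub just returns a correctly-shaped
        # block of zeros so the sketch is runnable.
        out_shape = tuple(
            len(range(*ix.indices(dim))) for ix, dim in zip(index, self.shape)
        )
        return np.zeros(out_shape, self.dtype)


asset_table = np.empty((10, 3), dtype=object)  # placeholder asset entries
arr = AssetArray(asset_table, shape=(10, 3, 10980, 10980), dtype="float64")
stack = da.from_array(arr, chunks=(1, 1, 1024, 1024), name="stack")
```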

This will also make it easier to do https://github.com/gjoseph92/stackstac/issues/62 (and may even inadvertently solve it).

gjoseph92 commented 2 years ago

Thinking about this a bit more, we may want to stick with the current approach (or a form of it).

Currently, we make one initial array of (time, band): the "asset table". Each asset (file) is its own chunk. We map a function over this array that opens each file.

Then we make a (y, x) array in which each chunk holds the slice needed for a windowed read covering that chunk's spatial area. This is currently a low-level graph, but could easily become blockwise using a BlockwiseDep like this.

When you broadcast those two arrays together, you get the final (time, band, y, x) array, as sketched below.
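In rough code, the current approach would translate to something like the following blockwise sketch. `open_block` and `read_window` are stand-in stubs (the real versions would open a rasterio Dataset and do the windowed read), and the shapes and chunk sizes are made up:

```python
import numpy as np
import dask.array as da


def open_block(block):
    # Stub: the real version opens a rasterio Dataset (a Reader) per asset.
    return block


def read_window(datasets, windows):
    # Stub: the real version reads `windows[0, 0]` from `datasets[0, 0]`.
    # Each input block is a (1, 1) object array (one asset, one window).
    ys, xs = windows[0, 0]
    return np.zeros((1, 1, ys.stop - ys.start, xs.stop - xs.start))


# (time, band): one chunk per asset, with an opener mapped over it.
asset_table = np.empty((10, 3), dtype=object)  # placeholder asset entries
datasets = da.from_array(asset_table, chunks=1).map_blocks(open_block, dtype=object)

# (y, x): one chunk per spatial window, each holding its read slices.
windows = np.empty((4, 4), dtype=object)
for i in range(4):
    for j in range(4):
        windows[i, j] = (slice(i * 1024, (i + 1) * 1024),
                         slice(j * 1024, (j + 1) * 1024))
windows = da.from_array(windows, chunks=1)

# Broadcasting (time, band) against (y, x) yields (time, band, y, x);
# each output chunk depends on exactly one dataset key and one window key.
stack = da.blockwise(
    read_window, "tbyx",
    datasets, "tb",
    windows, "yx",
    adjust_chunks={"y": 1024, "x": 1024},  # each window yields a 1024² block
    dtype="float64",
)
```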

The nice things are:

  1. The Reader (rio Dataset) for each file is a separate dask key. This is reasonable—Datasets are not lightweight things to move around.
  2. Opening each Dataset gets its own task.
  3. Reading from the dataset depends on the dataset's key (as opposed to implicitly opening the dataset). This means Dask will prefer to schedule reads from a given Dataset onto workers that already have that dataset open.

    Without this, I'd worry that we'd basically end up re-opening every dataset on every worker.
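To illustrate point 3, here's a hand-wavy dask.delayed analogy (not stackstac code) of why reads depending on the dataset's key matters for scheduling:

```python
import dask


@dask.delayed
def open_dataset(url):
    # Stub: really something like rasterio.open(url)
    return url


@dask.delayed
def read_window(dataset, window):
    # Stub: really a windowed read from the open dataset
    return (dataset, window)


ds = open_dataset("https://example.com/B04.tif")  # one key per file
reads = [read_window(ds, w) for w in [(0, 0), (0, 1), (1, 0)]]
# Every read task depends on the single `ds` key, so the distributed
# scheduler prefers to run them on the worker that already holds the
# open dataset, instead of implicitly re-opening the file in each read.
```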

If we used from_array with one giant array-like object, we'd effectively be saying that opening any particular file, at any particular coordinates, has equal cost. Opening two different files at the same coordinates would appear equivalent to opening the same file at two different coordinates, which is clearly not true.

gjoseph92 commented 2 years ago

Addressed in https://github.com/gjoseph92/stackstac/pull/116