Open jakirkham opened 5 years ago
+1. The split_by_chunks method in this comment (https://github.com/pydata/xarray/issues/1093#issuecomment-259213382) would also be useful for more general per-chunk manipulation.
That sounds somewhat similar to the .blocks accessor in Dask Array ( https://github.com/dask/dask/pull/3689 ). Maybe we should align on that as well?
Another approach for the split_by_chunks implementation would be...

    import dask.array as da

    def split_by_chunks(a):
        # Yield (slices, sub-array) pairs, one per chunk of the Dask Array.
        for sl in da.core.slices_from_chunks(a.chunks):
            yield (sl, a[sl])
While a little bit more cumbersome to write, this could be implemented with .blocks and may be a bit more performant.

    import numpy as np
    import dask.array as da

    def split_by_chunks(a):
        # Pair each block index with the slices that block covers.
        for i, sl in zip(np.ndindex(a.numblocks), da.core.slices_from_chunks(a.chunks)):
            yield (sl, a.blocks[i])
If the slices are not strictly needed, this could be simplified a bit more.

    import numpy as np

    def split_by_chunks(a):
        # Yield each block of the Dask Array in C (row-major) order.
        for i in np.ndindex(a.numblocks):
            yield a.blocks[i]
Admittedly, slices_from_chunks is an internal utility function, though it is unlikely to change. We could consider exposing it as part of the API if that is useful.
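To make the role of slices_from_chunks concrete, here is a minimal stdlib-only sketch of the idea: turning per-axis chunk sizes into one tuple of slices per chunk. The name chunk_slices is hypothetical and this is not Dask's implementation; it only assumes the tuple-of-tuples layout of a.chunks.

```python
from itertools import accumulate, product


def chunk_slices(chunks):
    """Return one tuple of slices per chunk, in C (row-major) order.

    `chunks` is a tuple of per-axis chunk sizes, e.g. ((2, 3), (4,)).
    Hypothetical helper for illustration; not part of Dask's API.
    """
    per_axis = []
    for sizes in chunks:
        stops = list(accumulate(sizes))  # cumulative end offset of each chunk
        starts = [0] + stops[:-1]        # each chunk starts where the previous ended
        per_axis.append([slice(a, b) for a, b in zip(starts, stops)])
    # Cartesian product over axes; the last axis varies fastest.
    return list(product(*per_axis))


# A 5x4 array split into a 2-row block and a 3-row block.
slices = chunk_slices(((2, 3), (4,)))
```

Because the product runs with the last axis varying fastest, the result lines up with np.ndindex over the block grid, which is what the zip in the .blocks variant relies on.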
We could also consider other things, like making .blocks iterable, which could make this more friendly as well. Raised an issue ( https://github.com/dask/dask/issues/5117 ) on this point.
map_blocks went in as of #3276. We'll leave this open for the future work implementing map_overlap.
I'm thinking through a map_overlap API right now. In dask, map_overlap requires a few extra arguments:
depth: int, tuple, dict or list
    The number of elements that each block should share with its neighbors.
    If a tuple or dict then this can be different per axis.
    If a list then each element of that list must be an int, tuple or dict
    defining depth for the corresponding array in `args`.
    Asymmetric depths may be specified using a dict value of (-/+) tuples.
    Note that asymmetric depths are currently only supported when
    ``boundary`` is 'none'.
    The default value is 0.
boundary: str, tuple, dict or list
    How to handle the boundaries.
    Values include 'reflect', 'periodic', 'nearest', 'none',
    or any constant value like 0 or np.nan.
    If a list then each element must be a str, tuple or dict defining the
    boundary for the corresponding array in `args`.
    The default value is 'reflect'.
In dask.array those must be dicts whose keys are the axis numbers. For xarray we would want to allow the dimension names there.
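One way to picture the name-to-axis translation is the small sketch below. The helper depth_by_axis is hypothetical (neither xarray nor Dask provides it), and filling unlisted dimensions with 0 is an assumption of this sketch, not a statement about what either library does.

```python
def depth_by_axis(dims, depth):
    """Translate a depth dict keyed by dimension name into one keyed by axis number.

    `dims` is the ordered tuple of dimension names, as in DataArray.dims.
    Hypothetical helper; dimensions missing from `depth` get 0 (no overlap),
    which is an assumption of this sketch.
    """
    unknown = set(depth) - set(dims)
    if unknown:
        raise ValueError(f"depth refers to unknown dimensions: {sorted(unknown)}")
    return {axis: depth.get(name, 0) for axis, name in enumerate(dims)}


# Overlap by 2 along "x" and asymmetrically (1 before, 3 after) along "y".
axis_depth = depth_by_axis(("time", "y", "x"), {"x": 2, "y": (1, 3)})
```

The same mapping would presumably apply to a boundary dict, since it is keyed the same way.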
I'm not sure how to handle the DataArray labels for the boundary chunks (dask docs at https://docs.dask.org/en/latest/array-overlap.html#boundaries). For reflect / periodic I think things are OK; we could perhaps just use the label associated with that value. I'm not sure what to do for constants.
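To illustrate "use the label associated with that value", here is a rough sketch for a 1-D coordinate stored as a plain list. pad_labels is a hypothetical helper, and the reflect branch assumes the mirror repeats the edge element; whether that matches dask's reflect exactly should be checked. Constant boundaries are deliberately left out, since there is no obvious label to borrow.

```python
def pad_labels(labels, depth, boundary):
    """Extend coordinate labels by `depth` on each side by reusing existing labels.

    Hypothetical helper for illustration only.
    "periodic" wraps labels around from the opposite end.
    "reflect" mirrors labels at each edge, repeating the edge label
    (an assumption of this sketch).
    """
    labels = list(labels)
    if depth == 0:
        return labels
    if depth > len(labels):
        raise ValueError("depth cannot exceed the number of labels")
    if boundary == "periodic":
        left, right = labels[-depth:], labels[:depth]
    elif boundary == "reflect":
        left, right = labels[:depth][::-1], labels[-depth:][::-1]
    else:
        raise ValueError(f"no label rule for boundary={boundary!r}")
    return left + labels + right


periodic = pad_labels([10, 20, 30, 40], 2, "periodic")
reflected = pad_labels([10, 20, 30, 40], 2, "reflect")
```

Either rule produces duplicate labels along the padded dimension, so the resulting index is no longer unique.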
This issue about coordinate labels for boundaries exists with pad too: https://github.com/pydata/xarray/issues/3868
Can map_overlap just use DataArray.pad and we can fix things there?
Or perhaps we can expect users to add a call to pad before map_overlap?
Thanks for that link. I hope that map_overlap could use pad internally for the external boundaries.
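As a rough picture of "pad first, then overlap", the stdlib-only sketch below pads the external boundaries once, hands each block its neighbours' elements, and trims the overlap from each result. The names map_overlap_1d and smooth are hypothetical, and only a 1-D list with a constant pad value is handled; it is not how Dask or xarray implement this.

```python
def map_overlap_1d(func, values, chunks, depth, pad_value):
    """Apply `func` blockwise with `depth` elements of overlap on each side.

    Hypothetical 1-D illustration. `func` must return a list of the same
    length as its input block.
    """
    values = list(values)
    if sum(chunks) != len(values):
        raise ValueError("chunks must add up to len(values)")
    # Pad the external boundaries once, up front.
    padded = [pad_value] * depth + values + [pad_value] * depth
    out = []
    start = 0  # offset of the current block within `values`
    for size in chunks:
        # In `padded`, the block plus its halo spans start .. start + size + 2 * depth.
        block = padded[start:start + size + 2 * depth]
        result = func(block)
        out.extend(result[depth:depth + size])  # trim the halo
        start += size
    return out


def smooth(block):
    # Three-point moving sum; edge elements only see the neighbours they have.
    return [sum(block[max(i - 1, 0):i + 2]) for i in range(len(block))]


chunked = map_overlap_1d(smooth, [1, 2, 3, 4, 5], (2, 3), depth=1, pad_value=0)
whole = map_overlap_1d(smooth, [1, 2, 3, 4, 5], (5,), depth=1, pad_value=0)
```

With a depth at least as large as the reach of the function, the chunked result matches the single-block result, which is the property map_overlap is meant to guarantee.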
Yeah +1 for using pad instead. Had tried to get rid of map_overlap's padding and use da.pad in Dask as well ( https://github.com/dask/dask/pull/5052 ), but haven't had time to get back to that.
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
Would be good to keep this open
Indeed, this would be very useful in a great number of cases!
Very much in need of this one to enable satellite image filtering across blocks.
+1 that would be super useful.
Just as there are map_blocks and map_overlap methods for Dask Array, it would be useful to have equivalent methods for Xarray objects. This would make it easier to leverage duck typing to work with both Dask Arrays and Xarray objects.
Edit: Should add this came up a few times at the recent SciPy sprints.