pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Implementing map_overlap #3147

Open jakirkham opened 5 years ago

jakirkham commented 5 years ago

Just as there are map_blocks and map_overlap methods for Dask Array, it would be useful to have equivalent methods for Xarray objects. This would make it easier to leverage duck typing to work with both Dask Arrays and Xarray objects.

Edit: I should add that this came up a few times at the recent SciPy sprints.
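For context, here is a minimal sketch of what map_overlap does on a plain Dask Array today. The function name neighbor_sum and the toy input are illustrative assumptions, not part of any proposal in this thread.

```python
import numpy as np
import dask.array as da


def neighbor_sum(block):
    # Add each element to its left and right neighbors.
    # np.roll wraps around at the block edges, but the wrapped values only
    # affect the overlap region, which map_overlap trims off afterwards.
    return block + np.roll(block, 1) + np.roll(block, -1)


x = da.arange(8, chunks=4)

# depth=1 shares one element with each neighboring block;
# boundary=0 pads the outer edges of the whole array with zeros.
result = x.map_overlap(neighbor_sum, depth=1, boundary=0, dtype=x.dtype).compute()
print(result)
```

An Xarray equivalent would presumably accept the same kind of function but hand it labeled blocks.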

dcherian commented 5 years ago

+1. The split_by_chunks method in this comment (https://github.com/pydata/xarray/issues/1093#issuecomment-259213382) would also be useful for more general per-chunk manipulation.

jakirkham commented 5 years ago

That sounds somewhat similar to the .blocks accessor in Dask Array (https://github.com/dask/dask/pull/3689). Maybe we should align on that as well?
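As a quick illustration of the .blocks accessor on a Dask Array (the array below is just a toy example):

```python
import dask.array as da

x = da.arange(10, chunks=4)  # chunks along axis 0: (4, 4, 2)

# Indexing .blocks selects whole chunks by block position rather than by
# element position; the result is still a lazy dask array.
first = x.blocks[0]
last = x.blocks[2]

print(x.numblocks, first.compute(), last.compute())
```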

jakirkham commented 5 years ago

Another approach for the split_by_chunks implementation would be...

import dask.array as da

def split_by_chunks(a):
    # Yield (slices, sub-array) pairs, one per chunk of the dask array `a`.
    for sl in da.core.slices_from_chunks(a.chunks):
        yield (sl, a[sl])

While a little bit more cumbersome to write, this could be implemented with .blocks and may be a bit more performant.

import numpy as np
import dask.array as da

def split_by_chunks(a):
    for i, sl in zip(np.ndindex(a.numblocks), da.core.slices_from_chunks(a.chunks)):
        yield (sl, a.blocks[i])

If the slices are not strictly needed, this could be simplified a bit more.

import numpy as np

def split_by_chunks(a):
    for i in np.ndindex(a.numblocks):
        yield a.blocks[i]

Admittedly, slices_from_chunks is an internal utility function, though it is unlikely to change. We could consider exposing it as part of the API if that is useful.
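A small self-contained check of the .blocks-based variant, assuming a 2-D toy array: each yielded block should match the data selected by its slices. Note that this imports slices_from_chunks from dask.array.core, which is internal and could move.

```python
import numpy as np
import dask.array as da
from dask.array.core import slices_from_chunks  # internal helper, not public API


def split_by_chunks(a):
    # np.ndindex and slices_from_chunks both iterate in C (row-major) order,
    # so block index i and slice tuple sl refer to the same chunk.
    for i, sl in zip(np.ndindex(a.numblocks), slices_from_chunks(a.chunks)):
        yield sl, a.blocks[i]


data = np.arange(24).reshape(4, 6)
x = da.from_array(data, chunks=(2, 3))  # 2 x 2 grid of blocks

pieces = [(sl, block.compute()) for sl, block in split_by_chunks(x)]
for sl, block in pieces:
    print(sl, block.shape)
```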

We could consider other things like making .blocks iterable, which could make this more friendly as well. Raised issue ( https://github.com/dask/dask/issues/5117 ) on this point.

jhamman commented 5 years ago

map_blocks went in as of #3276. We'll leave this open for the future work implementing map_overlap.
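For readers arriving later, a minimal sketch of map_blocks on a DataArray (the demean function and the toy data are illustrative assumptions). Each block is processed in isolation, which is exactly the limitation map_overlap would address:

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(8.0), dims="x", coords={"x": np.arange(8)}
).chunk({"x": 4})


def demean(block):
    # `block` is a DataArray holding one chunk, with its coordinate labels.
    return block - block.mean()


# template=arr tells xarray what the result looks like, so it does not
# have to infer it by calling demean on empty inputs.
result = arr.map_blocks(demean, template=arr).compute()
print(result.values)
```

The mean is taken per chunk here, not over the whole array, because no block sees its neighbors.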

TomAugspurger commented 4 years ago

I'm thinking through a map_overlap API right now. In dask, map_overlap requires a few extra arguments:

    depth: int, tuple, dict or list
        The number of elements that each block should share with its neighbors
        If a tuple or dict then this can be different per axis.
        If a list then each element of that list must be an int, tuple or dict
        defining depth for the corresponding array in `args`.
        Asymmetric depths may be specified using a dict value of (-/+) tuples.
        Note that asymmetric depths are currently only supported when
        ``boundary`` is 'none'.
        The default value is 0.
    boundary: str, tuple, dict or list
        How to handle the boundaries.
        Values include 'reflect', 'periodic', 'nearest', 'none',
        or any constant value like 0 or np.nan.
        If a list then each element must be a str, tuple or dict defining the
        boundary for the corresponding array in `args`.
        The default value is 'reflect'.

In dask.array, when those are dicts, the keys must be axis numbers. For xarray, we would want to allow dimension names there.
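One way this translation could look, as a rough sketch only. The helper name depth_by_axis is hypothetical and not part of xarray or dask; it just maps a name-keyed dict onto the axis-keyed dict that dask expects.

```python
def depth_by_axis(dims, depth):
    """Translate {dimension name: depth} into {axis number: depth}.

    `dims` is the tuple of dimension names of the array, in axis order.
    Dimensions missing from `depth` get a depth of 0 (no overlap).
    """
    unknown = set(depth) - set(dims)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return {axis: depth.get(name, 0) for axis, name in enumerate(dims)}


print(depth_by_axis(("time", "y", "x"), {"y": 1, "x": 2}))
```

The same idea would apply to the boundary argument. In real code, DataArray.get_axis_num could do the name lookup.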

I'm not sure how to handle the DataArray labels for the boundary chunks (dask docs at https://docs.dask.org/en/latest/array-overlap.html#boundaries). For reflect / periodic I think things are OK; we could perhaps just use the label associated with that value. I'm not sure what to do for constants.

dcherian commented 4 years ago

This issue about coordinate labels for boundaries exists with pad too: https://github.com/pydata/xarray/issues/3868

Can map_overlap just use DataArray.pad and we can fix things there?

Or perhaps we can expect users to add a call to pad before map_overlap?

TomAugspurger commented 4 years ago

Thanks for that link. I hope that map_overlap could use pad internally for the external boundaries.
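A rough sketch of that idea under simplifying assumptions: a 1-D DataArray without coordinate labels (which sidesteps the label question), DataArray.pad for the external boundary, and dask's map_overlap with boundary="none" for the internal overlaps. The function smooth and the variable names are illustrative only.

```python
import numpy as np
import xarray as xr


def smooth(block):
    # Three-point moving average. np.roll wraps around at the block edges,
    # but those positions are either trimmed overlap or external padding.
    return (np.roll(block, 1) + block + np.roll(block, -1)) / 3


depth = 1
arr = xr.DataArray(np.arange(8.0), dims="x").chunk({"x": 4})

# 1. Pad the external boundary with xarray, then rechunk so that no chunk
#    is smaller than the overlap depth.
padded = arr.pad(x=depth, mode="reflect").chunk({"x": 5})

# 2. Let dask exchange values between neighboring chunks only;
#    boundary="none" means the outer edges are neither padded nor trimmed.
out = padded.data.map_overlap(smooth, depth=depth, boundary="none", dtype=float)

# 3. Drop the external padding again and reattach the original metadata.
result = arr.copy(data=out[depth:-depth]).compute()
print(result.values)
```

With coordinates present, step 1 is where the label problem from #3868 would show up.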

jakirkham commented 4 years ago

Yeah +1 for using pad instead. Had tried to get rid of map_overlap's padding and use da.pad in Dask as well ( https://github.com/dask/dask/pull/5052 ), but haven't had time to get back to that.

stale[bot] commented 2 years ago

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

jakirkham commented 2 years ago

Would be good to keep this open

j2bbayle commented 2 years ago

Indeed, this would be very useful in a great number of cases!

jdoblas commented 1 year ago

Very much in need of this one to enable satellite image filtering across blocks.

odhondt commented 1 month ago

+1 that would be super useful.