pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

A basic default ChunkManager for arrays that report their own chunks #8733

Open hmaarrfk opened 9 months ago

hmaarrfk commented 9 months ago

Is your feature request related to a problem?

I'm creating duck arrays for various file-backed data structures of mine that are naturally "chunked", i.e. different parts of the array may live in completely different files.

Given these "chunks" and the "strides", algorithms can make better decisions about how to iterate over the data in a convenient manner.

For example, an MP4 file's chunks may be delimited by I-frames, while images stored in a TIFF may be delimited by pages.

So for me, chunks are not so useful for parallel computing, but more for computing locally and choosing the appropriate way to iterate through large arrays (TB of uncompressed data).
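
(For illustration only, a minimal sketch of such a duck array whose .chunks reflect the natural file boundaries; the class name and parameters here are hypothetical, not the actual implementation.)

import numpy as np

class TiffPagesArray:
    """Hypothetical read-only duck array backed by a multi-page TIFF,
    where each group of pages forms one natural chunk."""

    def __init__(self, shape, dtype, pages_per_chunk=1):
        self.shape = shape            # e.g. (n_frames, height, width)
        self.dtype = np.dtype(dtype)
        self._pages_per_chunk = pages_per_chunk

    @property
    def chunks(self):
        # One chunk per group of pages along axis 0, full extent on the other axes.
        n_full, rem = divmod(self.shape[0], self._pages_per_chunk)
        axis0 = (self._pages_per_chunk,) * n_full + ((rem,) if rem else ())
        return (axis0,) + tuple((size,) for size in self.shape[1:])

    def __array__(self, dtype=None):
        # Materialize everything; for TB-scale data, callers should instead
        # iterate chunk-by-chunk using .chunks.
        raise NotImplementedError("read the pages from disk here")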

Describe the solution you'd like

I think a default ChunkManager could simply implement compute as np.asarray and act as a catch-all for all other array types.

Advanced users could then go in and reimplement their own chunkmanager, but I was unable to use my duck arrays that included a chunks property because they weren't associated with any chunk manager.

Something as simple as:

diff --git a/xarray/core/parallelcompat.py b/xarray/core/parallelcompat.py
index c009ef48..bf500abb 100644
--- a/xarray/core/parallelcompat.py
+++ b/xarray/core/parallelcompat.py
@@ -681,3 +681,26 @@ class ChunkManagerEntrypoint(ABC, Generic[T_ChunkedArray]):
         cubed.store
         """
         raise NotImplementedError()
+
+
+class DefaultChunkManager(ChunkManagerEntrypoint):
+    def __init__(self) -> None:
+        self.array_cls = None
+
+    def is_chunked_array(self, data: Any) -> bool:
+        return is_duck_array(data) and hasattr(data, "chunks")
+
+    def chunks(self, data: T_ChunkedArray) -> T_NormalizedChunks:
+        return data.chunks
+
+    def compute(self, *data: T_ChunkedArray | Any, **kwargs) -> tuple[np.ndarray, ...]:
+        return tuple(np.asarray(d) for d in data)
+
+    def normalize_chunks(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def from_array(self, *args, **kwargs):
+        raise NotImplementedError()
+
+    def apply_gufunc(self, *args, **kwargs):
+        raise NotImplementedError()

Describe alternatives you've considered

I created my own chunk manager, with my own chunk manager entry point.

Kinda tedious...

Additional context

It seems that this is related to: https://github.com/pydata/xarray/pull/7019

hmaarrfk commented 9 months ago

Even removing the "abstractmethod" decorator would be welcome ;)

TomNicholas commented 9 months ago

Hi @hmaarrfk ! I'm excited to see someone is actually using the ChunkManager system 😀

I'm curious to hear more about your use case - do you have the code for your custom ChunkManager somewhere?

I'm not totally sure I understand what the purpose of a "default" chunkmanager would be - IIUC you're trying to expose your underlying array's .chunks attribute but treat it like a normal duckarray in all other respects?

Even removing the "abstractmethod" decorator would be welcome ;)

Which one? Or you mean all of them except literally just __init__ and chunks?

hmaarrfk commented 9 months ago

do you have the code for your custom ChunkManager somewhere?

I've provided the patch that is essentially my code for the chunk manager ;).

I am a little stricter and check with isinstance instead of hasattr, because I don't want to override the dask chunk manager you have all written. But if you were to have a "default" chunkmanager, it could be more forgiving and use the hasattr check, since it would be the last chunkmanager to get checked.

IIUC you're trying to expose your underlying array's .chunks attribute but treat it like a normal duckarray in all other respects?

100% correct. It's my own implementation of "lazy access" to TIFFs and MP4s as read-only arrays. It only implements the features I need, so it isn't really feature-complete. I've found it necessary to implement this myself for performance reasons.

Or you mean all of them except literally just __init__ and chunks?

I mean all of them except __init__.

edit: reordered some paragraphs

TomNicholas commented 9 months ago

Okay overall this seems like a good idea to me! Basically allowing for the idea that there are really three types of arrays: un-chunked duck arrays, chunked duck arrays where the chunks might be read but won't be changed, and re-chunkable duck arrays where the re-chunking happens via some (probably parallel) processing framework.

I am a little stricter and check with isinstance instead of hasattr, because I don't want to override the dask chunk manager you have all written. But if you were to have a "default" chunkmanager, it could be more forgiving and use the hasattr check, since it would be the last chunkmanager to get checked.

Sorry I'm not quite following this part. You're talking about inside is_chunked_array? I don't think anything should be relying on the order of which chunkmanager is checked against first.

TomNicholas commented 9 months ago

Also note there are two senses of the idea of a "default" chunkmanager to keep track of here: (1) what is used to request the .chunks attribute of a duckarray of unspecified type, and (2) which chunkmanager to use to .chunk an array that was not previously chunked (this has to be dask for backwards compatibility reasons).

hmaarrfk commented 9 months ago

Sorry I'm not quite following this part.

This is my code ;)

import numpy as np
from xarray.core.parallelcompat import ChunkManagerEntrypoint
from xarray.core.types import T_NormalizedChunks
from ._my_array import MyArray1, MyArray2

class MyChunkManager(ChunkManagerEntrypoint["MyArrayEntryPoint"]):
    array_cls: None
    available: bool = True

    def __init__(self) -> None:
        self.array_cls = (MyArray1, MyArray2)

    def chunks(self, data) -> T_NormalizedChunks:
        return data.chunks

    def compute(self, *data, **kwargs) -> tuple[np.ndarray, ...]:
        return tuple(np.asarray(d) for d in data)

    def apply_gufunc(self, *args, **kwargs):
        raise NotImplementedError()

    def from_array(self, *args, **kwargs):
        raise NotImplementedError()

    def normalize_chunks(self, *args, **kwargs):
        raise NotImplementedError()

Also note there are two senses of the idea of a "default" chunkmanager to keep track of here

You are correct. To be explicit, my problem is that implementing a whole chunkmanager, one that allows for rechunking and many other fancy operations, seems like overkill when I (or other users) just want to expose a .chunks property. If I add a .chunks property without building out my ChunkManager machinery, I get the following traceback:

  File "/home/mark/git/xarray/xarray/core/dataset.py", line 842, in load
    return dataset.isel(isel).compute()
  File "/home/mark/git/xarray/xarray/core/dataset.py", line 1011, in compute
    chunkmanager = get_chunked_array_type(*lazy_data.values())
  File "/home/mark/git/xarray/xarray/core/parallelcompat.py", line 142, in get_chunked_array_type
    return new.load(**kwargs)
  File "/home/mark/git/xarray/xarray/core/dataset.py", line 842, in load
    raise TypeError(
TypeError: Could not find a Chunk Manager which recognises type <class 'mark._my_array.MyArray1'>

I'm sorry for jumping to a proposal for a solution. I'm trying to get better at framing my asks overall (even beyond open source).

hmaarrfk commented 9 months ago

The other reason I think this is worth xarray considering is that I was pretty close to not exposing .chunks at all, because understanding the entrypoints is quite an "advanced Python packaging" thing.

It is clever and avoids circular dependencies and allows for plugins to be built.

But for a developer who writes Python code rather than packaging code, it is quite difficult...

TomNicholas commented 9 months ago

I'm sorry for jumping to a proposal for a solution. I'm trying to get better at framing my asks overall (even beyond open source).

No this is great! Some initiative in thinking of possible solutions is always appreciated :)


So there's arguably a more general issue here: If you have a new duckarray type, xarray will happily wrap that and use it without you explicitly telling xarray what type it is (hence the term "duck-typed".) But currently if you have a duckarray type that also implements a .chunks attribute, xarray requires you to explicitly register that type using the chunkmanager system. This is unnecessarily restrictive for the case of arrays that are just duckarrays + .chunks.

The reason we have chunkmanagers at all is for when chunked computation requires calling a function like apply_gufunc, for which we need to know if the underlying type is dask or cubed or whatever, so that we can use the correct import of dask.apply_gufunc or cubed.apply_gufunc. But your use case doesn't need this complexity.


One solution would be to just remove .chunks from the ChunkManager entirely, and instead call .chunks on the underlying array directly. This would work, but it kind of breaks the nice way we now have all chunk-related stuff go through the ChunkManager class. Another disadvantage is that if you had an array with an attribute that behaves like .chunks but isn't actually called .chunks, we couldn't redirect it.
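
For concreteness, a rough sketch of that first option, with a hypothetical wrapper standing in for xarray's Variable (not the actual implementation):

class VariableLike:
    """Hypothetical wrapper that forwards .chunks straight to the wrapped array."""

    def __init__(self, data):
        self._data = data

    @property
    def chunks(self):
        # None for plain in-memory arrays, the duck array's own chunks otherwise;
        # no ChunkManager lookup is involved.
        return getattr(self._data, "chunks", None)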

Another solution is similar to what you're suggesting: define a default ChunkManager (called DuckArrayChunkManager or something) that forwards .chunks on but has no other methods defined. However this needs to work for any compliant array type, not just for MyArray. I guess that's what you were referring to here:

if you were to have a "default" chunkmanager it could be more forgiving with the hasattr check since it would be the last chunkmanager to get checked.

TomNicholas commented 9 months ago

FYI @andersy005

TomNicholas commented 9 months ago

@hmaarrfk would you be interested in submitting a PR to add the "default" chunk manager?

dcherian commented 9 months ago

Is your ask simply that we forward .chunks to .data.chunks, if available? That seems OK and a lot simpler than adding a "default chunk manager"

TomNicholas commented 9 months ago

Yeah we might have fallen down a rabbit hole here...

hmaarrfk commented 9 months ago

Is your ask simply that we forward .chunks to .data.chunks, if available? That seems OK and a lot simpler than adding a "default chunk manager"

What kind of property could one use to identify arrays that support distributed computing?

.chunks was used for that purpose.

TomNicholas commented 9 months ago

dask and cubed both have a .compute method, so that's an option

dcherian commented 9 months ago

What kind of property could one use to identify arrays that support distributed computing?

Sorry if I'm being dense, but I thought all you need is to access the underlying .data.chunks. If you need the computational pieces then "chunk manager" is the way to go. If you want to read into memory with np.asarray, .as_numpy should probably work.

TomNicholas commented 9 months ago

I thought all you need is to access the underlying .data.chunks. If you need the computational pieces then "chunk manager" is the way to go.

Yes, but I believe @hmaarrfk is asking about what to do with the existing pycompat.is_chunked_array function, which currently checks for the existence of a .chunks attribute. Currently it will trigger on his chunked array and start involving the chunkmanager system, and if we forward .chunks to data.chunks but don't change this function it will still involve the chunkmanager system even though we don't want to. So we need to replace it with some other kind of check that can distinguish between an "inert" chunked array and an array that needs the computational pieces.

hmaarrfk commented 9 months ago

Sorry if I'm being dense, but I thought all you need is to access the underlying .data.chunks. If you need the computational pieces then "chunk manager" is the way to go. If you want to read into memory with np.asarray, .as_numpy should probably work.

No need to apologize. The thread got kinda long.

All I need is for xarray not to complain if I add a .chunks property to my duck array. Let me create a test for this; it may make things clearer.

I'm trying to do so without destroying the pluggable backends you all built for dask/cubed.

dcherian commented 9 months ago

Ah thanks for explaining.

maybe hasattr(array, "chunks") and type(array) in chunkmanager_handled_types or something?

TomNicholas commented 9 months ago

Ah thanks for explaining.

Of course.

maybe hasattr(array, "chunks") and type(array) in chunkmanager_handled_types or something?

There currently is no global chunkmanager_handled_types, instead if pycompat.is_chunked_array(data) returns True, then the chunkmanagers become involved and start checking if they each recognise the specific type. This seemed sensible before (involve chunkmanagers if and only if the array has .chunks) but perhaps this is a bad design, and we should consolidate all those checks into a single check against a global list of CHUNKMANAGER_HANDLED_TYPES.
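
To illustrate (purely a sketch, not xarray's actual code), the trigger would become "some registered manager claims this type" rather than "the object has a .chunks attribute":

def needs_chunkmanager(data, registered_managers) -> bool:
    # Only enter the chunked-computation code path if a registered manager
    # explicitly recognises the type; a duck array that merely exposes a
    # .chunks attribute is passed through untouched.
    return any(manager.is_chunked_array(data) for manager in registered_managers)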

hmaarrfk commented 9 months ago

See the suggestion in the PR for how to avoid the global.

It seems like a solvable problem if changes to the function signature are OK.

hmaarrfk commented 8 months ago

Sorry for not responding for a while.

Another thing I learned from my usage is that a definition of:

    def store(
        self,
        sources: Any,
        targets: Any,
        **kwargs: dict[str, Any],
    ) -> Any:
        # basic error checking for things I don't support, like regions

        for source, target in zip(sources, targets):
            for s in iterchunk_slices(source.shape, source.chunks):
                target[s] = source[s]

was necessary to get things to serialize to a netCDF file.
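
(iterchunk_slices above is a private helper; for readers, a plausible sketch of such a helper, yielding one tuple of slices per chunk, might look like the following. This is not the actual implementation.)

import itertools

def iterchunk_slices(shape, chunks):
    # Yield one tuple of slices per chunk, given normalized chunks such as
    # ((2, 2, 1), (4, 4)); `shape` is accepted to match the call above but is
    # not needed for the slice arithmetic itself.
    per_axis = []
    for axis_chunks in chunks:
        offsets = [0, *itertools.accumulate(axis_chunks)]
        per_axis.append([slice(a, b) for a, b in zip(offsets[:-1], offsets[1:])])
    yield from itertools.product(*per_axis)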

The fact that my implementation of store was so narrow in scope made me less enthusiastic to push this to a broader context.

TomNicholas commented 6 months ago

My turn to apologise for not responding for a while!

All i need is for xarray not to complain if I add a .chunks property to my duck array.

I realised I have exactly the same problem as you do in this other library - see https://github.com/TomNicholas/VirtualiZarr/issues/114.

My understanding of your PR is that it just changes xarray's behaviour upon encountering an array with chunks that it doesn't know how to compute: instead of raising an error, it silently passes the array through.

This would solve both our errors but the problem with this is that if you try to compute a dask/cubed array without the corresponding library installed, the correct chunkmanager won't be defined, and a non-loaded version of an array which definitely should be loaded will be passed further through the code.

An alternative suggestion would be to change the check from looking for chunks to looking for a compute method. That is really the property we care about here, the array being computable, and it would cover dask and cubed arrays. If the array doesn't have a compute method, we assume it is just a funny type of (non-computable) duck array and pass it through. (Of course this means that ChunkManager is now a really misleading name... 😬)
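
A sketch of what that alternative check could look like (illustrative only):

def is_computable_array(data) -> bool:
    # dask.array.Array and cubed.Array both define .compute(); a read-only
    # chunked duck array without it would simply be passed through.
    return callable(getattr(data, "compute", None))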

hmaarrfk commented 6 months ago

it might be good to get usage feedback from one more person.

Silently failing would have made my application's memory grow to 500 GB and then kill itself, so the error message was useful.

We could link them to this discussion and learn more from another real usage case.

My chunk manager will likely persist in my codebase for the foreseeable future.

TomNicholas commented 4 months ago

Okay so I just hit this issue again in virtualizarr but for .rechunk rather than .chunks (see https://github.com/zarr-developers/VirtualiZarr/pull/199#issuecomment-2244586390).

In hindsight I now think that this taxonomy of mine is not quite right

there are really three types of arrays: un-chunked duck arrays, chunked duck arrays where the chunks might be read but won't be changed, and re-chunkable duck arrays where the re-chunking happens via some (probably parallel) processing framework.

and that the ChunkManagerEntrypoint ABC I added to xarray to deal with this conflates multiple problems.

Instead we have arrays, which may have chunks (and may be rechunkable), and separately may be computable. One can imagine chunked arrays with no special computing semantics beyond defining __array__ (i.e. @hmaarrfk's chunked image data structure), chunked but non-computable arrays (virtualizarr ManifestArray objects), but you might also imagine computable but non-chunked arrays (e.g. ArkoudaArray cc @jeremiah-corrado or potentially ramba arrays). "Computable but not chunked" also arguably applies to cupy, pint, and sparse, all of which are currently special-cased inside xarray's .values/to_numpy methods, which get called by .compute.

One suggestion for how to deal with this: pass all .chunk and .rechunk calls onto the underlying type automatically (so not requiring any entrypoint to be registered), rename xarray's existing ChunkManager to ComputeManager, and any call involving computing an array first looks for __array__ then checks for the presence of the corresponding ComputeManager.

This suggestion is subtly different to either what @hmaarrfk or @dcherian suggested above. Crucially it means that (a) rechunking wrapped arrays doesn't require defining an entrypoint but (b) computing with custom semantics always does require a corresponding entrypoint, and if there isn't the correct ComputeManager defined xarray would fail loudly, clearly and eagerly.

This would imply minor changes to cubed-xarray, which would define a CubedManager(ComputeManagerEntryPoint) instead of a CubedManager(ChunkManagerEntryPoint), and potentially the addition of a CupyManager to cupy-xarray (and similar for pint-xarray and sparse). Though in reality we don't need to rush to remove the existing special-cases inside xarray. Cupy/sparse/pint also don't need all the blockwise/apply_gufunc stuff in the ComputeManager, so those can just be passed through, probably by a default implementation on the ComputeManagerEntrypoint superclass ABC.

Another interesting possible use case might be a JAXComputeManager, which could implement apply_gufunc in terms of jax.pmap.
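
A rough sketch of what that split might look like (names and defaults here are hypothetical, not an agreed API):

from __future__ import annotations

from abc import ABC

import numpy as np

class ComputeManagerEntrypoint(ABC):
    """Hypothetical successor to ChunkManagerEntrypoint: only computation
    semantics live here; .chunk/.rechunk calls are forwarded to the array."""

    array_cls: type | tuple[type, ...]

    def compute(self, *data, **kwargs) -> tuple[np.ndarray, ...]:
        # Default: simple duck arrays are materialized through __array__.
        return tuple(np.asarray(d) for d in data)

    def apply_gufunc(self, func, signature, *args, **kwargs):
        # Parallel frameworks (dask, cubed) would override this; plain duck
        # arrays and cupy/pint/sparse never need it.
        raise NotImplementedError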

it might be good to get usage feedback from one more person.

What do you think @keewis @dcherian @negin513 @tomwhite @alxmrs @slevang ? Is this a reasonable way to abstract out these special-cases? Or is cupy/pint/sparse defining an entrypoint just so that their one final compute method works actually massive overkill?

(Alternative idea for cupy - define to_device from the array API standard and xarray calls that if defined?)

Note also that this is related to how the array API standard does not define a specific method for materializing lazy arrays (https://github.com/data-apis/array-api/issues/748).

dcherian commented 4 months ago

I think the error here was that the ChunkManager ABC is too greedy and handles .chunks and .rechunk. We can easily fix that.

Or is cupy/pint/sparse defining an entrypoint just so that their one final compute method works actually massive overkill?

Yes.

You are also conflating "compute" with "to_numpy". And I don't think it's a fundamental property --- that VirtualiZarr does not implement it seems like an implementation detail.

"Compute" in Xarray means load lazy arrays to memory -- here lazy arrays are wrappers around on-disk data, or dask/cubed arrays. Usually, this means load as numpy, but dask could end up "computing" to cupy or sparse instead. With Zarr-to-GPU support, even Xarray's lazy loading classes will resolve to cupy.


any call involving computing an array first looks for __array__

No, this would be quite the regression. .compute on an "on-device" or loaded array is idempotent today. Your proposal would error for cupy & sparse at least. We want this as a fallback, and it is implemented that way (see to_duck_array). This emphasizes the previous point: compute on a duck array that isn't recognized as a chunked array is left as-is; anything else is coerced to numpy.
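
A paraphrase of that fallback in code (illustrative; the duck-array test is simplified and this is not the actual to_duck_array):

import numpy as np

def to_duck_array_sketch(data, registered_managers):
    # Recognized chunked arrays are computed through their manager ...
    for manager in registered_managers:
        if manager.is_chunked_array(data):
            return manager.compute(data)[0]
    # ... other duck arrays are left as-is, so .compute stays idempotent ...
    if hasattr(data, "__array_function__") or hasattr(data, "__array_namespace__"):
        return data
    # ... and anything else is coerced to numpy.
    return np.asarray(data)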


IIUC this mostly gets resolved if the ChunkManager is less greedy and doesn't trigger on the existence of .chunks but is instead triggered on matching an allowlist of registered chunk array types.

PS: Apparently the array API defines asarray, so we could generalize to_numpy and as_numpy to to_array(xp) and as_array(xp).

TomNicholas commented 4 months ago

thanks @dcherian

I think the error here was that the ChunkManager ABC is too greedy and handles .chunks and .rechunk. We can easily fix that.

:+1:

that VirtualiZarr does not implement it seems like an implementation detail.

"Compute" in Xarray means load lazy arrays to memory

I don't think I agree - in VirtualiZarr there is no definition of "Compute" that would return a numpy-like in-memory array to still be wrapped by xarray. The closest operation to "compute" in virtualizarr writes json/parquet files to disk. (Though if we integrated virtualizarr upstream in zarr-python then it would be a different story - see https://github.com/pydata/xarray/issues/9281.)

No, this would be quite the regression. .compute on a "on-device" or loaded array is idempotent today. Your proposal would error for cupy & sparse at least.

You're right, that was a sloppy description on my part.

anything else is coerced to numpy.

But currently xarray has internal special-casing to know how to coerce pint/sparse/cupy to numpy, as they all use different semantics (.magnitude/.todense/.get). That xarray has to keep track of these types in a non-extensible way is a complaint that's off-topic for this particular issue, but it could still be solved via the chunkmanager system (if we wanted to do that; I'm not saying it's worth the added complexity).
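
Roughly, that special-casing amounts to something like this (illustrative only, not the actual code in xarray's to_numpy/values machinery):

import numpy as np

def coerce_to_numpy(data):
    if hasattr(data, "magnitude"):    # pint.Quantity
        data = data.magnitude
    if hasattr(data, "todense"):      # sparse.COO
        data = data.todense()
    if type(data).__module__.startswith("cupy"):
        data = data.get()             # cupy.ndarray
    return np.asarray(data)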

PS: Apparently the array api defines asarray so we could generalize to_numpy and as_numpy to to_array(xp) and as_array(xp)

Unless I've misunderstood something I don't think we can do that, because the return type of as_array is not np.ndarray, it's just the duck array xp.Array. This leaves the special-casing inside xarray's to_numpy unsolved.

IIUC this mostly gets resolved if the ChunkManager is less greedy and doesn't trigger on the existence of .chunks but is instead triggered on matching an allowlist of registered chunk array types.

I agree that this would probably close this particular issue though. Currently it triggers on seeing .chunks and then checks against an allowlist, it would be easy to change that, without any need for a "default" chunkmanager.

dcherian commented 4 months ago

I don't think I agree - in VirtualiZarr there is no definition of "Compute" that would return a numpy-like in-memory array to still be wrapped by xarray. ... (Though if we integrated virtualizarr upstream in zarr-python then it would be a different story - see https://github.com/pydata/xarray/issues/9281.)

There is no implemented definition of compute, in the sense of "give me concrete values from this abstract array". You make my point in the second line -- you could return concrete values, it just hasn't been implemented yet.

But currently xarray has internal special-casing to know how to coerce pint/sparse/cupy to numpy, as they all use different semantics

Not just that but also they actively error on __array__ for historical reasons. It'd be nice if that weren't the case.

the return type of as_array is not np.ndarray,

That's what I meant by "generalize". It seems like we can attempt to allow arbitrary array-array conversions.

Currently it triggers on seeing .chunks and then checks against an allowlist, it would be easy to change that, without any need for a "default" chunkmanager.

Nice, let's do that!