pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev

Specify chunks in bytes #8021

mrocklin opened this issue 1 year ago (status: Open)

mrocklin commented 1 year ago

Is your feature request related to a problem?

I'm playing around with xarray performance and would like a way to easily tweak chunk sizes. I'm able to do this by backing out what xarray chooses in an open_zarr call and then providing the right chunks= argument. I'll admit, though, that I wouldn't mind giving Xarray a value like "1 GiB" and having it use that when determining "auto" chunk sizes.

Dask array does this in two ways. We can provide a value in chunks, like the following:

x = da.random.random(..., chunks="1 GiB")

We can also refer to a value in the Dask config:

In [1]: import dask

In [2]: dask.config.get("array.chunk-size")
Out[2]: '128MiB'
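
For reference, here is a small sketch (array shape chosen arbitrarily) of how that config value feeds into "auto" chunking on the Dask side:

import dask
import dask.array as da

# With chunks="auto", dask targets the configured byte budget per chunk.
with dask.config.set({"array.chunk-size": "1GiB"}):
    x = da.random.random((20_000, 20_000), chunks="auto")  # float64, ~3.2 GB total
    print(x.chunksize, x.nbytes // x.npartitions)  # each chunk stays under 1 GiB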

This is not very important (I'm unblocked) but I thought I'd mention it in case someone is looking for some fun work 🙂

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

jhamman commented 1 year ago

I like this suggestion! The trick will be to find a general way to map the chunk specification efficiently over the underlying storage backend's "preferred chunks" (e.g. #7948).

Note that you can get most of what you want today with the following syntax:

xr.open_dataset(..., chunks=None).chunk("1 GiB")

In the future, I think it would be quite nice if we supported:

xr.open_dataset(..., chunks="1 GiB")

where the resulting chunks were constructed to align, as closely as possible, with the chunks of the underlying dataset (e.g. Zarr, HDF5).
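
Not an actual implementation, but a rough sketch of what that alignment could look like, leaning on dask.array.core.normalize_chunks (whose "auto" mode takes a byte limit and a previous_chunks hint). It assumes the on-disk chunks are available in each variable's encoding under "chunks" (Zarr) or "chunksizes" (netCDF4); the "store.zarr" path is made up:

import xarray as xr
from dask.array.core import normalize_chunks

def bytes_aligned_chunks(ds, limit="1 GiB"):
    """Pick chunks near `limit` bytes, nudged toward multiples of the on-disk chunks."""
    chunks = {}
    for name, var in ds.variables.items():
        # Zarr records on-disk chunks under "chunks", netCDF4 under "chunksizes".
        disk_chunks = var.encoding.get("chunks") or var.encoding.get("chunksizes")
        normalized = normalize_chunks(
            "auto",
            shape=var.shape,
            dtype=var.dtype,
            limit=limit,
            previous_chunks=disk_chunks,  # "auto" tries to stay aligned with these
        )
        # Simplification: variables sharing a dim simply overwrite each other here.
        chunks.update(dict(zip(var.dims, normalized)))
    return chunks

ds = xr.open_dataset("store.zarr", engine="zarr", chunks=None)
ds = ds.chunk(bytes_aligned_chunks(ds, limit="1 GiB"))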

xref: #1440

denis-bz commented 1 year ago

Hi @mrocklin

I'm playing around with xarray performance and would like a way to easily tweak chunk sizes.

Me too. What happens in the null case, with NO chunking? What I want to do is copy a 2 GB slow.nc to fast.nc once, right after download, so that subsequent xr.open_dataset( "fast.nc", chunks=❓ ).to_numpy() calls are as fast as possible. A simple testbench along these lines

import time
import numpy as np
import xarray as xr

cube = np.random.random((8760, 247, 247)).astype(np.float32)  # (8760, 247, 247) float32, ~2039 MB
xar = xr.DataArray(cube, name="Randomcube")
# xar.encoding = ❓
nc = "/tmp/tmp.nc"
xar.to_netcdf(nc, format="NETCDF4", engine="netcdf4")

for chunks in ["auto", {}, None]:  # last fastest ??
    t0 = time.perf_counter()
    with xr.load_dataarray(nc, chunks=chunks) as load:  # load / open ?
        tonumpy = load.to_numpy()
    print(f"xr.load_dataarray( {chunks = }  {nc} ) .to_numpy: {time.perf_counter() - t0:.1f} s")

shows odd results on my old iMac. Can you comment (or move this to Discussions), or do you know of other testbenches? Thanks, cheers -- denis

jhamman commented 1 year ago

What happens in the null case, with NO chunking?

The first thing to consider is whether your netCDF4 file is chunked or contiguous on disk. If it is not chunked on disk, Xarray and Dask cannot do much to optimize partial array decompression. If it is chunked on disk, you'll likely find the best performance by aligning your read chunks with the chunks on disk. https://github.com/pydata/xarray/pull/7948 recently added support for chunks='auto' and chunks={}, so make sure you are using the latest release.
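
One way to check (a sketch, reusing the /tmp/tmp.nc path from the snippet above and the storage details the netcdf4 engine records in each variable's encoding):

import xarray as xr

with xr.open_dataset("/tmp/tmp.nc", engine="netcdf4") as ds:
    for name, var in ds.variables.items():
        if var.encoding.get("contiguous"):
            print(name, "is contiguous on disk")
        else:
            print(name, "is chunked on disk as", var.encoding.get("chunksizes"))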

Can you comment (or move this to Discussions), or do you know of other testbenches?

I'd like to leave this issue here because the feature described above still applies. I would encourage you to open a discussion for a more detailed conversation.

denis-bz commented 1 year ago

@jhamman, apart from chunking, this test case shows chunks="auto" to be much slower than {} and None on my old iMac. Does anyone have the time or interest to run it on a modern machine or two? Then I'd put the code in a gist and open a discussion.

Multilevel caches (SSD, L2, L1) vary a LOT, so many timing tests over 1 GB belong in the Journal of Irreproducible Results.