openclimatefix / ocf-data-sampler

A test repo to experiment with refactoring ocf_datapipes
MIT License

Investigate xarray chunking #30

Open · dfulu opened this issue 1 month ago

dfulu commented 1 month ago

When loading multiple zarr files with xarray, I have noticed that it often changes the chunk sizes, despite the zarrs having identical chunking when saved to disk. Often it will double the chunk size. This is wasteful, since it means we load twice as much data off disk just to access a small piece of it, which slows down our sampling significantly.

We should investigate this further and see if there is a better way to load multiple zarr files with xarray.

e.g. in the example below, xarray makes the chunks 27 times larger!

Note that I haven't printed the time dimension here. Where we open the two files individually, the time chunk size is 12; where we open them together, it becomes 36. With the x and y chunks also tripling from 100 to 300, each chunk grows by 3 × 3 × 3 = 27 times.

import xarray as xr

# Load two zarr files independently
path1 = "/mnt/disks/nwp_rechunk/sat/2020_nonhrv.zarr"
path2 = "/mnt/disks/nwp_rechunk/sat/2021_nonhrv.zarr"

ds1 = xr.open_zarr(path1)
ds2 = xr.open_zarr(path2)

# Check all coords except time have same values
assert (ds1.variable==ds2.variable).all()
assert (ds1.x_geostationary==ds2.x_geostationary).all()
assert (ds1.y_geostationary==ds2.y_geostationary).all()

# Check the chunk sizes are the same and print them
for dim in ["variable", "x_geostationary", "y_geostationary"]:
    assert ds1.chunks[dim] == ds2.chunks[dim]
    print(f"{dim}: {ds1.chunks[dim]}")
# Output:
variable: (11,)
x_geostationary: (100, 100, 100, 100, 100, 100, 14)
y_geostationary: (100, 100, 100, 72)
# Open the two files with our default settings
ds = xr.open_mfdataset([path1, path2],
    engine="zarr",
    concat_dim="time",
    combine="nested",
    chunks="auto",
    join="override"
)

# Print the chunk sizes of the combined dataset
for dim in ["variable", "x_geostationary", "y_geostationary"]:
    print(f"{dim}: {ds.chunks[dim]}")
# Output:
variable: (11,)
x_geostationary: (300, 300, 14)
y_geostationary: (300, 72)
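Side note: the chunking written to disk is recorded in each variable's encoding, which gives a quick cross-check against the dask chunks xarray ends up using. A minimal sketch, assuming a data variable named data (the real variable name may differ):

import xarray as xr

ds1 = xr.open_zarr(path1)

# On-disk zarr chunk shape, as recorded when the store was written
print(ds1["data"].encoding.get("chunks"))

# In-memory dask chunking that xarray is actually using
print(ds1["data"].chunks)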
dfulu commented 1 month ago

~This may have been the fault of some of the coordinates not being the same between the different zarr files. I am now unable to recreate the issue, so closing until it pops up again.~
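For anyone checking the coordinate hypothesis themselves: exact equality can fail on float coordinates that print identically, so a tolerance-based comparison is worth running alongside ==. A quick sketch, assuming the same ds1/ds2 as in the example above:

import numpy as np

# Compare coords both exactly and within floating-point tolerance
for coord in ["x_geostationary", "y_geostationary"]:
    exact = bool((ds1[coord] == ds2[coord]).all())
    close = np.allclose(ds1[coord].values, ds2[coord].values)
    print(f"{coord}: exact={exact}, allclose={close}")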

dfulu commented 1 month ago

Absolute whirlwind. I have recreated the issue and added an example to the description above.

Sukh-P commented 2 weeks ago

Great catch! And thanks for the example. I looked at the xarray docs again, which have some detail on what the different values of the chunks parameter do, and I guess the chunk sizes you get with "auto" are at the whim of dask's auto-chunking and whatever it deems ideal.
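If it is dask deciding, the relevant knob is presumably dask's array.chunk-size config (128 MiB by default), so one hedged workaround might be to lower that target before opening. A minimal sketch, reusing path1/path2 from the issue description:

import dask
import xarray as xr

# dask's "auto" chunking aims for chunks of roughly this size
print(dask.config.get("array.chunk-size"))  # "128MiB" by default

# Temporarily lower the target so "auto" stays closer to the on-disk chunks
with dask.config.set({"array.chunk-size": "32MiB"}):
    ds = xr.open_mfdataset(
        [path1, path2],
        engine="zarr",
        concat_dim="time",
        combine="nested",
        chunks="auto",
        join="override",
    )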

Not sure if this is helpful, but I recreated your example with a smaller amount of fake data, and setting chunks=None in open_mfdataset seemed to preserve the original chunk sizes. It would be good to double-check that, but in that case could a rule of thumb be: if you have already rechunked your data to optimise performance for your use case, avoid chunks="auto" and go with None instead; and if you haven't rechunked for some reason, then "auto" may still be a sensible choice? A sketch of that recreation is below.
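A minimal sketch of the recreation, with made-up paths and sizes. Note the xarray docs also offer chunks={}, which asks the backend for its preferred (i.e. on-disk) chunks, so that is worth trying alongside None:

import numpy as np
import pandas as pd
import xarray as xr

def make_fake_zarr(path, start):
    # 24 time steps, written with on-disk chunks of (time=12, x=25)
    ds = xr.Dataset(
        {"data": (("time", "x"), np.random.rand(24, 50))},
        coords={
            "time": pd.date_range(start, periods=24, freq="h"),
            "x": np.arange(50),
        },
    )
    ds.chunk({"time": 12, "x": 25}).to_zarr(path, mode="w")

make_fake_zarr("fake_2020.zarr", "2020-01-01")
make_fake_zarr("fake_2021.zarr", "2021-01-01")

ds = xr.open_mfdataset(
    ["fake_2020.zarr", "fake_2021.zarr"],
    engine="zarr",
    concat_dim="time",
    combine="nested",
    chunks=None,  # try chunks={} as well
    join="override",
)
print(ds.chunks)  # check whether the on-disk (time=12, x=25) chunking survives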

Or if you were thinking of a different way altogether of opening/loading multiple zarr files, I would be interested to see what that could look like!
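One possibility, just a sketch rather than something settled: skip open_mfdataset entirely, open each store with xr.open_zarr (which keeps the on-disk chunking), and concatenate along time:

import xarray as xr

paths = [path1, path2]  # the zarr stores from the issue description

# open_zarr keeps each store's on-disk chunking
datasets = [xr.open_zarr(p) for p in paths]

# Concatenate along time; join/compat/coords mirror the permissive
# settings used in the open_mfdataset call above
ds = xr.concat(
    datasets,
    dim="time",
    join="override",
    coords="minimal",
    compat="override",
)
print(ds.chunks)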