save_merged_data produces loads of files

qq23840 commented 2 months ago

Perhaps a silly issue exposing my lack of understanding of zarr, but when running an inversion with save_merged_data = True, the resulting *_merged-data.pickle.zarr directory contains a huge number of files (sometimes upwards of 25k). It's not big in terms of space, but it means I fairly quickly hit my file-count limit on bp1.

I don't actually need to save the merged data, so I will just set it to False for now, but more generally, is there a way of making these output folders contain fewer files? I know #92 allowed for more options in terms of the merged data, but it seems to default to this zarr format.

brendan-m-murphy commented 2 months ago

Each "chunk" in a zarr store is saved as a separate file. If you use xr.open_zarr("*_merged-data.pickle.zarr") you can see how many chunks there are (without loading the data into memory). I guess the xarray defaults might be creating a lot of chunks. I can change that.
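As a rough sketch of why chunk sizes matter here (this is not code from openghg_inversions; the function names and the array shape are made up for illustration): with one file per chunk, the file count for an array is the product over dimensions of ceil(dimension length / chunk length), so small chunks along a long dimension add up quickly.

```python
import math
import os


def expected_chunk_files(shape, chunks):
    """Chunk files for one array: product of ceil(dim / chunk) over all dimensions."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


def count_files(store_path):
    """Count every regular file under a directory store (chunks plus metadata)."""
    return sum(len(files) for _, _, files in os.walk(store_path))


# Hypothetical (time, lat, lon) array; only the time dimension is split up.
shape = (8760, 100, 120)
small_chunks = expected_chunk_files(shape, (24, 100, 120))   # daily chunks of hourly data
large_chunks = expected_chunk_files(shape, (720, 100, 120))  # roughly monthly chunks
print(small_chunks, large_chunks)
```

Running count_files on the *_merged-data.pickle.zarr directory gives the actual total to compare against the file limit.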

I can make it possible to specify the output file type, so it could be saved as netcdf instead.

@gareth-j any suggestions? Is this something the nested directory store helps with?

gareth-j commented 2 months ago

Is this cache something that needs to be updated, or is it a one-time thing? If it's just written once and then read, you could maybe use a ZipStore instead.


brendan-m-murphy commented 2 months ago

@qq23840 for now, maybe set "save_merged_data" to false, and if you want to hold onto the old data, make it into a zip file.
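A minimal sketch of the "make it into a zip file" step, using only the standard library. The helper name zip_store and the example path are hypothetical, and it assumes the store is an ordinary directory on disk:

```python
import shutil
import zipfile
from pathlib import Path


def zip_store(store_dir, remove_original=False):
    """Pack a directory store into <store_dir>.zip, so it counts as a single file."""
    store_dir = Path(store_dir).resolve()
    n_files = sum(1 for p in store_dir.rglob("*") if p.is_file())

    # The archive is written next to the directory; its contents sit at the zip root.
    archive = Path(shutil.make_archive(str(store_dir), "zip", root_dir=store_dir))

    # Check that every file made it into the archive before deleting anything.
    with zipfile.ZipFile(archive) as zf:
        n_zipped = sum(1 for name in zf.namelist() if not name.endswith("/"))
    if n_zipped != n_files:
        raise RuntimeError(f"expected {n_files} files in {archive}, found {n_zipped}")

    if remove_original:
        shutil.rmtree(store_dir)
    return archive


# Hypothetical usage:
# zip_store("output/example_merged-data.pickle.zarr", remove_original=True)
```

The original directory is only removed when remove_original=True and the file counts match, so a failed or partial archive does not cost you the data.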

@gareth-j's suggestion should fix this problem, but I might not have time to add the fix before next Thursday (I'm away until then, although I'm trying to fix some other inversion problems at the moment...)

openghg / openghg_inversions

save_merged_data produces loads of files #108