google-research / arco-era5

Recipes for reproducing Analysis-Ready & Cloud Optimized (ARCO) ERA5 datasets.
https://cloud.google.com/storage/docs/public-datasets/era5
Apache License 2.0

Explanation for chunking #28

Closed: loliverhennigh closed this issue 11 months ago

loliverhennigh commented 1 year ago

Hey, not a cloud expert, but I'm wondering about the rationale for the chunking you have chosen. I see that the Zarr stores have rather large chunk sizes. For example, the model-level variables have a chunk size of `dask.array<chunksize=(48, 137, 410240)>`, which works out to about 10 GB per chunk. My understanding was that a good chunk size for object storage is on the order of megabytes. Wouldn't it make more sense to use a chunking of (1, 1, 410240), for example?
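
(For context, a minimal sketch of how to check the on-disk chunk shape and the resulting per-chunk size. The store path is a placeholder, not necessarily the exact model-level store being discussed; substitute whichever ARCO-ERA5 Zarr store you are inspecting.)

```python
import xarray as xr

# Placeholder store path; see the ARCO-ERA5 README for the actual catalog.
STORE = "gs://gcp-public-data-arco-era5/co/model-level-moisture.zarr"

ds = xr.open_zarr(STORE)
for name, var in ds.data_vars.items():
    chunks = var.encoding.get("chunks")  # chunk shape as written on disk
    if chunks:
        n_bytes = var.dtype.itemsize
        for c in chunks:
            n_bytes *= c
        # e.g. (48, 137, 410240) at float32 -> ~10.8 GB per chunk
        print(f"{name}: chunks={chunks}, ~{n_bytes / 1e9:.1f} GB per chunk")
```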

alxmrs commented 1 year ago

You're right, the current chunking scheme is quite large. It would definitely be an improvement to split each chunk per level, as you have suggested. We can prioritize this improvement for working with the CO version of the data (our current focus has been on Phase 2). I'll leave this issue open to track the work.
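
(A minimal sketch of the per-level rechunking being discussed, assuming xarray/dask. The store path and the dimension names `time` and `hybrid` are illustrative assumptions; check `ds.dims` for the real ones.)

```python
import xarray as xr

# Placeholder path; substitute the model-level store you are rechunking.
ds = xr.open_zarr("gs://gcp-public-data-arco-era5/co/model-level-moisture.zarr")

# Re-express the dask graph with one chunk per (time step, level).
# Dimension names here are assumptions for illustration.
small = ds.chunk({"time": 1, "hybrid": 1})

# Drop the stale on-disk chunk encoding so to_zarr writes the new dask
# chunks instead of raising a chunk-mismatch error.
for var in small.variables.values():
    var.encoding.pop("chunks", None)

small.to_zarr("model-level-rechunked.zarr", mode="w")
```

For a store of this size, naively rechunking through dask can be memory-hungry; the `rechunker` package, which streams data through an intermediate store, is the usual tool for rechunking large Zarr datasets.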

dabhicusp commented 1 year ago

Hello @loliverhennigh, I'm glad to inform you that I have created a pull request (#31) that addresses this issue (#28). The changes in the pull request should solve the problem you reported.

Could you please review the pull request and, if everything looks good to you, mark this issue as resolved? @alxmrs has already merged the PR into the main branch.

Thank you for bringing this issue to our attention. If you have any further questions or concerns, feel free to let me know.

alxmrs commented 1 year ago

@DarshanSP19: once #49 lands, can we mark this issue as fixed?