openclimatefix / ocf_datapipes

OCF's DataPipe based dataloader for training and inference
MIT License

Test out Xarray Tensorstore Backend #198

Open jacobbieker opened 1 year ago

jacobbieker commented 1 year ago

https://github.com/google/xarray-tensorstore

Detailed Description

We have been trying to speed up access from zarr for quite awhile. Tensorstore might help, and Google recently made public a backend for xarray that uses Tensorstore.

@JackKelly

JackKelly commented 1 year ago

SGTM!

Relevant links:

And reasons to believe that TensorStore might be faster than zarr-python:

assafshouval commented 1 year ago

I would like to take this issue. I'm new here; how does this work, and where should I start looking?

jacobbieker commented 1 year ago

Hi! That's great! For this issue, the primary place that things would need to be updated is the files in `ocf_datapipes/load`, since the work is mainly about opening and reading the Zarrs with TensorStore. The two main files to look at are:

- Satellite data: https://github.com/openclimatefix/ocf_datapipes/blob/087149831f785903b600ae10540895ff3b6e511c/ocf_datapipes/load/satellite.py#L56 — satellite data is available on GCP if you want more to test with, beyond the bit of satellite data included in the unit tests.
- NWP data, primarily ICON: https://github.com/openclimatefix/ocf_datapipes/blob/087149831f785903b600ae10540895ff3b6e511c/ocf_datapipes/load/nwp/providers/icon.py#L8 — we have an archive of ICON data in Zarr format on HuggingFace, where you can download an example or two to try things out. There is also some data included in this repo for the unit tests, in case you want to try with that instead.

Ideally, it should be possible to just swap out `xr.open_zarr` with `xarray_tensorstore.open_zarr` with minimal changes. There are some caveats, though:

  1. From some initial testing, TensorStore possibly doesn't work very well with our compression algorithm, `ocf_blosc2` (@dfulu might have more info).
  2. We don't want to call `xarray_tensorstore.read()` until we have cropped and picked out the examples to use; otherwise it seems like it might read the whole dataset into memory, which for most of our data sources is multiple TBs.
assafshouval commented 1 year ago

@jacobbieker, just updating that I've tried switching to `xarray_tensorstore.open_zarr(path)`, but I've got a problem on the following line, when sorting by time: `dataset = dataset.drop_duplicates("time").sortby("time")`. I've encountered this problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...
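For context, the failing pattern works with a plain in-memory xarray dataset; the error was specific to the TensorStore backend. A minimal sketch of what that line does (the coordinate values and data are made up for illustration):

```python
import numpy as np
import xarray as xr

# Toy dataset with duplicated and unsorted time coordinates, mimicking
# what the loader has to clean up; values are illustrative only.
times = np.array(
    ["2023-01-02", "2023-01-01", "2023-01-02"], dtype="datetime64[ns]"
)
ds = xr.Dataset({"data": ("time", [2.0, 1.0, 2.0])}, coords={"time": times})

# The line that raised an error under xarray-tensorstore:
ds = ds.drop_duplicates("time").sortby("time")
```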

jacobbieker commented 1 year ago

> @jacobbieker, just updating that I've tried to switch to `xarray_tensorstore.open_zarr(path)`, but I've got the following problem in the line after, when `sortby('time')`: `dataset = dataset.drop_duplicates("time").sortby("time")`. I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...

Hi, thanks for looking into this! If you can open the zarr, which you can, since you got to that line, then the dependencies shouldn't be the problem, I don't think. Maybe the TensorStore implementation isn't mature enough for this yet? I'm not really sure. I thought xarray-tensorstore uses zarr-python for the metadata, so the sorting should work with that, but yeah, sorry that's not much help.

assafshouval commented 1 year ago

> @jacobbieker, just updating that I've tried to switch to `xarray_tensorstore.open_zarr(path)`, but I've got the following problem in the line after, when `sortby('time')`: `dataset = dataset.drop_duplicates("time").sortby("time")`. I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...

> I thought xarray-tensorstore uses zarr-python for the metadata, so the sorting should work with that, but yeah, sorry it's not much help.

Yeah, they do use zarr-python, but the problem arises when trying to deep-copy the dataset. I'll invest a little more time exploring this and see if I can advance it further; if not, I'll leave it for now. Thanks!

shoyer commented 1 year ago

> @jacobbieker, just updating that I've tried to switch to `xarray_tensorstore.open_zarr(path)`, but I've got the following problem in the line after, when `sortby('time')`: `dataset = dataset.drop_duplicates("time").sortby("time")`. I've encountered the same problem and opened an issue here: https://github.com/google/xarray-tensorstore/issues/1#issue-1855187534 I'm wondering whether I don't have the dependencies right...

This should be fixed in the latest 0.1.1 release of Xarray-Tensorstore.

assafshouval commented 12 months ago

Indeed, it works. Thanks @shoyer. @jacobbieker, I have more questions: a. Do you have a good example that would be worth benchmarking? b. It doesn't support opening a multi-file dataset, as in `dataset = xr.open_mfdataset(zarr_path, **openmf_kwargs)`, for example in satellite.py, line 45. I don't know how much this scenario is worth investing time in; maybe it's worth benchmarking the first case first...
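One conceivable workaround for the `open_mfdataset` limitation (a sketch only, untested against xarray-tensorstore; `open_single` is a stand-in for whichever single-store opener ends up being used) would be to open each store individually and concatenate lazily:

```python
import xarray as xr


def open_multi_zarr(paths, open_single=xr.open_zarr, dim="time"):
    """Open several Zarr stores with a single-store opener and
    concatenate them along `dim`, approximating xr.open_mfdataset.

    Sketch under the assumption that each store shares the same
    variables and non-`dim` coordinates.
    """
    datasets = [open_single(path) for path in paths]
    return xr.concat(datasets, dim=dim)
```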

jacobbieker commented 12 months ago

Hi, yeah, a good benchmark would be a single satellite zarr, like this one: gs://public-datasets-eumetsat-solar-forecasting/satellite/EUMETSAT/SEVIRI_RSS/v4/2023_hrv.zarr

And okay, thanks. Yeah, if it is a lot faster, we can probably find a workaround to that issue.
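A rough harness for that comparison might look like the following. The function name and arguments are placeholders, not anything from ocf_datapipes; `open_fn` would be `xr.open_zarr` in one run and `xarray_tensorstore.open_zarr` in the other, and real timings will depend heavily on chunking, caching, and network:

```python
import time


def time_open_and_slice(open_fn, path, select=None, repeats=3):
    """Time opening a Zarr store and taking a small lazy selection.

    Returns the best (minimum) wall-clock time over `repeats` runs, to
    reduce noise from caching and network jitter.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        ds = open_fn(path)
        if select is not None:
            # e.g. select={"time": slice(0, 12)} to mimic cropping out
            # one example before materialising anything.
            ds = ds.isel(**select)
        timings.append(time.perf_counter() - start)
    return min(timings)
```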

jacobbieker commented 9 months ago

Update from my testing: TensorStore does not support compressors outside this list: https://google.github.io/tensorstore/driver/zarr/index.html#json-driver/zarr/Compressor, and so it can't open most of the OCF Zarrs, which are compressed with Blosc2.
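This limitation can be checked without TensorStore by inspecting an array's `.zarray` metadata. A sketch, assuming Zarr v2 metadata layout; the supported-ids set below is transcribed from the driver page linked above (blosc, bz2, zlib, zstd, or no compression), and `blosc2`, as written by `ocf_blosc2`, is not among them:

```python
import json
from pathlib import Path

# Compressor ids accepted by TensorStore's zarr (v2) driver, per the
# documentation linked above; None means "no compression".
TENSORSTORE_COMPRESSORS = {"blosc", "bz2", "zlib", "zstd", None}


def tensorstore_can_read(zarray_path: str) -> bool:
    """Return True if the array's compressor is on TensorStore's list.

    `zarray_path` points at a `.zarray` JSON file inside a Zarr v2 store.
    """
    meta = json.loads(Path(zarray_path).read_text())
    compressor = meta.get("compressor")
    comp_id = compressor["id"] if compressor else None
    return comp_id in TENSORSTORE_COMPRESSORS
```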

peterdudfield commented 9 months ago

Should we close this?

jacobbieker commented 9 months ago

I think we should leave it open for now; I'll close the issues related to adding support, though. @dfulu was going to add a small example notebook showing the testing results, to have them recorded here.