openclimatefix / ocf-data-sampler

A test repo to experiment with refactoring ocf_datapipes

Load ICON data from HF #66

Open · peterdudfield opened this issue 1 month ago

peterdudfield commented 1 month ago

Detailed Description

It would be great to be able to load the ICON data from Hugging Face (HF) in our NWP open dataset.

Context

Possible Implementation

gabrielelibardi commented 6 days ago

@peterdudfield I'd like to have a look at this. Could you give me permission to create a branch?

peterdudfield commented 6 days ago

Hi @gabrielelibardi, we actually paused development on this in ocf_datapipes, but it could be done in ocf-data-sampler. Would you like to try it there?

gabrielelibardi commented 5 days ago

I managed to run the pvnet_datapipe on some of the icon-eu Hugging Face data that I downloaded locally (one day's worth of data). That needed some changes in the code, but I think the problem you refer to in this issue is different: it seems to be independent of the postprocessing done in ocf_datapipes and comes down to this line: ds = xr.open_mfdataset(zarr_paths, engine="zarr", combine="nested", concat_dim="time"). If zarr_paths contains the paths to every .zarr.zip file on Hugging Face, ds never gets initialized because it just takes too long. I presume this is because of all the metadata that needs to be downloaded from each .zarr.zip file. Once ds is initialized, the data is loaded lazily as you create the batches. Maybe caching the metadata locally could speed things up. Do I understand this correctly? @peterdudfield

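For reference, a minimal sketch of the pattern being described, assuming the .zarr.zip archives are reached over HTTP via fsspec's zip chaining; the Hugging Face URL layout shown is made up for illustration:

```python
# Minimal sketch of the slow path described above: one lazy dataset built from
# many remote .zarr.zip archives. The URL layout is a made-up illustration.
import fsspec
import xarray as xr

# Hypothetical list of archive URLs, one per init time.
zarr_zip_urls = [
    "https://huggingface.co/datasets/openclimatefix/dwd-icon-eu/resolve/main/20230101_00.zarr.zip",
    # ... one entry per file ...
]

# fsspec's "zip::" chaining opens each archive over HTTP range requests.
# Opening each archive and reading its zarr metadata takes several HTTP
# round trips per file, which is where the time goes before any data is read.
stores = [fsspec.get_mapper(f"zip::{url}") for url in zarr_zip_urls]

ds = xr.open_mfdataset(stores, engine="zarr", combine="nested", concat_dim="time")
```
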
peterdudfield commented 5 days ago

Thanks, great that you managed it with one day of data.

Hmmm, interesting. The metadata is normally quite small.

Does it scale linearly with the number of files you provide to xr.open_mfdataset? For example, can you load 2 data files from HF quickly?

I have seen before that if the files have different shapes, xr.open_mfdataset can take a long time, as it tries to sort these out.

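One way to check this would be to time the open step for an increasing number of files, roughly like the sketch below, where stores stands in for whatever list of paths/mappers is already being passed to open_mfdataset:

```python
# Rough sketch: does building the lazy dataset scale roughly linearly with
# the number of files? `stores` is a stand-in for the existing list of zarr
# paths/mappers already being passed to xr.open_mfdataset.
import time
import xarray as xr

def time_open(subset):
    """Wall-clock seconds spent building the lazy dataset (no data loaded)."""
    start = time.perf_counter()
    xr.open_mfdataset(subset, engine="zarr", combine="nested", concat_dim="time")
    return time.perf_counter() - start

for n in (1, 2, 5, 10, 20):
    print(f"{n:>3} files: {time_open(stores[:n]):6.1f} s")
```
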
gabrielelibardi commented 5 days ago

It definitely gets slower with more files. Here is a flamegraph SVG of the profiling: profile. I am taking a year's worth of Hugging Face .zarr.zip files and trying to build an xarray dataset from the first 20. Most of the time is spent in requests to the HF server. If I try to build an xarray dataset from too many paths, the HF server eventually refuses with huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/openclimatefix/dwd-icon-eu.

peterdudfield commented 5 days ago

Thanks for doing this, and nice to see the profile. What are the current timings for making an xarray dataset from the first 20 files? Are you totally sure it's only getting the metadata, not downloading all the data?

Do you know roughly how many paths it takes until you get a 429?

gabrielelibardi commented 5 days ago

It takes 58 seconds to create the xarray dataset with 20 files. I'm pretty sure it is not downloading the data (that would be more than 60 GB in 60 seconds, a bit too much for my laptop). I think I was trying 50 or so files and was already getting a 429. Maybe it is because the .zmetadata files are so small that a lot of separate requests are fired off too quickly; if I download a whole .zarr.zip file it does not complain.

gabrielelibardi commented 5 days ago

For training with ECMWF data, do you also use multiple zarr files or just one? Do you stream it from S3 buckets or keep it on the training server?

peterdudfield commented 5 days ago

For training with ECMWF data, do you also use multiple zarr files or just one? Do you stream it from S3 buckets or keep it on the training server?

For training on ECMWF, we tend to join our daily ECMWF files into yearly (or monthly) files and then open the multiple zarrs together. We try to keep the data close to where the model is running, so either locally if training locally, or in the cloud if we are training in the cloud. It would be nice to stream from HF though, so we don't have to reorganise the files, and the live data is available.

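For context, that joining step could look roughly like this; the paths and layout are illustrative, not the actual OCF setup:

```python
# Illustrative sketch of joining daily ECMWF zarrs into one monthly zarr so
# training only has to open a handful of stores. Paths are made up.
import glob
import xarray as xr

daily_paths = sorted(glob.glob("/data/ecmwf/2023-01-*.zarr"))  # hypothetical layout

ds = xr.open_mfdataset(daily_paths, engine="zarr", combine="nested", concat_dim="time")

# Write one consolidated monthly store; consolidated metadata keeps later
# opens down to a single metadata read.
ds.to_zarr("/data/ecmwf/2023-01.zarr", mode="w", consolidated=True)
```
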
Thanks for these benchmark figures. Yeah, I agree it's just downloading the metadata, not the whole thing.

It feels like there should be some caching we could do, for example loading the metadata for each month once and then quickly loading any new file in the future. I really don't know the solution for this, sorry.

@devsjc @Sukh-P @AUdaltsova might have some ideas?

jacobbieker commented 5 days ago

You should be able to use something like kerchunk or VirtualiZarr to save the metadata into one file and open that. That way you wouldn't need all the requests for the metadata. Getting the data is still limited by HF request limits, but that is at least less of an issue.

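A rough sketch of the kerchunk route, assuming the stores can be scanned as plain zarr over HTTP (the zipped archives would need fsspec's zip chaining on top); the URLs and dimension names are assumptions, and VirtualiZarr offers a similar workflow:

```python
# Rough sketch of the kerchunk idea: scan each remote zarr once, merge the
# references into a single local file, and open everything through it so only
# data chunks are fetched at read time. URLs and dims are assumptions.
import json
import xarray as xr
from kerchunk.zarr import single_zarr
from kerchunk.combine import MultiZarrToZarr

urls = [
    # hypothetical entries, one per file on Hugging Face
    "https://huggingface.co/datasets/openclimatefix/dwd-icon-eu/resolve/main/20230101_00.zarr",
]

# One-off scan: extract the reference metadata for each store.
single_refs = [single_zarr(u) for u in urls]

# Combine along time and save the references locally for reuse.
combined = MultiZarrToZarr(single_refs, concat_dims=["time"]).translate()
with open("icon_eu_refs.json", "w") as f:
    json.dump(combined, f)

# Later opens go through the single reference file, with no per-file metadata requests.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "icon_eu_refs.json", "remote_protocol": "https"},
    },
)
```
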
gabrielelibardi commented 3 days ago

Thanks a lot @jacobbieker @peterdudfield! I haven't tried it yet but it looks promising. For now I have put one month of data on cloud storage (something like S3), which solves the problem with the 429 error. I tried to run script/save_batches.py though, and it is impractically slow (about 1 sample every 3 seconds). It is not a problem for me to self-host the ICON dataset (or parts of it), but I would like to keep large amounts of data off the training instance, since we would spin it up and down. I would be curious to know, for the training of PVNet, how long you needed to create the batches for the training data, and what solutions worked best for you when training in the cloud.

peterdudfield commented 3 days ago

Thanks @gabrielelibardi

Is this slow speed when you load from S3? Where are you running the code from, an EC2 instance?

We tend to get a speed of more like 1 batch every second or so (I would have to look up the batch size), so it can still take a day or so to make batches. We get the fastest results by having the data very near, i.e. locally or on a disk attached to a VM. This is more setup, but faster. Using multiprocessing helps too; I think that is already in the batch-making script.
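
Not the actual ocf-data-sampler script, just a minimal sketch of that multiprocessing pattern, where make_and_save_batch is a hypothetical stand-in for whatever builds one batch and writes it out:

```python
# Minimal sketch of parallel batch saving, not the actual ocf-data-sampler
# script. make_and_save_batch is a hypothetical stand-in for whatever selects
# the samples for one batch, loads them, and writes them to disk.
from multiprocessing import Pool

def make_and_save_batch(batch_idx: int) -> str:
    out_path = f"batches/batch_{batch_idx:06d}.nc"
    # Hypothetical: slice the lazily opened dataset for this batch, trigger
    # the actual data load, and save it to out_path.
    ...
    return out_path

if __name__ == "__main__":
    n_batches = 1000  # illustrative
    with Pool(processes=8) as pool:
        for path in pool.imap_unordered(make_and_save_batch, range(n_batches)):
            print("saved", path)
```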