openclimatefix / skillful_nowcasting

Implementation of DeepMind's Deep Generative Model of Radar (DGMR) https://arxiv.org/abs/2104.00954
MIT License

Loading BOM data into RAM #41

Open primeoc opened 1 year ago

primeoc commented 1 year ago

Hello,

I am trying to train this model on Australia's BOM radar data, but I am having trouble loading the data into memory.

I have one year's worth of data in netCDF4 format at 5-minute time steps, with each time step stored as a separate NC file. For example, the precipitation field for 01/01/2022 at 12:30pm would be at: BOM Rain Rate Data 2022 (folder) > 20220101 (folder) > 20220101_123000.nc. Within each netCDF file, the precipitation field is stored as an array of int64 values in mm/h under a variable called 'rain_rate'. I have tried the netCDF4 and xarray libraries for Python and receive an OOM error.

The problem is that loading all of the available 2022 data (~300 days) would require approximately 180 GB of RAM, which I do not have. The netCDF format must be compressing the data, since the full year takes up only ~5 GB on disk.

How would I go about efficiently loading all this data and passing it into the DGMR?

Thanks for your help.

jacobbieker commented 1 year ago

Hey, sorry for the delay, I just missed this issue. I wouldn't load it all into RAM at once. For training, we lazily load the data we need from either the UK Nimrod dataset or US MRMS, so only the small examples being used are in memory at any given time. We tend to use Zarr and xarray, which work fairly well for that.
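A minimal sketch of that pattern for the BOM layout described above (the glob pattern, the `rain_rate` variable name, and the chunk/sequence sizes are assumptions taken from this thread, not the repo's actual training code):

```python
import xarray as xr

# Open every 5-minute file for 2022 as one lazy, dask-backed dataset.
# Nothing beyond metadata is read into RAM here; combine="nested" with
# concat_dim="time" stacks the per-timestep files along a new time axis
# (if each file already carries a time coordinate, combine="by_coords"
# may be the better choice).
ds = xr.open_mfdataset(
    "BOM Rain Rate Data 2022/*/*.nc",
    combine="nested",
    concat_dim="time",
    parallel=True,
)

# Optionally rewrite once to a chunked, compressed Zarr store, so later
# reads pull only the chunks a training example actually touches.
ds.to_zarr("bom_2022.zarr", mode="w")
```

A `Dataset` can then slice short sequences out of the store on demand, so only a few time steps are ever resident (the DGMR paper uses 4 context and 18 forecast frames; the class below is hypothetical):

```python
import torch
import xarray as xr
from torch.utils.data import Dataset

class BOMRadarDataset(Dataset):
    """Hypothetical Dataset yielding (context, target) radar sequences lazily."""

    def __init__(self, zarr_path, context_steps=4, forecast_steps=18):
        self.ds = xr.open_zarr(zarr_path)  # lazy: no pixel data read here
        self.context_steps = context_steps
        self.seq_len = context_steps + forecast_steps

    def __len__(self):
        return self.ds.sizes["time"] - self.seq_len + 1

    def __getitem__(self, idx):
        # Only these seq_len time steps are read from disk.
        seq = self.ds["rain_rate"].isel(time=slice(idx, idx + self.seq_len)).values
        seq = torch.from_numpy(seq).float().unsqueeze(1)  # (T, 1, H, W)
        return seq[: self.context_steps], seq[self.context_steps:]
```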

peterdudfield commented 1 year ago

@all-contributors please add @primeoc for question

allcontributors[bot] commented 1 year ago

@peterdudfield

I've put up a pull request to add @primeoc! :tada: