The predictor variables for downscaling (e.g. CAPE, convective precipitation, etc.) are not used in AtmoRep and therefore had to be downloaded with the CDS API. For convenience, the data has been downloaded in netCDF-format and is available under /p/scratch/atmo-rep/data/era5/new_structure/. The `load_era5_monthly`-method is, however, designed to handle both file formats (grib and netCDF).
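Since the method itself is not shown here, the following is only a minimal sketch of how such a loader can dispatch on the file format with xarray; the signature, file-naming pattern, and variable handling are assumptions, only the name `load_era5_monthly` comes from this issue.

```python
from pathlib import Path

import xarray as xr


def load_era5_monthly(data_dir: str, year: int, month: int,
                      variables: list[str]) -> xr.Dataset:
    """Load one month of ERA5 data from either a netCDF or a grib file.

    Assumes files are named era5_<year>-<month>.<ext>; adapt the
    pattern to the actual layout under new_structure/.
    """
    stem = Path(data_dir) / f"era5_{year}-{month:02d}"
    nc_file, grib_file = stem.with_suffix(".nc"), stem.with_suffix(".grb")

    if nc_file.exists():
        ds = xr.open_dataset(nc_file)                     # netCDF, default engine
    elif grib_file.exists():
        ds = xr.open_dataset(grib_file, engine="cfgrib")  # grib via cfgrib
    else:
        raise FileNotFoundError(f"No monthly file found for {stem}")

    return ds[variables]
```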
Closer inspection of the script and methods under the `dsrnngan/data`-directory reveals several issues, e.g.:

- in `write_data`, `np.digitize` misses an index-shift (see the sketch below the list),
- the fraction of rainy grid points is calculated on the full domain here (see above).
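For reference, the off-by-one behaviour of `np.digitize` looks as follows; the bin edges below are made up for illustration and are not the ones used in `write_data`.

```python
import numpy as np

# np.digitize returns 1-based bin positions (0 means "below the first
# edge"), so using its output directly as a 0-based class index is off
# by one.
bin_edges = np.array([0.0, 0.1, 1.0, 5.0, 20.0])   # example precip bins (mm)
values = np.array([0.05, 0.5, 3.0, 50.0])

raw = np.digitize(values, bin_edges)               # -> [1, 2, 3, 5] (1-based)
classes = np.clip(raw - 1, 0, len(bin_edges) - 2)  # -> [0, 1, 2, 3] (0-based)
```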
Training data has been successfully preprocessed over the past few weeks, which should (hopefully) enable training:
/p/scratch/atmo-rep/data/downscaling/downscaling_tfrecords/training_data/0aad51a8f3848213
Integration of the validation dataset is still open and will probably be realized via TFRecords again for efficiency (with no patching of the data). However, the corresponding adaptations will be performed in a separate issue-branch.
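A minimal sketch of how the validation data could then be streamed with `tf.data`, assuming each record stores the serialized input/target tensors of one full (unpatched) domain; the feature names, filename, and record layout are assumptions about the eventual format, not the actual dsrnngan code.

```python
import tensorflow as tf

# Assumed record layout: two serialized tensors per example.
feature_spec = {
    "inputs": tf.io.FixedLenFeature([], tf.string),
    "target": tf.io.FixedLenFeature([], tf.string),
}

def _parse(example_proto):
    parsed = tf.io.parse_single_example(example_proto, feature_spec)
    x = tf.io.parse_tensor(parsed["inputs"], out_type=tf.float32)
    y = tf.io.parse_tensor(parsed["target"], out_type=tf.float32)
    return x, y

val_ds = (tf.data.TFRecordDataset("validation.tfrecords")  # placeholder name
          .map(_parse, num_parallel_calls=tf.data.AUTOTUNE)
          .batch(1))
```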
Add functions to the data preprocessing (that is used to create the TFRecords files that are streamed during training) to read/process the ERA5 input data and the CERRA target data. As both datasets are available in monthly grib-files, the data processing will also be changed. So far, an iterator is used to write the dataset into TFRecords, where each sample involves an I/O-process (i.e. opening the file, getting the data, closing it). This produces a lot of I/O-overhead that can be avoided with the monthly files; thus, the related `write_data`-function as well as the `DataGenerator` will be adapted accordingly (see the sketch below).
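A minimal sketch of the adapted `write_data` under these assumptions: each monthly grib file is opened once, and all of its samples are streamed into the TFRecord file, instead of one open/read/close cycle per sample. File names, variable handling, and the helper are illustrative, not the actual dsrnngan code.

```python
import numpy as np
import tensorflow as tf
import xarray as xr


def _serialize_sample(inputs: np.ndarray, target: np.ndarray) -> bytes:
    """Pack one (input, target) pair into a tf.train.Example."""
    feature = {
        "inputs": tf.train.Feature(
            bytes_list=tf.train.BytesList(
                value=[tf.io.serialize_tensor(inputs).numpy()])),
        "target": tf.train.Feature(
            bytes_list=tf.train.BytesList(
                value=[tf.io.serialize_tensor(target).numpy()])),
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString()


def write_data(era5_file: str, cerra_file: str, out_file: str) -> None:
    # One open per monthly file instead of one I/O cycle per sample.
    era5 = xr.open_dataset(era5_file, engine="cfgrib")
    cerra = xr.open_dataset(cerra_file, engine="cfgrib")

    with tf.io.TFRecordWriter(out_file) as writer:
        for t in range(era5.sizes["time"]):
            x = era5.isel(time=t).to_array().values.astype(np.float32)
            y = cerra.isel(time=t).to_array().values.astype(np.float32)
            writer.write(_serialize_sample(x, y))
```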