NetCDF4 files, commonly used for storing climate and Earth system data, are not optimized for machine learning applications with heavy IO requirements, or for datasets that are simply too large to hold in GPU/CPU memory.
nc2pt performs a preprocessing flow on climate fields, converting them from NetCDF4 (.nc) to an intermediate Zarr format (.zarr), which allows for parallel loading and writing to individual PyTorch files (.pt) that can be loaded directly onto GPUs.
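As a quick illustration (the file name and tensor layout below are assumptions, not the exact output of nc2pt), a .pt file produced by the pipeline can be deserialized with torch.load and moved straight onto a GPU:

```python
import torch

# Hypothetical output file; actual names and paths depend on your configuration.
sample = torch.load("proc/train/batch_000.pt")

# Move onto the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
sample = sample.to(device)
print(sample.shape, sample.device)
```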
- standardizes and makes metadata uniform between datasets
- aligns different grids by re-projecting them onto one another -- nc2pt projects the low-resolution (lr) regular grids onto the high-resolution (hr) curvilinear grids, and assumes the curvilinear dimensions are named rlat and rlon (see the regridding sketch after this list). It was originally designed to support super-resolution problems.
- selects individual years as test years or training years
- organizes code into input (lr) or output (hr) fields
- meant for use with large datasets on the order of hundreds of GB
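For context, re-projecting a regular lr grid onto an hr curvilinear grid is the kind of operation xESMF handles. The following is a minimal sketch of that idea, not nc2pt's internal code; the file names, variable name, and bilinear method are assumptions:

```python
import xarray as xr
import xesmf as xe

# Hypothetical inputs: an lr field on a regular lat/lon grid, and an hr
# reference field on a curvilinear (rlat/rlon) grid with 2D lat/lon coordinates.
lr = xr.open_dataset("lr_tas.nc")
hr_ref = xr.open_dataset("hr_ref.nc")

# Build a regridder from the lr grid onto the hr grid, then apply it.
regridder = xe.Regridder(lr, hr_ref, method="bilinear")
lr_on_hr = regridder(lr["tas"])
print(lr_on_hr.dims)
```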
High-level workflow
1. nc2pt/preprocess.py - preprocesses the climate fields and writes them to .zarr
2. nc2pt/tools/zarr_to_torch.py - writes to PyTorch files
3. nc2pt/tools/single_files_to_batches.py - batches the single PyTorch files

The most obvious downside of the .pt files is that you lose the metadata associated with a netCDF dataset. The intermediate Zarr format produced by nc2pt allows for parallelized IO and preserves the metadata. This is useful for inference.
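Because the Zarr store keeps that metadata, it can be reopened later (for example at inference time) to recover coordinates and attributes. A minimal sketch, where the store path is an assumption:

```python
import xarray as xr

# Hypothetical path to an intermediate store written by nc2pt/preprocess.py.
ds = xr.open_zarr("/home/nannau/data/proc/hr.zarr")

# Coordinates and attributes from the original NetCDF files are preserved,
# so predictions can be georeferenced and written back out with metadata.
print(ds.attrs)
print(ds["tas"].dims, ds["tas"].coords)
```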
xESMF is only available through Conda, so you will have to be able to install conda on your system. Unfortunately, this is limiting because certain HPCs don't allow conda.
Begin by installing xESMF in a conda environment, following the instructions here: xESMF
Clone this repository
Install into your conda environment
conda install -c conda-forge pip
pip install -r requirements.txt
# editable install
pip install -e nc2pt/
That's it!
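To sanity-check the environment, one option is to confirm that both xESMF and nc2pt import cleanly (a quick optional check, not a required step):

```python
# Quick sanity check that the conda/pip install worked.
import xesmf
import nc2pt

print("xesmf", xesmf.__version__)
print("nc2pt imported from", nc2pt.__file__)
```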
nc2pt uses Hydra for configuration by instantiating structured classes in nc2pt/climatedata.py. This simultaneously defines the workflow and the data. Please see nc2pt/conf/config.yml for an example configuration, or below; a short sketch of how Hydra instantiates this configuration follows the example:
_target_: nc2pt.climatedata.ClimateData # Initializes the ClimateData dataclass object
output_path: /home/nannau/data/proc/
climate_models:
# This lists the models
- _target_: nc2pt.climatedata.ClimateModel
name: hr
info: "High Resolution USask WRF, Western Canada"
climate_variables: # Provides a list of ClimateVariable dataclass objects to initialize
- _target_: nc2pt.climatedata.ClimateVariable
name: "tas"
alternative_names: ["T2", "surface temperature"]
path: /home/nannau/USask-WRF-WCA/fire_vars/T2/*.nc
is_west_negative: true
apply_standardize: false
apply_normalize: true
invariant: false
transform: []
- _target_: nc2pt.climatedata.ClimateModel
info: "Low resolution ERA5, Western Canada"
name: lr
hr_ref: # Reference field to interpolate to. Will need to provide new file if not using USask WRF
_target_: nc2pt.climatedata.ClimateVariable
name: "hr_ref"
alternative_names: ["T2"]
path: nc2pt/nc2pt/data/hr_ref.nc
is_west_negative: true
climate_variables:
- _target_: nc2pt.climatedata.ClimateVariable
name: "tas"
alternative_names: ["T2", "surface temperature"]
path: /home/nannau/ERA5_NCAR-RDA_North_America/proc/tas_1hr_ERA5_an_RDA-025_1979010100-2018123123_time_sliced_cropped.nc
is_west_negative: false
apply_standardize: false
apply_normalize: true
invariant: false
transform:
- "x - 273.15"
dims: # Defines the dimensions you might find in your lr or hr dataset and lists them to be initialized as ClimateDimension objects. Typically this would match what is in your hr dataset. Intended to allow renaming of dimensions and control of chunking
- _target_: nc2pt.climatedata.ClimateDimension
name: time
alternative_names: ["forecast_initial_time", "Time", "Times", "times"]
chunksize: 100
- _target_: nc2pt.climatedata.ClimateDimension
name: rlat
alternative_names: ["rotated_latitude"]
hr_only: true
chunksize: -1
- _target_: nc2pt.climatedata.ClimateDimension
name: rlon
alternative_names: ["rotated_longitude"]
hr_only: true
chunksize: -1
# similar to dims, just as coordinates instead. Coordinates might not match dims on curvilinear grids
coords:
- _target_: nc2pt.climatedata.ClimateDimension
name: lat
alternative_names: ["latitude", "Lat", "Latitude"]
chunksize: -1
- _target_: nc2pt.climatedata.ClimateDimension
name: lon
alternative_names: ["longitude", "Long", "Lon", "Longitude"]
chunksize: -1
# subsample data temporally or spatially!
select:
# Time indexing for subsets
time:
# Crop to the dataset with the shortest run
# this defines the full dataset from which to subset
range:
start: "20001001T06:00:00"
end: "20150928T12:00:00"
# start: "2021-11-01T00:00:00"
# end: "2021-12-31T22:00:00"
# use this to select which years to reserve for testing
# and for validation
# the remaining years in full will be used for training
test_years: [2000, 2009, 2014]
validation_years: [2015]
# test_years: [None]
# validation_years: [None]
# sets the scale factor and index slices of the rotated coordinates
spatial:
scale_factor: 8
x:
first_index: 110
last_index: 622
y:
first_index: 20
last_index: 532
# dask client parameters
compute:
# xarray netcdf engine
engine: h5netcdf
dask_dashboard_address: 8787
chunks:
time: auto
rlat: auto
rlon: auto
# optional for tools scripts (single_files_to_batches)
loader:
batch_size: 4
randomize: true
seed: 0
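For reference, Hydra resolves each _target_ entry in the YAML into the corresponding class. The following is a minimal sketch of that mechanism; the entry-point name and config path are assumptions, not necessarily how nc2pt/preprocess.py is written:

```python
import hydra
from hydra.utils import instantiate

# Hypothetical entry point; nc2pt/preprocess.py wires this up for you.
@hydra.main(version_base=None, config_path="nc2pt/conf", config_name="config")
def main(cfg):
    # Recursively builds ClimateData, ClimateModel, ClimateVariable, and
    # ClimateDimension objects from the _target_ entries in the YAML.
    climate_data = instantiate(cfg)
    print(type(climate_data))

if __name__ == "__main__":
    main()
```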
1. Define your configuration in nc2pt/conf/config.yaml
2. Run the nc2pt/preprocess.py script, which will run through your preprocessing steps. This creates the zarr files.
3. Run the nc2pt/tools/zarr_to_torch.py script, which serializes each time step in the .zarr file to an individual PyTorch .pt file.
4. Run nc2pt/tools/single_files_to_batches.py, which combines individual files from the previous step into random batches. This setup allows for less IO in your machine learning pipeline.

Testing is done with pytest. The easiest way to perform tests is to install pytest and use the command: pytest --cov-report term-missing --cov=nc2pt .
It will generate a coverage report and automatically discover test files matching test_*.py in nc2pt/tests.