This module contains the CFDataset class, which extends the netCDF4.Dataset class to provide additional properties, memory-efficient data access, and improved error handling for netCDF files that comply with the CF Metadata Conventions and the PCIC metadata conventions that extend them.
It supports several PCIC tools that work with netCDF files that adhere to the CF and PCIC metadata conventions. The class provides several properties that describe a file's contents and metadata and can be used to guide data processing. It does not provide any new tools to directly modify netCDF files, but all file-modifying procedures in the netCDF4.Dataset class are still available.
iteration.py contains generators for iterating over a netCDF file and loading one chunk at a time so that enormous files can be read without a MemoryError.
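The chunk-at-a-time idea can be sketched in plain Python. This is only an illustration of the technique, not the actual contents of iteration.py: the names chunk_slices and chunked_sum are hypothetical, and a plain list stands in for a netCDF variable.

```python
from typing import Iterator, Sequence


def chunk_slices(length: int, chunk_size: int) -> Iterator[slice]:
    """Yield slices that cover range(length) in pieces of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, length, chunk_size):
        yield slice(start, min(start + chunk_size, length))


def chunked_sum(values: Sequence[float], chunk_size: int) -> float:
    """Sum a long sequence while holding only one chunk at a time."""
    total = 0.0
    for sl in chunk_slices(len(values), chunk_size):
        chunk = values[sl]  # with a netCDF variable, the read would happen here
        total += sum(chunk)
    return total
```

Because the generator yields slices lazily, memory use is bounded by the chunk size rather than by the size of the file.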
PCIC has a process-oriented metadata model.
Data originates as either model output (simulated by a Global Climate Model or Regional Climate Model) or observations (measured directly in some fashion).
The data can then be used as input to one or more further processes. Each new process preserves all the metadata describing the data origin and previous process. When a new dataset B is generated from a process that uses dataset A as input, all metadata attributes describing A's generation will be present in B, prepended with a prefix that refers to A's role in generating B.
For example, suppose you have a model output dataset A, with a metadata attribute giving the name of the generating model, example_model. A has the metadata attribute model_id with the value example_model.
If A is used as input to a downscaling process, the output dataset B will have an attribute called GCM__model_id with the value example_model. A is used as the GCM (global climate model) input to the downscaling process, so the prefix GCM is used.
If B is further used as input to a hydrological modeling process, the output dataset C will have an attribute called downscaling__GCM__model_id with the value example_model. B is a downscaled dataset used as forcing data for the hydrological model, so its attributes are prepended with downscaling, including the attributes it inherited from A to show its own inputs.
The metadata preserves the entire chain of processes followed to create any given dataset so that its origin can always be traced and recreated.
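The prefixing scheme in the A, B, C example can be sketched with plain dictionaries. This is a minimal illustration, not code from this module: inherit_metadata and lineage are hypothetical helper names, and the double underscore separator is taken from the attribute names in the example.

```python
SEPARATOR = "__"  # separator seen in names such as GCM__model_id


def inherit_metadata(input_attrs: dict, role: str) -> dict:
    """Return the input's attributes, each prefixed with the input's role."""
    return {
        f"{role}{SEPARATOR}{name}": value
        for name, value in input_attrs.items()
    }


def lineage(attr_name: str) -> list:
    """List the roles an attribute passed through, most recent process first."""
    *roles, _base_name = attr_name.split(SEPARATOR)
    return roles


# Dataset A: model output
a = {"model_id": "example_model"}
# Dataset B: downscaled from A, which played the GCM role
b = inherit_metadata(a, "GCM")
# Dataset C: hydrological output forced by B, which played the downscaling role
c = inherit_metadata(b, "downscaling")
```

A real dataset would also carry attributes describing its own generating process; the sketch shows only the inherited ones.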
The functions in this module determine what sort of data a particular netCDF file contains and which processes were used to generate it, validate that required metadata is present, and navigate the metadata "tree" to find desired metadata.
Most of the time, this module will take care of the low-level details related to handling various types of datasets. Data usually takes the form of cubes with latitude, longitude, and time dimensions. While it may have different origins and different origin- or process-specific metadata, the module should seamlessly traverse the metadata formats of the various data types and provide a unified interface for accessing needed metadata.
Model output is the majority of netCDF data used by PCIC. Model output data has latitude, longitude, and time dimensions and metadata attributes specifying the model, scenario, and run used to generate the data.
Model data that has not been further processed has the is_unprocessed_gcm_output property of True. Data that is either model output or was created by processes that used model output has the is_gcm_derivative property of True.
Observation data is historical data that is derived from real world observations and then extrapolated to cover geographic or chronological gaps by an algorithmic process. (This module and the netCDF file format are not well suited for handling sparse, non-gridded observation data.)
Note that, confusingly, observation data usually does have a model_id attribute: typically this is the name of the algorithm used to extrapolate measurements to cover an entire grid. It is not a Global Climate Model, though, and simulation attributes relevant to GCMs, like experiment, will not be present.
Observational data usually, but not always, takes the form of a cube with lat, lon, and time dimensions, similar to model output.
Observation data has the is_gridded_obs property of True.
Downscaling produces data with a higher spatial resolution that is otherwise similar to the input data. It is only run on model output data; observation data is already downscaled by the extrapolation process used to create it.
Downscaled data will have the property is_downscaled_output of True and metadata specifying the downscaling algorithm (typically either BCCAQ, PRISM, or both).
Climdex calculation takes model output and calculates various derived statistics about it. The output data will have the same dimensions as the input data (lat, lon, time), but a different variable.
All climdex datasets have the property is_climdex_output set to True, and one of is_climdex_gcm_output or is_climdex_ds_gcm_output will be True as well, depending on whether the input dataset was downscaled or not.
Unlike downscaling or Climdex calculation, hydrological modeling produces data that is not a cube with lat, lon, and time dimensions. Applications that use this module to work with streamflow data will need to check whether the data is streamflow and handle it separately if so.
The hydrological model takes a downscaled model output or gridded observation dataset as input, and outputs streamflow at one particular location. The resulting dataset has an is_streamflow_model_output property of True.
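The origin properties described above can drive a simple dispatch. The following is a hedged sketch: describe_origin is a hypothetical helper, SimpleNamespace objects stand in for real datasets, and only the property names come from the text. It assumes a missing property can be treated as False, which may not match how a real dataset behaves.

```python
from types import SimpleNamespace

# Checked from most-processed to least-processed, so a dataset that also
# carries flags for its inputs is labelled by its final processing step.
ORIGIN_CHECKS = [
    ("is_streamflow_model_output", "streamflow model output"),
    ("is_climdex_output", "climdex output"),
    ("is_downscaled_output", "downscaled output"),
    ("is_gridded_obs", "gridded observations"),
    ("is_unprocessed_gcm_output", "unprocessed GCM output"),
]


def describe_origin(dataset) -> str:
    """Return a label for the first origin property that is True."""
    for prop, label in ORIGIN_CHECKS:
        if getattr(dataset, prop, False):  # assumption: absent means False
            return label
    return "unknown"


# Stand-in objects for illustration only
streamflow = SimpleNamespace(is_streamflow_model_output=True)
raw_gcm = SimpleNamespace(is_unprocessed_gcm_output=True)
```

Checking for streamflow first reflects the advice above that streamflow data needs separate handling before any code assumes a lat, lon, time cube.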
The most common type of PCIC data is a raster timeseries. Data is stored in one or more data cubes with latitude, longitude, and time dimensions. This is the default and doesn't usually require explicit handling, but can be checked for if needed.
The sampling_geometry property will have the value gridded and the time_invariant property will be False.
Climatologies are a subset of raster timeseries: a climatology contains values that are averaged over a multi-year time period, typically 30 years. Climatologies may contain annual data (one timestamp), seasonal data (four timestamps), monthly data (12 timestamps), or some combination of those time resolutions. For example, a January timestamp would represent the average of all Januaries occurring over the time period.
A climatology has a climatology_bounds_value property specifying the period over which each value is averaged.
A climatology will return True on the is_multi_year property.
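To make the averaging concrete, here is a small worked sketch of how a monthly climatology value relates to the underlying monthly data. It illustrates the concept only and is not how this module or any PCIC tool computes climatologies; monthly_climatology and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def monthly_climatology(monthly_values: dict) -> dict:
    """Average values keyed by (year, month) into one value per month."""
    by_month = defaultdict(list)
    for (_year, month), value in monthly_values.items():
        by_month[month].append(value)
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Two years of made-up January and February values
sample = {
    (1981, 1): 1.0,
    (1982, 1): 3.0,
    (1981, 2): 4.0,
    (1982, 2): 6.0,
}
climatology = monthly_climatology(sample)
```

Here the January entry is the mean of both Januaries, matching the description above; a 30-year climatology would average 30 such values per month.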
Discrete Structured Geometries have a time series of data associated with one or more specific points (like measuring stations), but not a full grid. The collection of individual points is the "instance" dimension; data is stored in a rectangle with dimensions corresponding to "instance" and "time".
A discrete structured geometry dataset has an instance_dim property and an id_instance_var property in accordance with the CF Standards for DSG data. The list of instance variables is available in the coordinate_vars property.
A discrete structured geometry has a value other than gridded as its sampling_geometry property.
Time-invariant data is gridded data that describes characteristics that do not change over time, like elevation or soil type, so it lacks a time dimension. Time-invariant data is always observations; climate model output necessarily has a time component.
A time-invariant dataset returns True on the is_time_invariant property.
Most time-related properties will throw errors if accessed on a time-invariant
dataset.
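Because time-related properties can raise errors on time-invariant data, it helps to check the shape of a dataset in a safe order. The sketch below is one possible ordering, not part of this module: data_shape is a hypothetical helper, SimpleNamespace objects stand in for real datasets, and the property names are the ones used in the text.

```python
from types import SimpleNamespace


def data_shape(dataset) -> str:
    """Classify a dataset's layout, touching time-related properties last."""
    # Anything other than "gridded" is a discrete structured geometry.
    if getattr(dataset, "sampling_geometry", "gridded") != "gridded":
        return "discrete structured geometry"
    # Check time invariance before is_multi_year, which is time-related
    # and so may not be safe to access on a time-invariant dataset.
    if getattr(dataset, "is_time_invariant", False):
        return "time-invariant grid"
    if getattr(dataset, "is_multi_year", False):
        return "climatology"
    return "raster timeseries"


# Stand-in objects for illustration only
elevation = SimpleNamespace(sampling_geometry="gridded", is_time_invariant=True)
normals = SimpleNamespace(
    sampling_geometry="gridded", is_time_invariant=False, is_multi_year=True
)
```

Returning early for time-invariant data means is_multi_year is never read on a dataset that has no time dimension.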
While this module is usually imported into some other project, it can be built and tested on its own for debugging or development.
git clone http://github.com/pacificclimate/nchelpers
cd nchelpers
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt -i https://pypi.pacificclimate.org/simple/
pip install .
Tests can be run with pytest.