Standardizing and sharing n-dimensional array hydrometric datasets as inputs for XHydro

hydrologie / xhydro

Hydrological analysis library built with xarray.

https://xhydro.readthedocs.io

Apache License 2.0

24 stars 4 forks source link

Standardizing and sharing n-dimensional array hydrometric datasets as inputs for XHydro #17

Closed sebastienlanglois closed 11 months ago

sebastienlanglois commented 1 year ago

Addressing a Problem?

While netCDF/zarr and CF standards are widely used for storing and exchanging n-dimensional arrays in the realm of climate sciences, there is presently no comparable standard or specification in place for n-d array hydrometric data (WaterML exists but it consist of XML files and still require lots of processing to use with modern python stack).

Furthermore, as we report to diverse organizations, each entity already has its own unique methods for organizing and sharing hydrometric data (ex: miranda for Ouranos).

To foster collaboration, facilitate development, enable rigorous testing with real data, and enhance reproducibility of studies conducted through Xhydro, substantial benefits can be gained by standardizing hydrometric data and ensuring its universal accessibility on the internet through open-source means wherever feasible.

More specifically, this would involve:

Establishing a specification (using xarray) for representing hydrometric data in n-dimensional arrays.
Implementing a continuous updating process for hydrological datasets through DataOps practices.
Storing the data in high-performance servers, accessible to all stakeholders, such as a cloud-based data lake.
Creating a data catalog, leveraging tools like Intake, to facilitate easy querying of the catalog and retrieval of requested data using lazy loading or in-memory.
Enabling the execution of common operations (e.g., filters, selections, geospatial operations) to enhance data analysis capabilities and productivity.

While it may appear as a significant undertaking, I have already dedicated several months to implementing a solution, drawing upon the advancements achieved in PAVICS/PAVICS-Hydro. I am excited to present what I have so far and seek valuable feedback from experts in the field.

Potential Solution

Here is a simplified overview of the solution currently being developed, which follows a similar approach to accessing large-scale climate data as described in this GitHub issue:

Data sources : We have implemented multiple data pipelines for the continuous acquisition of data (mostly daily)
Data lakes : The data is stored in Wasabisys high-bandwidth data lakes. Unlike other cloud providers, data in Wasabisys can be extracted from the cloud (either on a local machine or to other cloud providers) at no cost (no egress or API requests fees), which offers a significant advantage in making the data truly accessible and open-source.
Data catalog : Intake is utilized to create a data catalog that abstracts all configuration and data access drivers for users, ensuring seamless data access for the end user. Currently, the data catalog resides here (docs). Thanks to the integration of intake plugins, we have the capability to directly reference datasets from various sources, including the Planetary Computer and the majority of datasets available in PAVICS from its THREDDS Data Server.
Xdatasets : While it is possible to access data directly through the catalog, utilizing the xdatasets library (docs) provides additional capabilities to perform common operations on multiple sites and datasets simultaneously, such as selection, filtering, spatial selection, resampling (spatial and temporal), data averaging, and more. It becomes a one-stop shop for all hydrometric datasets access and preparation. (Note for Ouranos's team: Initially, we employed clisops for certain operations (ex: weighted spatial averaging); however, due to occasional instability and slow performance attributed to xesmf, we ultimately switched to using xagg instead. Our tests have shown that xagg yields the same results (for weighted spatial averaging) but with significantly improved speed and stability.)

Here is an example for an actual study that we are working on right now. The requirements are:

Streamflow data from DEH
Only select hydrological regions 03 to 06
Start date is 1970-01-01
At least 15 years of data
Natural regime or influenced (daily) only

This can be achieved simply with the following query by leveraging xdatasets's capabilities :

Below is the list of retrieved data that can be easily viewed :
Screenshot from 2023-06-17 00-50-16

The hydrometric data specification presented above is the result of extensive deliberation and collaboration with @TC-FF drawing from our real-world experience of utilizing this kind of data.. Through this process, we have determined that this format enables the representation of a wide range of hydrometric data types (flow rates, water levels, basin-scale or station-specific weather data), at various time intervals, with different temporal aggregations (maximum, minimum, mean, sum, etc.), spatial aggregations (such as a point (outlet or station) or polygon (basin)), and includes information about the data source. We are seeking valuable feedback on the proposed data specification for representing hydrometric datasets, including suggestions for improved variable naming, adherence to conventions, and potential modifications to the data model itself. This can include for example adding timezones info, time bounds, etc. Your input on these aspects would be greatly appreciated.

Also note that, we intend to have approximately 20 000 daily-updated gauged basins in xdatasets with precomputed climate variables at each basin (temperatures, precipitation, radiation, dew point, SWE, etc.) from different sources (ERA5, ERA5-Land, Daymet, etc.) by the end of July. To retrieve the additional variables, one will simply need to include them in the query. The majority of basins are located in North America, with additional regions worldwide utilized for training deep learning algorithms. For this, we build upon the work already accomplished in HYSETS and CARAVAN, but our focus is on making it operational and easily queryable.

Additional context

There is much more details to be said regarding the various components of the presented solution. Additionally, xdatasets offers a broader range of capabilities (such as working with climate datasets such as ERA5 directly) than the simple example presented here, with even more ambitious plans in the roadmap. However, considering the length of this post, I will conclude here to let you absorb all the details. If you have any questions or suggestions, please don't hesitate to reach out.

Contribution

[X] I would be willing/able to open a Pull Request to contribute this feature.

RondeauG commented 1 year ago

(P.S. On s'était donné le droit d'y aller en français dans les Issues, si tu préfères !)

I don't think that the goal should be to fully standardize how we get to the data (aka Data lakes --> Data catalog), since I don't think that we'll be able to accomplish that between several organizations. However, what we can and should standardize is what the data looks like on the xarray level (Data sources & the output of Xdatasets). Xhydro should be able to work on anything, including local, custom data, as long as it is formatted adequately.

So in regards to that, I have a few questions/comments on that DEH dataset:

Why is variable a dimension? xarray already has a ds.data_vars property, so it does not need a dimension for that. Or is it used for something that I'm not getting?
How is spatial_agg used? I feel like outlet vs station vs polygon/basin is too big a difference (messes up id, for example) to be a dimension? Those should probably be different datasets altogether, and that information relegated to a global attribute.
Same thing with timestep. I don't see a world where we can have multiple timesteps (for example: daily & 1h & 15min) in the same dataset. You'd create a lot of wasted space in the file, while rendering tools such as xr.infer_freq() or xclim inaccurate.
What you have under time_agg looks like xclim's ensemble statistics. It's probably not the best way to do it, but we create new variables (streamflow_max, streamflow_min, etc.) instead of a new dimension. Same goes for some standardized indicators such as tg_max. The one exception is with percentiles, which are given a dimension.
I see that the global attributes are empty. Unless you are dealing with multiple data sources, I'd put a lot of info in there (institution, source, frequency, licence, disclaimers, history, etc.). You can look at CMIP or CORDEX data to have an idea of the kind of information that usually goes in there.
We can't see the variable attributes here, but I'd follow CF standards as much as possible.
Name differences to be discussed:
- id --> site (clisops) | station (Hydroclimatic Atlas)
- source --> realization (xclim). CMIP6 controlled vocabulary does use source to describe the model name, but realization covers more (member, experiment, driving model, etc.)
- latitude/longitude --> lat/lon (in pretty much everything). At the very least, we want to be compatible with cf-xarray.
- timestep --> frequency (CMIP6 controlled vocabulary). This is to distinguish between the "model/observation timestep" and the frequency of the data that is currently being looked at.
- streamflow --> discharge (xclim). As discussed a while ago, this is something that we could change in xclim if there is too strong of a connotation to "regulated flow" with "discharge".
- I also recall seeing things like t2m in one of your previous examples, because that's how ERA5-Land names their surface temperature. However, an issue here is that no convention exists for reanalysis data, so we end up having to manage a bunch of different names for the same variable. As much as possible, I'd stick to CMIP names (tas, in this example), since this is what xclim expects to receive.

As for some of your other questions:

cf-xarray has a lot of nice goodies that can be called directly with xarray by using ds.cf. If a dataset lacks time bounds, but has a correctly formatted time dimension, you can use ds.cf.add_bounds("time"). Ideally, we want our coordinates to be fully understood by cf-xarray.
I've never used timezones, but I know that pandas (and thus probably xarray) can understand them. It's probably just a matter of writing the right attribute(s) into the netCDF/Zarr.

sebastienlanglois commented 1 year ago

Thanks @RondeauG for your input! It's super valuable. I'll make my best to answer each questions/comments and we can discuss them further at our next meeting if required.

Data catalog

I agree that the main objective of this issue is to standardize hydrometric data at the xarray level.

Making sure that each organization's own local data can work with xhydro/xdatasets is really important. That being said, I believe it would be beneficial for xdatasets to support both local data and hosted open-source datasets (through a data catalog) once we define the xarray specifications for hydrometric data. This would provide users with the flexibility to access various data sources from wherever they are. However, I suggest that we create a separate issue to address this specific feature and refrain from prioritizing it at the moment.

Context

Before diving into each question, I'd like to provide a clearer context regarding the rationale behind certain design choices we have made. In an ideal scenario, as hydrologists, we would have access to combined hydrometric data (such as streamflow, water levels, etc.) from multiple providers, alongside relevant auxiliary data (like weather inputs (basin averaged or at a station), geographic data, etc.), all conveniently available in one centralized location. Unfortunately, in the current state, we are required to process data from each provider individually and manually combine them for each study, which proves to be an exceedingly time-consuming task. Keeping this in mind, we have attempted to devise a format that accommodates most of this data while still maintaining a reasonable level of ease when working with it.

However, it's true that the data format currently involves numerous dimensions and so we need to find the right balance between having too many dimensions vs. too many different separate datasets that end up being join together for some use cases.

Additionally, we should perhaps consider distinguishing between what is available in xdatasets and what we specifically require to be returned for xhydro. For instance, while xdatasets may offer the flexibility to choose from various timesteps, we might required xhydro to be limited to a single timestep (per object/input dataset). So xhydro could use only a subset of the capabilities offered by xdatasets.

Questions

1: It is used for the start_date and end_date coordinates because they can vary from one variable to the other. However, it is probably redundant as a dimension for the variables themselves and we should probably remove them.

2: Consider this as an edge case scenario. Let's imagine a situation where you have a weather station located at the outlet, sharing the same identifier as the hydrometric station. While this may be uncommon for public data, it sometimes occur with our own data. In such cases, the spatial_agg dimension becomes essential for distinguishing between data specific to the station itself (e.g., point measurements like precipitation) and weighted averages across the entire basin (e.g., derived from precipitation grids). In my opinion, this particular dimension is the one that can be most readily eliminated.

3, 4. I feel like timestep (or frequency) and time_agg go hand in hand. If we decide to remove the timestep dimension, we could incorporate the information as a variable attribute, but this would restrict us to having only one timestep per variable. Additionally, in such a scenario, it would be preferable to have a consistent frequency throughout the entire dataset. As mentioned before, xdatasets may support the timestep/time_agg dimensions, but for xhydro, we could limit it to a single timestep, resulting in the elimination of these dimensions.

When it comes to file size, it is true that having NaNs in the dataset would result in some space loss. However, given the relatively small size of the datasets and considering appropriate chunking, I don't believe this would be a significant concern. In the event that it does become problematic, we can explore the option of utilizing sparse to transform the dataset into a sparse array, as demonstrated in this sparse array example

5 : The DEH was an example, but it's more likely that we'll be concentrating on consolidated datasets that bring together multiple data sources. Nonetheless, I totally agree that we should figure out a way to include more attributes in those datasets.

6 : I agree, will do.

7 : I agree we should discuss about these and use conventions wherever possible. I'll make some preliminary changes and we can discuss about it further later on.

Should we set up a meeting to talk about this ? I feel like there is a lot to cover!

RondeauG commented 1 year ago

Thanks for the context. As you say, I think that we can design xhydro on only a subset of the capabilities offered by xdatasets, then make it more complex only when it is required. Your #2, for example, might be something that affects the hydrologist, but doesn't really impact the computations in xhydro.

It is used for the start_date and end_date coordinates because they can vary from one variable to the other.

That's indeed a tough one to implement cleanly. I can ask my colleagues to see if they have an idea. An alternative would be to compute that information as needed, since the start/end date isn't perfect either for station data (since you can sometimes have multiple missing years in-between the start/end date).

When it comes to file size, it is true that having NaNs in the dataset would result in some space loss. However, given the relatively small size of the datasets

The Atlas files can get pretty big! However, a bigger issue is that we will probably want to implement health checks on the data (such as using xclim's missing value identification), and combining multiple frenquencies in the same dataset would completely break those chekups.

sebastienlanglois commented 1 year ago

Dimensions to instantiate an xhydro object

So let's say we were to keep dimensions at a minimum for xhydro. In that case, our xarray data model would need, at a minimum, the following dimensions:

time
id (or sites)
flag (was missing in the example, but is quite important)
data_source (or source or ...) (as in the provider of the data)

Some xhydro modules would require additional dimensions (such as adding lat/lon if we want grids at specific basins (mask)). xdatasets can already handle that use case and we could use intake-esm/xscen for climate scenarios. I believe we will mostly figure these cases out as we go and adapt xdatasets/xscen consequently!

Coordinates

That's indeed a tough one to implement cleanly. I can ask my colleagues to see if they have an idea. An alternative would be to compute that information as needed, since the start/end date isn't perfect either for station data (since you can sometimes have multiple missing years in-between the start/end date).

It is not clear yet if all these coordinates (such as start_date and end_date) should be optional or not. However, if they are available, they would greatly enhance the efficiency and speed of calculations, especially when using a lazy loading approach. For example, users could easily filter stations based on criteria such as "all stations with at least 15 years of records" before loading them into memory. Without these coordinates, all stations would need to be loaded in memory beforehand.

To address the issue of stations missing records between start_date and end_date, we could potentially introduce a coordinate that indicates complete vs. incomplete years, similar to how HYDAT handles it. It would be fantastic if we could find a way to make xhydro work efficiently by leveraging the available coordinates to speed up calculations, while still providing the necessary functions even when those coordinates are not provided (at the cost of extra calculations). Furthermore, I'm definitely interested in exploring alternative approaches to handling the start_date and end_date scenario. If there's a better way of accomplishing this, I'd be excited to learn about it.

Modules

So, with these dimensions in place, we could create an xhydro object. If a user needed multiple timesteps or wanted to aggregate data over time, they could freely create as many xhydro objects as they would need.

Then, depending on the specific modules employed in xhydro, it is likely that additional dimensions would be incorporated. For instance, if simulations are involved, a dimension such as "realizations" might be added such as what you suggested. Similarly, for frequency analysis, a dimension like "return_period" could be included and so on and so forth.

RondeauG commented 1 year ago

I think we're getting pretty close to something!

Small question, but is flag a dimension or a variable? I've ususally seen it as a variable.
I would definitely favor a coordinate that indicates which years are available, since "all stations with at least 15 years of records" can be faulty with only the start/end dates (again, because of missing data in-between).
- Multiple methods have been coded in xclim and could be leveraged here: no missing data allowed, maximum number of missing data per given period, WMO rules, etc.
- A potential solution to computation costs could be to use the coordinate if it already exists, but compute it if it is missing or if the user forces the computation to occur.

RondeauG commented 1 year ago

On a similar note, I've started to modify the raw Hydrotel outputs. This is my current work-in-progress. As far as I know, spatial coordinates such as lat/lon are added somewhere later. I'll also have to further change id and the global attributes, but in an Atlas-specific workflow. Modifying those is not really within the scope of xhydro.

<xarray.Dataset>
Dimensions:     (id: 1, time: 67935)
Coordinates:
  * id          (id) <U11 '1'
  * time        (time) datetime64[ns] 1999-10-01 ... 2022-12-30T18:00:00
Data variables:
    streamflow  (time, id) float32 ...
Attributes:
    description:              Variable de sortie simulation Hydrotel
    initial_simulation_path:  path_to_file
    creation_time:            29-05-2023 13:38:31

streamflow currently looks like this. I'll prepare an Issue/PR in xclim in the coming days to support streamflow instead of discharge:

<xarray.DataArray 'streamflow' (time: 67935, id: 1)>
[67935 values with dtype=float32]
Coordinates:
  * id       (id) <U11 '1'
  * time     (time) datetime64[ns] 1999-10-01 ... 2022-12-30T18:00:00
Attributes:
    units:          m^3 s-1
    description:    Streamflow at the outlet of the river reach
    standard_name:  outgoing_water_volume_transport_along_river_channel
    long_name:      Streamflow

sebastienlanglois commented 1 year ago

Thanks, I agree this is starting to look like something interesting!

For source (as in data source) and flag, the good thing about having them as dimensions is you can do this :

ds.streamflow.sel(source='DEH', flag='R')

which seems cleaner than relying on ds.where and I think it would also be more computationally efficient (but we could test it to be sure). I've seen them used as variables sometimes as well but didn't think it was better and it creates a lot more variables.

Also, we have to figure out how to keep the data source. If we only have one (ex: DEH), then we could store the info as attributes, but in the general case, we will have combined data from multiple sources. For now, I don't see a better way than to use a dimension but maybe we can find another alternative.

richardarsenault commented 1 year ago

Hi! I'm late to the party, but it seems that this is progressing nicely. I think the idea that xdatasets be a more "general" tool and xhydro can use only the part of the functionality that is needed would be the best approach. The source as dimension I think is cleaner than the alternative. That way we can more easily parse through the data and keep things separate if need be. Perhaps we could query David on this, he has a lot of experience in this type of data management. In any case, I think that this is going well and I'll keep an eye out for updates!

sebastienlanglois commented 11 months ago

I think we can close this issue. We have documented here how the inputs are expected to be in order to work with the frequency analysis subpackage. I suggest we open separate issues for other subpackages once we get there.