Closed sebastienlanglois closed 11 months ago
(P.S. We agreed we could use French in the Issues, if you prefer!)
I don't think that the goal should be to fully standardize how we get to the data (aka Data lakes --> Data catalog), since I don't think we'll be able to accomplish that across several organizations. However, what we can and should standardize is what the data looks like at the `xarray` level (Data sources & the output of xdatasets). xhydro should be able to work on anything, including local, custom data, as long as it is formatted adequately.
So with regard to that, I have a few questions/comments on that DEH dataset:

- Why is `variable` a dimension? `xarray` already has a `ds.data_vars` property, so it does not need a dimension for that. Or is it used for something that I'm not getting?
- How is `spatial_agg` used? I feel like `outlet` vs `station` vs `polygon/basin` is too big a difference (it messes up `id`, for example) to be a dimension. Those should probably be different datasets altogether, with that information relegated to a global attribute.
- `timestep`: I don't see a world where we can have multiple timesteps (for example: daily & 1h & 15min) in the same dataset. You'd create a lot of wasted space in the file, while rendering tools such as `xr.infer_freq()` or `xclim` inaccurate.
- `time_agg` looks like xclim's ensemble statistics. It's probably not the best way to do it, but we create new variables (`streamflow_max`, `streamflow_min`, etc.) instead of a new dimension. The same goes for some standardized indicators such as `tg_max`. The one exception is `percentiles`, which are given a dimension.
- `id` --> `site` (clisops) | `station` (Hydroclimatic Atlas).
- `source` --> `realization` (xclim). The CMIP6 controlled vocabulary does use `source` to describe the model name, but `realization` covers more (member, experiment, driving model, etc.).
- `latitude/longitude` --> `lat/lon` (used in pretty much everything). At the very least, we want to be compatible with `cf-xarray`.
- `timestep` --> `frequency` (CMIP6 controlled vocabulary). This distinguishes the "model/observation timestep" from the frequency of the data currently being looked at.
- `streamflow` --> `discharge` (xclim). As discussed a while ago, this is something we could change in `xclim` instead, if "discharge" carries too strong a connotation of "regulated flow".
- You used `t2m` in one of your previous examples, because that's how ERA5-Land names its surface temperature. However, an issue here is that no convention exists for reanalysis data, so we end up having to manage a bunch of different names for the same variable. As much as possible, I'd stick to CMIP names (`tas`, in this example), since this is what `xclim` expects to receive.

As for some of your other questions:

- `cf-xarray` has a lot of nice goodies that can be called directly from `xarray` through the `ds.cf` accessor. If a dataset lacks time bounds but has a correctly formatted time dimension, you can use `ds.cf.add_bounds("time")`. Ideally, we want our coordinates to be fully understood by `cf-xarray`.
- `pandas` (and thus probably `xarray`) can understand them. It's probably just a matter of writing the right attribute(s) into the netCDF/Zarr.

Thanks @RondeauG for your input! It's super valuable. I'll do my best to answer each question/comment, and we can discuss them further at our next meeting if required.
I agree that the main objective of this issue is to standardize hydrometric data at the xarray level.
Making sure that each organization's own local data can work with xhydro/xdatasets is really important. That being said, I believe it would be beneficial for xdatasets to support both local data and hosted open-source datasets (through a data catalog) once we define the xarray specifications for hydrometric data. This would provide users with the flexibility to access various data sources from wherever they are. However, I suggest that we create a separate issue to address this specific feature and refrain from prioritizing it at the moment.
Before diving into each question, I'd like to provide a clearer context regarding the rationale behind certain design choices we have made. In an ideal scenario, as hydrologists, we would have access to combined hydrometric data (such as streamflow, water levels, etc.) from multiple providers, alongside relevant auxiliary data (like weather inputs (basin averaged or at a station), geographic data, etc.), all conveniently available in one centralized location. Unfortunately, in the current state, we are required to process data from each provider individually and manually combine them for each study, which proves to be an exceedingly time-consuming task. Keeping this in mind, we have attempted to devise a format that accommodates most of this data while still maintaining a reasonable level of ease when working with it.
However, it's true that the data format currently involves numerous dimensions, so we need to find the right balance between having too many dimensions and having too many separate datasets that end up being joined together for some use cases.
Additionally, we should perhaps distinguish between what is available in xdatasets and what we specifically require as input to xhydro. For instance, while xdatasets may offer the flexibility to choose from various timesteps, we might require xhydro to be limited to a single timestep (per object/input dataset). So xhydro could use only a subset of the capabilities offered by xdatasets.
1: It is used for the start_date and end_date coordinates because they can vary from one variable to the other. However, it is probably redundant as a dimension for the variables themselves, and we should probably remove it.
2: Consider this an edge case scenario. Let's imagine a situation where you have a weather station located at the outlet, sharing the same identifier as the hydrometric station. While this may be uncommon for public data, it sometimes occurs with our own data. In such cases, the spatial_agg dimension becomes essential for distinguishing between data specific to the station itself (e.g., point measurements like precipitation) and weighted averages across the entire basin (e.g., derived from precipitation grids). In my opinion, this particular dimension is the one that can be most readily eliminated.
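As a sketch of that edge case (toy data; the `spatial_agg` labels, id, and values below are made up for illustration):

```python
import numpy as np
import xarray as xr

# Hypothetical layout: one id carries both a station measurement and a
# basin-wide average, distinguished by the spatial_agg dimension
ds = xr.Dataset(
    {"precip": (("id", "spatial_agg", "time"), np.arange(6.0).reshape(1, 2, 3))},
    coords={"id": ["021702"], "spatial_agg": ["point", "polygon"], "time": [0, 1, 2]},
)

point = ds.precip.sel(spatial_agg="point")    # gauge measurement at the outlet
basin = ds.precip.sel(spatial_agg="polygon")  # weighted average over the basin
print(point.values.tolist(), basin.values.tolist())
```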
3, 4. I feel like timestep (or frequency) and time_agg go hand in hand. If we decide to remove the timestep dimension, we could incorporate the information as a variable attribute, but this would restrict us to having only one timestep per variable. Additionally, in such a scenario, it would be preferable to have a consistent frequency throughout the entire dataset. As mentioned before, xdatasets may support the timestep/time_agg dimensions, but for xhydro, we could limit it to a single timestep, resulting in the elimination of these dimensions.
When it comes to file size, it is true that having NaNs in the dataset would result in some space loss. However, given the relatively small size of the datasets and considering appropriate chunking, I don't believe this would be a significant concern. In the event that it does become problematic, we can explore the option of utilizing sparse to transform the dataset into a sparse array, as demonstrated in this sparse array example.
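To make the NaN-padding cost concrete, here's a small toy sketch of what forcing a daily and an hourly series onto a single time axis does (the data and variable names are made up):

```python
import numpy as np
import pandas as pd
import xarray as xr

daily = xr.DataArray(
    np.arange(3.0),
    coords={"time": pd.date_range("2000-01-01", periods=3, freq="D")},
    dims="time",
)
hourly = xr.DataArray(
    np.arange(48.0),
    coords={"time": pd.date_range("2000-01-01", periods=48, freq="h")},
    dims="time",
)

# Putting both on a single time axis forces NaN padding into the daily series
daily_aligned, hourly_aligned = xr.align(daily, hourly, join="outer")
print(int(daily_aligned.isnull().sum()))  # 46 of the 49 timestamps are padding
```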
5: The DEH was an example, but it's more likely that we'll concentrate on consolidated datasets that bring together multiple data sources. Nonetheless, I totally agree that we should figure out a way to include more attributes in those datasets.
6: I agree, will do.
7: I agree we should discuss these and use conventions wherever possible. I'll make some preliminary changes and we can discuss it further later on.
Should we set up a meeting to talk about this? I feel like there is a lot to cover!
Thanks for the context. As you say, I think that we can design xhydro around only a subset of the capabilities offered by xdatasets, then make it more complex only when it is required. Your #2, for example, might be something that affects the hydrologist, but doesn't really impact the computations in xhydro.
> It is used for the start_date and end_date coordinates because they can vary from one variable to the other.
That's indeed a tough one to implement cleanly. I can ask my colleagues to see if they have an idea. An alternative would be to compute that information as needed, since the start/end date isn't perfect either for station data (since you can sometimes have multiple missing years in-between the start/end date).
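For what it's worth, computing that information on demand is cheap. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=6, freq="D")
# Toy series with missing values at both ends
da = xr.DataArray(
    [np.nan, np.nan, 1.0, 2.0, np.nan, 3.0],
    coords={"time": time},
    dims="time",
)

valid = np.flatnonzero(da.notnull().values)
start_date = da.time.values[valid[0]]   # first timestamp with data
end_date = da.time.values[valid[-1]]    # last timestamp with data
print(start_date, end_date)
```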
> When it comes to file size, it is true that having NaNs in the dataset would result in some space loss. However, given the relatively small size of the datasets
The Atlas files can get pretty big! However, a bigger issue is that we will probably want to implement health checks on the data (such as using xclim's missing-value identification), and combining multiple frequencies in the same dataset would completely break those checks.
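A quick toy illustration of how mixing frequencies breaks frequency inference:

```python
import pandas as pd
import xarray as xr

daily = pd.date_range("2000-01-01", periods=3, freq="D")
hourly = pd.date_range("2000-01-01", periods=24, freq="h")
mixed = daily.union(hourly)  # a time axis that mixes both frequencies

print(xr.infer_freq(daily))  # "D"
print(xr.infer_freq(mixed))  # None: the frequency can no longer be inferred
```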
So let's say we keep dimensions to a minimum for xhydro. In that case, our xarray data model would need at least the following dimensions:
Some xhydro modules would require additional dimensions (such as adding lat/lon if we want grids at specific basins (mask)). xdatasets can already handle that use case, and we could use intake-esm/xscen for climate scenarios. I believe we will mostly figure these cases out as we go and adapt xdatasets/xscen accordingly!
Coordinates
> That's indeed a tough one to implement cleanly. I can ask my colleagues to see if they have an idea. An alternative would be to compute that information as needed, since the start/end date isn't perfect either for station data (since you can sometimes have multiple missing years in-between the start/end date).
It is not clear yet if all these coordinates (such as start_date and end_date) should be optional or not. However, if they are available, they would greatly enhance the efficiency and speed of calculations, especially when using a lazy loading approach. For example, users could easily filter stations based on criteria such as "all stations with at least 15 years of records" before loading them into memory. Without these coordinates, all stations would need to be loaded in memory beforehand.
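As a sketch of that filtering pattern (the `start_date`/`end_date` coordinates, station ids, and dates below are made up):

```python
import numpy as np
import xarray as xr

# Hypothetical per-station coordinates carrying the record span
ds = xr.Dataset(
    coords={
        "id": ("id", ["A", "B", "C"]),
        "start_date": ("id", np.array(
            ["1990-01-01", "2015-01-01", "2000-01-01"], dtype="datetime64[ns]")),
        "end_date": ("id", np.array(
            ["2020-01-01", "2018-01-01", "2022-01-01"], dtype="datetime64[ns]")),
    }
)

# Filter on metadata alone: no streamflow data needs to be in memory yet
years = (ds.end_date - ds.start_date) / np.timedelta64(365, "D")
subset = ds.isel(id=(years >= 15).values)
print(subset.id.values.tolist())
```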
To address the issue of stations missing records between start_date and end_date, we could potentially introduce a coordinate that indicates complete vs. incomplete years, similar to how HYDAT handles it. It would be fantastic if we could find a way to make xhydro work efficiently by leveraging the available coordinates to speed up calculations, while still providing the necessary functions even when those coordinates are not provided (at the cost of extra calculations). Furthermore, I'm definitely interested in exploring alternative approaches to handling the start_date and end_date scenario. If there's a better way of accomplishing this, I'd be excited to learn about it.
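A minimal sketch of deriving such a completeness coordinate (toy data; treating only fully observed calendar years as "complete" is an assumption, and the threshold could be relaxed):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
data = np.ones(time.size)
data[400:500] = np.nan  # punch a 100-day hole into 2001
da = xr.DataArray(data, coords={"time": time}, dims="time")

# Fraction of valid days per calendar year; flag fully observed years
valid_frac = da.notnull().resample(time="YS").mean()
complete_years = valid_frac == 1.0
print(complete_years.values.tolist())  # 2000 is complete, 2001 is not
```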
So, with these dimensions in place, we could create an xhydro object. If a user needed multiple timesteps or wanted to aggregate data over time, they could freely create as many xhydro objects as they would need.
Then, depending on the specific modules employed in xhydro, additional dimensions would likely be incorporated. For instance, if simulations are involved, a dimension such as "realization" might be added, as you suggested. Similarly, for frequency analysis, a dimension like "return_period" could be included, and so on and so forth.
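For instance, a `return_period` dimension could come out of a plain quantile computation. A toy sketch (empirical quantiles only, not a fitted distribution; the data are random):

```python
import numpy as np
import xarray as xr

# Toy annual maxima for two stations
am = xr.DataArray(
    np.random.default_rng(42).gamma(2.0, 50.0, size=(2, 30)),
    dims=("id", "year"),
    coords={"id": ["A", "B"]},
)

# Empirical quantiles relabelled as return periods, T = 1 / (1 - p)
T = np.array([2, 20, 100])
rp = am.quantile(1 - 1 / T, dim="year").rename({"quantile": "return_period"})
rp = rp.assign_coords(return_period=T)
print(dict(rp.sizes))
```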
I think we're getting pretty close to something!
Small question, but is `flag` a dimension or a variable? I've usually seen it as a variable.
I would definitely favor a coordinate that indicates which years are available, since "all stations with at least 15 years of records" can be faulty with only the start/end dates (again, because of missing data in-between).
These kinds of checks already exist in xclim and could be leveraged here: no missing data allowed, maximum number of missing data per given period, WMO rules, etc.

On a similar note, I've started to modify the raw Hydrotel outputs. This is my current work-in-progress. As far as I know, spatial coordinates such as `lat/lon` are added somewhere later. I'll also have to further change `id` and the global attributes, but in an Atlas-specific workflow; modifying those is not really within the scope of xhydro.
```
<xarray.Dataset>
Dimensions:     (id: 1, time: 67935)
Coordinates:
  * id          (id) <U11 '1'
  * time        (time) datetime64[ns] 1999-10-01 ... 2022-12-30T18:00:00
Data variables:
    streamflow  (time, id) float32 ...
Attributes:
    description:              Variable de sortie simulation Hydrotel
    initial_simulation_path:  path_to_file
    creation_time:            29-05-2023 13:38:31
```
`streamflow` currently looks like this. I'll prepare an Issue/PR in `xclim` in the coming days to support `streamflow` instead of `discharge`:
```
<xarray.DataArray 'streamflow' (time: 67935, id: 1)>
[67935 values with dtype=float32]
Coordinates:
  * id       (id) <U11 '1'
  * time     (time) datetime64[ns] 1999-10-01 ... 2022-12-30T18:00:00
Attributes:
    units:          m^3 s-1
    description:    Streamflow at the outlet of the river reach
    standard_name:  outgoing_water_volume_transport_along_river_channel
    long_name:      Streamflow
```
Thanks, I agree this is starting to look like something interesting!
For source (as in data source) and flag, the good thing about having them as dimensions is that you can do this:
```python
ds.streamflow.sel(source='DEH', flag='R')
```
which seems cleaner than relying on ds.where, and I think it would also be more computationally efficient (though we could test that to be sure). I've sometimes seen them used as variables as well, but I didn't think that was better, and it creates a lot more variables.
Also, we have to figure out how to keep the data source. If we only have one (e.g., DEH), we could store the info as attributes, but in the general case we will have combined data from multiple sources. For now, I don't see a better way than to use a dimension, but maybe we can find an alternative.
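To compare the two styles concretely, here's a self-contained sketch with toy data (the dimension layout, labels, and values are made up):

```python
import numpy as np
import xarray as xr

# Toy dataset with "source" and "flag" as dimensions
ds = xr.Dataset(
    {"streamflow": (("source", "flag", "time"), np.arange(8.0).reshape(2, 2, 2))},
    coords={"source": ["DEH", "HQ"], "flag": ["R", "E"], "time": [0, 1]},
)

# Dimension style: plain label-based selection drops everything else
dim_style = ds.streamflow.sel(source="DEH", flag="R")

# Variable/coordinate style would instead rely on ds.where, which keeps
# the full shape and fills the rest with NaN
var_style = ds.streamflow.where((ds.source == "DEH") & (ds.flag == "R"))

print(dim_style.shape, var_style.shape)
```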
Hi! I'm late to the party, but it seems this is progressing nicely. I think the best approach is for xdatasets to be the more "general" tool, with xhydro using only the part of the functionality it needs. Having source as a dimension seems cleaner than the alternative: that way we can more easily parse through the data and keep things separate if need be. Perhaps we could query David on this; he has a lot of experience in this type of data management. In any case, I think this is going well and I'll keep an eye out for updates!
I think we can close this issue. We have documented here how the inputs are expected to be in order to work with the frequency analysis subpackage. I suggest we open separate issues for other subpackages once we get there.
Addressing a Problem?
While netCDF/Zarr and CF standards are widely used for storing and exchanging n-dimensional arrays in the climate sciences, there is presently no comparable standard or specification for n-dimensional hydrometric data (WaterML exists, but it consists of XML files and still requires a lot of processing to use with the modern Python stack).
Furthermore, as we report to diverse organizations, each entity already has its own methods for organizing and sharing hydrometric data (e.g., miranda for Ouranos).
To foster collaboration, facilitate development, enable rigorous testing with real data, and enhance the reproducibility of studies conducted with xhydro, there are substantial benefits to standardizing hydrometric data and making it universally accessible on the internet through open-source means wherever feasible.
More specifically, this would involve:
While it may appear to be a significant undertaking, I have already dedicated several months to implementing a solution, building on the advancements achieved in PAVICS/PAVICS-Hydro. I am excited to present what I have so far and to seek feedback from experts in the field.
Potential Solution
Here is a simplified overview of the solution currently being developed, which follows a similar approach to accessing large-scale climate data as described in this GitHub issue:
Here is an example for an actual study that we are working on right now. The requirements are:
This can be achieved with the following query, leveraging xdatasets's capabilities:
Below is the list of retrieved data, which can be easily viewed:
The hydrometric data specification presented above is the result of extensive deliberation and collaboration with @TC-FF, drawing on our real-world experience with this kind of data. Through this process, we have determined that this format enables the representation of a wide range of hydrometric data types (flow rates, water levels, basin-scale or station-specific weather data), at various time intervals, with different temporal aggregations (maximum, minimum, mean, sum, etc.) and spatial aggregations (such as a point (outlet or station) or polygon (basin)), and includes information about the data source. We are seeking feedback on the proposed data specification for representing hydrometric datasets, including suggestions for improved variable naming, adherence to conventions, and potential modifications to the data model itself. This can include, for example, adding timezone info, time bounds, etc. Your input on these aspects would be greatly appreciated.
Also note that, we intend to have approximately 20 000 daily-updated gauged basins in xdatasets with precomputed climate variables at each basin (temperatures, precipitation, radiation, dew point, SWE, etc.) from different sources (ERA5, ERA5-Land, Daymet, etc.) by the end of July. To retrieve the additional variables, one will simply need to include them in the query. The majority of basins are located in North America, with additional regions worldwide utilized for training deep learning algorithms. For this, we build upon the work already accomplished in HYSETS and CARAVAN, but our focus is on making it operational and easily queryable.
Additional context
There are many more details to be said regarding the various components of the presented solution. Additionally, xdatasets offers a broader range of capabilities (such as working directly with climate datasets like ERA5) than the simple example presented here, with even more ambitious plans on the roadmap. However, considering the length of this post, I will conclude here to let you absorb the details. If you have any questions or suggestions, please don't hesitate to reach out.
Contribution