internetofwater / about.geoconnex.us

A repository serving as a directory for work in the geoconnex.us project. It also houses the overall project board.
Creative Commons Zero v1.0 Universal
2 stars 0 forks source link

WaterML2, STA, EDR, and JSON (WIP) #21

Open ksonda opened 1 month ago

ksonda commented 1 month ago

Establishing an "IoW" opinion on data exchange, format, and web standards

TL;DR : The Way Forward

In general, the following two use cases represent at a high level what the Internet of Water is after. Both use cases focus on data that is one of four types:

  1. vector geospatial features with attributes, without time variation (e.g. locations of dams and data about them like their name, construction material, etc,)
  2. Time series data associated with vector geospatial features (e.g. weather station air temperature data)
  3. Raster data (e.g. weather radar maps)
  4. Time series associated with raster grid cells (e.g. weather radar map over time)

Depending on the use case, water data presents some peculiarities with respect to these data tyopes

Use Case 1: Web App Development for Community Engagement and Education

Context: To enhance community engagement and education regarding water resource management, a web application is developed to make water data accessible and understandable to the general public. Example: https://maps.waterreporter.org/kuqRV2QhcCg5/

Scenario: A non-profit environmental organization aims to create an interactive web application that allows users to explore local water quality and usage data, including with a primarily visual, map-oriented interface. This application utilizes JSON-encoded data due to its ease of manipulation in common JavaScript frameworks and its compatibility with geospatial information systems. The app features interactive maps, charts, and educational content to explain the implications of water quality on community health.

Implementation Issues:

- Data Exchange and Web Standards: Within the application, a simple JSON format is used to facilitate data visualization and serves as the the format or as an intermediate format for data export. Data processing is generally client-side in a web browser or using desktop software on a personal computer.

- Simplified metadata: Users understand the meaning of the data meant to be communicated by the application without being overwhelmed by technical detail. Typically metadata required within the data itself is just a station identifier and location, variable name and definition, and units. Other contextual information such as the overall purpose and methodology of the application and its data is provided via the UX independent of the data back end.

- Data Accessibility: Lightweight, standard RESTful APIs underpin the web application, which includes a GUI that allows for data downloads, and can be exposed independently for "power users". The general query pattern allows for querying based on Space/Location, time windows, and variable.

Use Case 2: Scientific Analysis for Water Resource Management

Context: In the context of managing and protecting water resources, a scientific analysis platform is developed for researchers and policymakers. Example: https://hydrogen.princeton.edu

Scenario: A governmental agency develops a platform to facilitate the analysis of water quantity and quality across multiple jurisdictions. Researchers can access historical and real-time data, use sophisticated modeling tools, and collaborate on water management strategies.

Implementation Issues:

- Data Exchange and Web Standards: The platform rests on the aggregation of data from multiple government agencies that use a common data standard designed to carry maximal metadata. Data ETL'd into a format that is optimized for high-performance computing.

- Complex Metadata: In this scenario, detailed metadata is vital to ensure that data can be critically evaluated and appropriately used in scientific analysis. Researchers and policymakers rely on comprehensive metadata to assess provenance, and suitability for specific studies or regulatory needs. Metadata typically includes information about data collection and analysis equipment, procedures, and methodologies; data quality codes; detection limits or censor codes, environmental conditions at the point and time a measurement was taken, etc.

- Data Accessibility: The application rests on a flexible programmatic query interface and/or archives of cloud-optimized data available on the web in a documented place. Data processing occurs using high-performance infrastructure. The application includes some GUI and programmatic mechanisms for exporting highly customized data views with varying levels of metadata. Query patterns and exports are enabled for space, time, variable, methodological details, and observation-level data quality, statistical, and contextual metadata.

These use cases demonstrate how the IoW's Principles can be operationalized to develop solutions that not only meet the technical requirements of data management but also address broader goals of community engagement, scientific research, and regulatory compliance in water resource management.

Among the Principles are:

The scope of the Principles is meant to extend to all water data and related use cases. Certain classes of regulatory and administrative use cases and their required data, particularly around water infrastructure and water rights administration, are so bound up in state-level regulatory contexts that universally applicable best practices with respect to data standards are rare. However, water observation (and model) data that represent the quantity and quality of water in the natural environment is generally portable across jurisdictional boundaries for a variety of water science and management use cases. Historically, this means that the Internet of Water technical team has made recommendations with respect to water observation and model data that:

Unfortunately, the current state of play in water data standards means that these principles lead to contradictory recommendations for water data content standards, water data exchange format standards, and water data web services, even when constraining to standards endorsed by the OGC. Moreover, between the two use case extremes, there is no silver bullet combination that serves both. Let us review, beginning with the underlying data model agreed to by water data community at least conceptually, and then proceed to implications for extant data exchange and web standards. To orient us to the interrelationships between all of these things, let's use a consistent dataset as a motivating example throughout. Here is a dataset daily values for streamflow and stage (height) for two separate USGS locations on the same river:

Station metadata: image

Discharge and Stage data: image

The Data Model: WaterML2 (Part 1: Timeseries)

WaterML2 (Part 1) is a data model adopted by OGC in 2012. It originated with something called the "Water Data Transfer Format", whose use case was to support the submission of water time series data from local government agencies and utilities in Australia to the Australian Bureau of Meteorology so that standardized data could be used to automatically calculate consistent water budgets so that water demand management measures could quickly be adopted in the context of a sever drought Australia faced in the mid-late 2000s.

The WDTF was later iterated on at OGC with participation from USGS and several international equivalents as well as CUAHSI as a representative of the international academic hydrology community to support water science. This became WaterML1 and later WaterML1.1, which is still used in the CUAHSI data infrastructure today since 2010, and the current WaterML2 standard (adopted 2012, revised 2014) used by a variety of government agencies and Kisters, one of the two major water data management software vendors. (telling that the other major vendor, Aquatic Informatics, doesn't use it at all...we'll get there).

It is a rather comprehensive data model that includes many entities, metadata elements, and supports/requires explicit references to external controlled vocabularies. Here is the complete content hierarchy:

The Good

WaterML2 provides a spec for very complete metadata that allows scientists and regulators to to make a decision as to what purpose the data is useful for, and to combine it with other compliant datasets. It is used for water data exchange between USGS and NOAA to serve the National Water Model, and for data aggregation for the National Groundwater Monitoring Network.

The Bad

WaterML2 only has a standard serialization in XML. The only open-spec web service that provides WaterML2 is a SOAP /XML interface called the SensorObservationService which is generally considered to be out of date and difficult to use. There is an open-source SensorObservationService provider but as you can see from its recent PR history it's kind of on life-support. The XML is so complex to render that even USGS is discontinuing it for its high-frequency "instantaneous values" web service after a database upgrade. Few, if any functional open source clients exist for WaterML2 to meet the "Web" use case elaborated in Use Case 1. Here is our example dataset as WaterML2 XML: https://gist.github.com/ksonda/91a3ae86a01003f73e05ddf1d507e249

With WaterML2 XML having little to no support on either the data client side or the data management/provider side, we need a way to express WaterML2 in a way that is more friendly to today's web developers and water data scientists, perhaps in a flavor for each of these personas. This brings us to data exchange formats!

Data Exchange Formats

Above we covered the existing XML spec for WaterML2. What other candidates are there?

  1. CSV. That's fine, probably should be an option as a data export, but XML and then JSON are used on the web for nested data structures and referencing controlled vocabularies for good reasons. Requires multiple files with bespoke linking patterns to represent complex metadata linked to time series observations. Also difficult to represent complex feature geometries well and error-free in CSV.

  2. NetCDF. This is a binary file format for multidimensional arrays like timeseries data about many locations. It is very useful for bulk data exchange, and is commonly used to distribute things like regional and continental-scale climate or streamflow model output but requires specialized libraries to read and is a bit cumbersome for use in web development.

  3. zarr. This is a new cloud-optimized multidimensional data array format that has been demonstrated to be more performant than netCDF or even parquet for this type of data. Like netCDF it requires specific libraries to read, probably more useful as a back-end than a data delivery format for web clients.

  4. GeoJSON. The current most common standard for displaying geospatial information in web maps. It was designed as plaintext JSON counterpart to common binary data formats and file databases used in GIS applications. It includes a detailed specification to represent different kinds of geometry (point, polygon, line, etc.), and then includes a specification to include attributes. As a flavor of JSON, infinite nesting is possible. However, there is no standard for how to represent nested concepts. So, vanilla GeoJSON clients are not guaranteed to present nested metadata or time series data represented in GeoJSON documents as the author intends them to. For example, there is no guidance anywhere for web developers on how to read, translate, or unify between these two documents. Feel free to copy and paste into https://geojson.io to see how they differ to a standard client:

  5. SensorThings JSON. SensorThings API is an open specification for an OData endpoint constrained by an entity-relationship model designed for IoT sensors (we'll explore in more detail in the next section). Its data model is more opinionated than netCDF, zarr, and geojson but less opinionated than WaterML2, while flexible enough to incorporate all required and optional WaterML2 metadata. Being OData, essentially any query that can be mapped to SQL select statements on the data model will return a a valid JSON formatted response. For example, see the query below and inspect the JSON response in your browser. This query produces a JSON-formatted response with essentially the same information as the WaterML2 XML example.

https://labs.waterdata.usgs.gov/sta/v1.1/Things?$filter=name%20eq%20%27USGS-02085000%27%20or%20name%20eq%20%27USGS-0209734440%27&$expand=Locations,Datastreams($expand=ObservedProperty,Sensor,Observations($top=30))

However, there is no document anywhere that recommends this query in particular. OData is so flexible (similar to GraphQL, convo for another day) that depending on the exact query, the desired entities may show up nested within each other in varying ways, so SensorThings API clients are really designed only for specific query patterns, which may not produce identical behavior across data providers who have added metadata to their SensorThings data in varying ways. Conversely, a given STA instance may provide different information to different clients nominally designed fro the same use case due to different assumptions about how the data and metadata are mapped to the STA data model. You can see this variability by exploring the different SensorThings API endpoints visualized here: https://api4inspire.k8s.ilt-dmz.iosb.fraunhofer.de/servlet/is/226/

  1. CoverageJSON. CovJSON is an open specification for "coverages" designed for web applications to display multidimensional arrays in an earth observations context. Much as GeoJSON serves as plaintext web counterpart to GIS file databases, CovJSON serves as web counterpart to netCDF. covJSON has an opinionated and simple data model that includes some requirements for specifying the ObservedProperty and how to represent geometry, time series, and depth profiles in a way that GeoJSON, netCDF, zarr, and CSV do not have opinions about. However, it is simpler and less opinionated than the WaterML2 data model, in particular with no fields in the current spec that would support station metadata and no opinion on how to represent observation-level metadata. However, unlike SensorThings JSON, its strong opinions on the arrangement of geometry, time, and variable observation arrays yield predictable and interoperable responses for covJSON web clients, and it can support raster data. Here is an example covJSON document representing our exampled data. Feel free to copy it into the covJSON playground to see how a vanilla client represents the data.

Web Services

I use the perhaps old term "web services" to encompass standards used to make data queryable on the web without downloading the entire dataset to a local place before filtering and manipulating it. This includes API standards and cloud-optimized file formats. I will ignore zarr and geoparquet as strategies for now, since we do want to support vanilla web developers used to working with JSON.

OGC-API Features

This is a standard for querying vector data, like say, locations and attributes of water monitoring stations. This standard requires that HTML and GeoJSON be output formats, although others are allowed. Our own https://reference.geoconnex.us implements this standard. The main opinion of the standard is the following RESTful resource pattern:

/collections give a list of the names, descriptions, and some provenance information for all datasets represented in the API endpoint

/collections/{collectionId} give more detailed metadata about the identified collection

/collections/{collectionId}/items gives a (default GeoJSON) feature collection of all entities in the collection (subject to pagination limits)

/collections/{collectionId}/items/{itemId} gives a specific (default GeoJSON) feature

The Good

Simple RESTful pattern. Client and server implemented by ESRI and most open source GIS softwares. Good geospatial query support

The Bad

The normal limitations of GeoJSON apply. No opinions on a canonical data model. Theoretically one could use OGC-API features to give a WaterML2 XML output format if the underlying data had all of the elements available. The GeoJSON would be nonstandard though. Complex multi-entity filtering is not possible without guidance outside of the spec. Not suitable for raster data

OGC SensorThings API

This API spec is really just OData with the following entity-relationship model image

Full documentation about how it works is here

The Good

The Bad

OGC API Environmental Data Retrieval

This is a standard to query both raster and station-based data over space, time, and parameter. Its default data output format is covJSON, although others could be supported. It is highly opinionated in which query combinations are possible and what the resulting output would be. The main opinion of the standard extends the OGC-API Features model by allowing the query of a /collections/{collectionId} by geospatial filters combined with parameter-name and datetime. Other functions allow for creating on-the-fly outputs at alternative spatial resolutions

The Good

The Bad

The Way Forward #

In consideration of all of the above, in my opinion, we should

  1. Develop a serialization of WaterML2 suitable for Use Case 1 (web map). I think this is a minimal mapping to covJSON.
  2. Develop a serialization of WaterML2 suitable for Use Case 2 (comprehensive data exchange). This could be an extra Best Practice spec for covJSON, or it may have to be a custom JSON serialization that leverages GeoJSON for geometry but otherwise might be similar to just converting the WaterML2 XML to JSON structures closely.
  3. Implement Use Case 1 and Use Case 2 output formats for pygeoapi's EDR, along with guidance for collection configuration and query combinations for station metadata and time series data, enabling EDR for explicitly hydrologic data exchange.
  4. Create a STA best practice spec for its data model to cover Use Case 2, making explicit which waterml2 elements should go where in the STA data model. Why? The existing STA database and CRUD specifications and their open source implementations are a compelling and relatively easy-to-use framework for the actual management of station-based observation data. hydroserver2 in particular provides excellent GUIs for data management for low-capacity data providers.
  5. Create a pygeoapi plugin to provide EDR Use Case 1 and Use Case 2 queries and output formats using a WaterML2-spec STA endpoint as a back end. Why? No need to implement a novel data back end when we did it for STA. Also, allows STA providers to painlessly create an interface that produces predictable, canonical data outputs for Use Case 1 via the EDR proxy while maintaining the flexible STA queries to support Use Case 2.

Related issues:

https://github.com/opengeospatial/ogcapi-environmental-data-retrieval/issues/373 https://github.com/wmo-im/wis2box/issues/703 https://github.com/opengeospatial/CoverageJSON/issues/174

webb-ben commented 1 month ago

Example EDR: https://github.com/opengeospatial/ogcapi-environmental-data-retrieval/blob/master/deployments.md

Example OAF: https://github.com/geopython/pygeoapi/wiki/LiveDeployments

ksonda commented 4 weeks ago

See this endpoint for for a lot of ODM2/WaterML2 features forced into STA https://beta.hydroserver2.org/api/sensorthings/v1.1/

webb-ben commented 4 weeks ago

Would be interested to hear opinion about the usefulness of the various EDR query options. As we explore use cases will we be implementing all of them? or a standardized subset?