opengeospatial / ideas

Public repository for Innovation Program Ideas

R&D for Earth System Grid Federation #67

Open huard opened 6 years ago

huard commented 6 years ago

The Earth System Grid Federation (ESGF) collectively manages Earth science observations and simulation data sets. The participating institutions jointly develop and operate a P2P infrastructure hosting petabytes of scientific data. This infrastructure is composed of nodes offering a large number of science-related services, such as metadata compliance checking, data publishing and updates, version control, and errata.

Over the years, the ESGF has learned a great deal about running an international, federated infrastructure. I suspect that some of this experience could be contributed back to the OGC community and inform future Testbed specifications. I also suspect that the Testbed platform could be leveraged by ESGF to accelerate its development cycle and reach a wider developer audience.

One potential area of development would be federated user services. ESGF nodes will increasingly support analytical services: they won't only distribute raw data to users, but will also offer pre-processing algorithms (e.g. averaging, subsetting, interpolation). A user request for averaged variables over multiple models would thus have to locate the data store for each input file, map each file to a server offering the averaging algorithm, execute the processes, and aggregate the output so that it's straightforward for the user to collect the results. This raises a number of interesting questions about the criteria used to select analytical services, reproducibility, integrating on-demand cloud providers, caching, collecting error logs, relaunching failed jobs, etc.
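
To make the workflow above more concrete, here is a minimal sketch of the locate / map / execute / aggregate steps, with retries and an error log. It is only an illustration under stated assumptions, not how ESGF implements this: `DATA_CATALOG`, `SERVICE_REGISTRY`, the node names, the file names, and the `execute` callback are all hypothetical, and the selection rule (prefer the node that already hosts the data, otherwise take the first node offering the process) is just one of the possible criteria raised above.

```python
from dataclasses import dataclass
from statistics import fmean

# Hypothetical catalog: input file -> node that stores it.
DATA_CATALOG = {
    "modelA/tas.nc": "node-1",
    "modelB/tas.nc": "node-2",
    "modelC/tas.nc": "node-3",
}

# Hypothetical registry: node -> processes it offers.
SERVICE_REGISTRY = {
    "node-1": {"average", "subset"},
    "node-2": {"subset"},
    "node-3": {"average"},
}


@dataclass
class Job:
    input_file: str
    data_node: str      # where the file lives
    compute_node: str   # where the process will run


def plan(files, process):
    """Locate each file, then map it to a node offering `process`."""
    jobs = []
    for f in files:
        data_node = DATA_CATALOG[f]
        if process in SERVICE_REGISTRY.get(data_node, set()):
            compute_node = data_node  # run where the data already is
        else:
            candidates = sorted(
                n for n, procs in SERVICE_REGISTRY.items() if process in procs
            )
            if not candidates:
                raise LookupError(f"no node offers {process!r}")
            compute_node = candidates[0]  # assumed selection criterion
        jobs.append(Job(f, data_node, compute_node))
    return jobs


def run_with_retries(job, execute, max_attempts=3):
    """Call execute(job); retry on RuntimeError and record each failure."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(job), errors
        except RuntimeError as exc:
            errors.append(f"{job.input_file} attempt {attempt}: {exc}")
    raise RuntimeError("; ".join(errors))


def federated_average(files, execute):
    """Run the averaging process per file and aggregate the outputs."""
    results, error_log = {}, []
    for job in plan(files, "average"):
        value, errors = run_with_retries(job, execute)
        results[job.input_file] = value
        error_log.extend(errors)
    return {
        "per_model": results,
        "ensemble_mean": fmean(results.values()),
        "error_log": error_log,
    }
```

Here `execute` stands in for whatever remote call a node would actually expose; passing it in as a callback keeps the planning and retry logic independent of any particular service interface.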

perrypeterson commented 6 years ago

Has ESGF considered the use of the DGGS Standard for federating user services? DGGS as an integration engine could act to federate all ESGF data sources without colocating data. As demonstrated in the Arctic Spatial Data Pilot Project (https://vimeo.com/204787821), DGGS is an enabler of multi-source/distributed scientific analytical services.

huard commented 6 years ago

I can't answer that question, but my understanding is that the spatial representation of datasets has not been identified as problematic.

perrypeterson commented 6 years ago

Indeed, spatial representations are not problematic as single sources. However, if the intent is to integrate data for multi-source analysis - data integration tends to be an expensive requirement for almost all spatial analysis - then DGGS solves that grand challenge. This is not to say all data should be pre-encoded to a DGGS. DGGS are designed to be data-agnostic data integration engines and can be placed on the client side. Efficiencies exist if DGGS is used on the server side, although none of the data sources used in the USGS/NRCan pilot were pre-encoded. There have been a lot of discussions lately around DGGS-aligned datacubes, which would be a case to support its use on the server side. I understand datacubes rely on gridding data anyway, so why not use an optimized equal-area DGGS?

huard commented 6 years ago

Interesting. While I totally agree that for users, regridding can be a pain, I wanted this issue to focus on server-side considerations for large, federated data providers. Providing multiple output formats should definitely be part of the discussion, but my point was rather to outline the whole gamut of challenges faced by ESGF and see how the OGC community could contribute.

tomLandry commented 6 years ago

I think you both express the ideas from different sides: providers and consumers of data. ESGF falls mostly in the provider category, and would in my opinion constitute an exemplar of a federated infrastructure that has been running for several years now. For example, some issues have been solved and demonstrated for a while now (https://github.com/opengeospatial/ideas/issues/53), while others are being actively developed (https://github.com/opengeospatial/ideas/issues/29). As with several other major environmental data providers - and this has been discussed in the last OGC TC - modernization is also an important topic. In that sense, recommendations and work conducted in the NextGen thread (and even the EOC thread) can shine a light on ESGF challenges.

To me, Perry's idea comes more from a data consumer perspective. For advanced integrated applications doing data mashups from different and heterogeneous sources, DGGS does seem to make things easier and smoother compared to regridding/datacubing/georeferencing pipelines. Can some of this work of consuming relevant and timely climate data be done from the client/consumer perspective (https://github.com/opengeospatial/ideas/issues/59)? Probably. But that is to me a long way (or at least a quite indirect way) to advance current ESGF challenges.

pebau commented 6 years ago

Interesting to learn! The goal described above - location-transparent user access to disparate data assets - sounds like what the EarthServer initiative has established: a user sends an analytics query to any member of the federation, and the federation jointly resolves the query (including joins between remote datacubes) and sends the result back to the user. Such federated queries have been demonstrated, for example, by combining precipitation data from ECMWF with Landsat 8 data from NCI Australia. See here for a reference: P. Baumann, D. Misev, V. Merticariu, B. Pham Huu: Datacubes: Towards Space/Time Analysis-Ready Data. In: J. Doellner, M. Jobst, P. Schmitz (eds.): Service Oriented Mapping - Changing Paradigm in Map Production and Geoinformation Management, Springer Lecture Notes in Geoinformation and Cartography, 2018.