pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Report from NASA workshop "Enabling Analytics in the Cloud for Earth Science Data" #212

Closed rabernat closed 6 years ago

rabernat commented 6 years ago

In February I attended a NASA workshop on Enabling Analytics in the Cloud for Earth Science Data. Below is a report of the meeting. This is highly relevant to our ongoing efforts.

Cloud Analytics Workshop Report.pdf

Some highlights from the recommendations:

NASA Earth Sciences (HQ, ESDIS, and DAACs) should work toward the adoption of a service-based architecture that consists of data services and analytics services. Data services should adopt and advance industry specifications that support preprocessing and access to cloud-optimized data stores. Collaboration with external data partners is paramount to support interoperability with other data holdings through the adoption of common APIs, terms, and definitions. Analytics services build on the data services to provide methods that reduce and transform the data for the purpose of scientific exploration. These services should be developed within a collaborative NASA ecosystem that supports innovation, incentivizes reuse of services, and strives towards a common set of vocabularies.

This challenge led to discussion of the need for “Analytics Optimized Data Stores” (AODS) to help address both the Volume and Variety challenges at the same time. Workshop participants defined AODS as data stored in a fashion that:

  1. Minimizes the need for data-wrangling and preprocessing for a large community of users
  2. Uses cloud-native storage forms to support fast & parallel access
  3. Utilizes optimized storage structures for queries relevant to their users in order to enable iterative science analysis
  4. Exploits cost-effective, affordable (including egress / transfer costs), and sustainable methods, and can incorporate non-NASA data from other agencies, nations, and the user community at large.

NASA should work toward Analytics-Optimized Data Stores for data to serve as building blocks for analytic tools and services, with a focus on building and exposing APIs (web-services). Construction of AODS must be transparent with well-documented provenance to foster trust by the user community. The refined AODS definition and accompanying Extract-Transfer-Load/Preprocessing services should be developed with input from a survey of analysis use cases to identify common preprocessing operations needed to support the analyses. The use cases can be categorized by data characteristics (spatial ROI, temporal coverage, swath/grid/other), transformation operations, software languages/packages used, target user communities, and target analysis types (e.g., time series analysis, uncertainty quantification, statistical analyses). Pilot studies should be undertaken to identify and prototype a framework of techniques and software for the translation of archived data to AODS; pilot results can help optimize data format and structure for cloud storage.

jhamman commented 6 years ago

@rabernat - thanks for sharing. It's nice to see substantial overlap with the current Pangeo efforts. Are there additional, tangible ways that we can engage with this effort?

rabernat commented 6 years ago

I think there are at least two ways this can impact us:

guillaumeeb commented 6 years ago

Yes, thanks for sharing. I believe we are in line with all of that at CNES too, and that's part of why I closely follow the Pangeo effort. I'm particularly interested in work that may be done in the context of the SWOT mission, as this is a US-French collaboration. Internally, I'm doing some lobbying to find a use case in the context of this mission for testing Dask and Xarray at scale.

tomLandry commented 6 years ago

Cool stuff. Thanks for sharing indeed. I'll try to get "the machine" aligned with this. FYI, I recognize several of these names and know some of them, as an Earth System Grid Federation executive committee member: https://esgf.llnl.gov/committee.html

rabernat commented 6 years ago

Andy Bingham, a program manager at NASA, is mounting a "Get Ready For SWOT" program, and he has confirmed that we will get some funding to deploy and test Pangeo on "virtual" SWOT data, focusing on analytics-optimized storage in the cloud. @guillaumeeb, it would be great to make this collaborative with our French counterparts, since that is very much the spirit of SWOT. Let me know if I can do anything to help move your efforts forward.

tomLandry commented 6 years ago

I was not aware of the SWOT program. I notice partnerships with the Canadian Space Agency (CSA). We have done projects with CSA at CRIM in the past, and might do so again in the future. Also, I'd like to report that, as a participant in OGC Testbed-14, CRIM is sponsored by the European Space Agency (ESA) to deliver implementations and best practices in the context of Thematic Exploitation Platforms (TEP).

willirath commented 6 years ago

@lesommer: Isn't NATL60 part of SWOT? Would be very interesting to see how Pangeo handles data at this scale. (And I'd love to have some NEMO output to play with.)
