pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Paper: Requirements for a global data infrastructure in support of CMIP6 #179

Closed rabernat closed 6 years ago

rabernat commented 6 years ago

A new discussion paper has appeared in Geoscientific Model Development Discussions, entitled "Requirements for a global data infrastructure in support of CMIP6":

https://www.geosci-model-dev-discuss.net/gmd-2018-52/

The World Climate Research Programme (WCRP)'s Working Group on Climate Modeling (WGCM) Infrastructure Panel (WIP) was formed in 2014 in response to the explosive growth in size and complexity of Coupled Model Intercomparison Projects (CMIPs) between CMIP3 (2005-06) and CMIP5 (2011-12). This article presents the WIP recommendations for the global data infrastructure needed to support CMIP design, future growth and evolution. Developed in close coordination with those who build and run the existing infrastructure (the Earth System Grid Federation), the recommendations are based on several principles beginning with the need to separate requirements, implementation, and operations. Other important principles include the consideration of data as a commodity in an ecosystem of users, the importance of provenance, the need for automation, and the obligation to measure costs and benefits. This paper concentrates on requirements, recognising the diversity of communities involved (modelers, analysts, software developers, and downstream users). Such requirements include the need for scientific reproducibility and accountability alongside the need to record and track data usage for the purpose of assigning credit. One key element is to generate a dataset-centric rather than system-centric focus, with an aim to making the infrastructure less prone to systemic failure. With these overarching principles and requirements, the WIP has produced a set of position papers, which are summarized here. They provide specifications for managing and delivering model output, including strategies for replication and versioning, licensing, data quality assurance, citation, long-term archival, and dataset tracking. They also describe a new and more formal approach for specifying what data, and associated metadata, should be saved, which enables future data volumes to be estimated. The paper concludes with a future-facing consideration of the global data infrastructure evolution that follows from the blurring of boundaries between climate and weather, and the changing nature of published scientific results in the digital age.

The format of this journal allows for public comment on these papers. It would be great for us to draft an official response from the pangeo perspective that summarizes our wishes and recommendations regarding CMIP6 data.

Thoughts?

niallrobinson commented 6 years ago

Hi @rabernat - I think that's a great idea.

First off, I've just read the abstract, so this is subject to reading the actual paper :D

I think something advocating for what we've been over-dramatically calling !!THE PANGEO HYPOTHESIS!! would be useful, i.e. thin client, very elastic, lazy evaluation. My feeling is that this kind of functionality then moves us on to the next batch of challenges, which would be good to raise in such a response, e.g. data discovery, automatic hypothesis generation, etc. We currently have a book chapter under review which explains a lot of this. If it's out in time it would be useful to point to.
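To make the "lazy evaluation" part of that concrete, here is a toy sketch of the idea: build a task graph cheaply on a thin client and only compute when a result is actually requested. This mimics what dask.delayed does in a few lines; it is not the dask API itself, and the names here are purely illustrative.

```python
# Toy lazy-evaluation sketch: graph construction is cheap, and
# nothing executes until .compute() is called. Hypothetical code,
# not dask itself.
class Delayed:
    def __init__(self, func, *args):
        self.func, self.args = func, args

    def compute(self):
        # Recursively evaluate any Delayed dependencies first,
        # then apply this node's function.
        args = [a.compute() if isinstance(a, Delayed) else a
                for a in self.args]
        return self.func(*args)

add = lambda a, b: a + b
# Build a three-node task graph; no work happens here.
total = Delayed(add, Delayed(lambda x: x * 2, 21),
                     Delayed(lambda x: x + 1, 57))
print(total.compute())  # 100
```

The thin-client point falls out of this: the client only holds the (small) graph, while the actual evaluation can happen wherever the elastic compute lives.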

Basically, we'd be very much up for being involved in writing something like that.

darothen commented 6 years ago

That's a great idea, @rabernat. I think the paper is quite nicely written and broaches many key issues, hopefully many members of our community can carve out some time to review it in detail and share notes.

Two angles come to mind for an official comment:

  1. Broad community accessibility to the ESGF replicas; is this development pattern consistent with our view of how our respective communities will move towards accessing and analyzing data? What is the gap - or rather, the opportunity - to be bridged here?

  2. "Future science" is alluded to in the final section (p. 25, lines 5-7) in the couched terms of "data analytics at large scale." I think this drives most of our motivation with Pangeo (certainly my efforts!), but perhaps a critical review from the early practitioners of this discipline within our midst would be of value to the broader community. Can we identify, in a preliminary sense, where the data infrastructure plan for CMIP6 falls short of empowering those new scientific analyses and approaches? This paper is clearly set up to be revisited a few years down the line to vet predictions - we should aim for our own predictions, too.

Looking forward to hearing others' comments and ideas! I would be enthusiastic about contributing to a draft response, even with my limited spare time at the moment.

JiaweiZhuang commented 6 years ago

Thanks for sharing the paper! Some very specific thoughts:

1. About computational platforms

CMIP6 will have a serious data transfer issue, as discussed by Dart et al. (2017). The Balaji et al. (2018) paper doesn't say much about this issue, but it does mention "secondary repositories" (p. 22, line 5):

"Based on experience in CMIP5, it is expected that a number of “special interest” secondary repositories will hold selected subsets of CMIP6 data outside of the ESGF federation. This will have the effect of widening data accessibility geographically, and by user communities, with obvious benefit to the CMIP6 program. The WIP encourages the support of these secondary repositories where it does not undermine CMIP6 data management and integrity objectives."

I expect some public clouds will be used to increase data accessibility, considering that part of the CMIP5 archive has been hosted on AWS since 2013. If so, the Pangeo project will be very relevant, since it is pretty cloud-centric.

2. About software tools

ESMValTool seems to be the official tool for analyzing CMIP data, but it might not integrate very well with cloud platforms.

Take regridding as an example (since I am most familiar with it). Regridding is an important pre-processing step especially for ocean data, and the paper says a lot about it (p11 line 10):

"Regridding remains a contentious topic, and owing to a lack of consensus, the WIP recommendations on regridding remain in flux. The CMIP6 Output Grid Guidance document outlines a number of possible recommendations, including the provision of “weights” to a target grid. Many of the considerations around regridding, particularly for ocean data in CMIP6, are discussed at length in Griffies et al. (2016)."

ESMValTool contains many regridding functionalities. By skimming through its source code, I believe it relies on Iris, which calls ESMPy internally. However, ESMPy's native parallelism is MPI plus horizontal decomposition (JiaweiZhuang/xESMF#3), which does not play well with the distributed computing environment on the cloud. xESMF will be able to use dask pretty natively (not yet, but close...) and should work better with the cloud.
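The "weights to a target grid" approach mentioned in the quote reduces, once the weights exist, to a sparse linear operator applied to the flattened source field. Here is a minimal sketch of that apply step with a tiny dense weight matrix; the grids and weights are invented for illustration and are not generated by ESMPy or xESMF.

```python
import numpy as np

# Hypothetical 2x2 -> 1x2 regrid expressed as a precomputed
# weight matrix W, so that out_flat = W @ in_flat. Real tools
# (ESMPy, xESMF) store W as a sparse matrix; dense here for clarity.
src = np.array([[1.0, 3.0],
                [5.0, 7.0]])          # source field on a 2x2 grid
W = np.array([[0.5, 0.0, 0.5, 0.0],   # average of left column
              [0.0, 0.5, 0.0, 0.5]])  # average of right column
out = (W @ src.ravel()).reshape(1, 2)
print(out)  # [[3. 5.]]
```

Because the apply step is just a matrix-vector product per 2D slice, it maps naturally onto chunked, out-of-core execution (e.g. dask applying the same weights to each time slice independently), which is exactly why the weights-based approach suits cloud platforms better than MPI-style decomposition.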

rabernat commented 6 years ago

I have drafted a comment on this paper https://docs.google.com/document/d/1UagKt9RedKSkJkRseOc84XKj5QmFlnl27opdCyRaKJ8/edit?usp=sharing

I welcome any edits anyone wants to make. If you agree with what I wrote, don't hesitate to add your name (even if you don't make any changes). The comments are technically just single author, but we can sign the comment collectively.

There are a lot of good ideas raised in this thread that might be beyond the scope of this brief comment (which really should be about the paper at hand). I recommend we take these up in a longer publication via #77.

Beyond those on this thread, I would love to get some input from @naomi-henderson, who has extensive experience processing CMIP datasets.

naomi-henderson commented 6 years ago

Hi Ryan, Fantastic! I love your response and would like to add my name. Cheers, Naomi


niallrobinson commented 6 years ago

Yeh - I think this is spot on, so have taken the liberty of adding my name! I've also made a couple of comments which you can take or leave - I'd be very happy for it to go in as is.

jacobtomlinson commented 6 years ago

I have also made comments and added my name. This is hugely representative of the mindset we are trying to foster!

kmpaul commented 6 years ago

Thanks for doing this, Ryan! I think this is fantastic!

jhamman commented 6 years ago

@rabernat - nicely done. I made some minor changes and added a few comments for discussion.

darothen-cc commented 6 years ago

Looks great! I'm going to add some comments as well this evening, will ping back here once they're in. Thanks for taking the initiative on this!

rabernat commented 6 years ago

Thanks for your comments everyone!

Please don't hesitate to just directly edit (or "suggest edit") the document to insert your changes. I will leave this up for the night and then merge and submit it tomorrow.

We clearly have enough ideas for a full-length position paper (#77). I'll get that started asap in order to keep the momentum going.

darothen commented 6 years ago

Comments added; really fantastic work and discussion from everyone.

hot007 commented 6 years ago

Also added comments. I haven't added my name because I'm not active in Pangeo and I used to be an ESGF member, so it might look a bit odd, but I definitely endorse this document. My main comment would just be in support of toning down the emphasis on commercial cloud, as the same objectives may be achieved with a govt-funded cloud adjacent to data repositories (as with NCAR), which in my mind is an optimal solution. For me, where the Pangeo community can add value is not in putting the data in the cloud, but in leveraging the fact that the data is published online (e.g. remotely accessible via OPeNDAP) to enable 'big data' analysis workflows in scalable cloud architectures.

rabernat commented 6 years ago

Claire and others: do not hesitate to add your name to the document if you wish. There is no “official” membership in Pangeo.

I appreciate the diversity of opinions on the commercial cloud. My own opinion is this: based on my recent experience, I am extremely excited about its potential. We have already achieved things that would have been impossible on smaller academic clouds. I agree that a more agnostic tone is appropriate for this document. But for the future, my bet is on commercial cloud over government / academic solutions. In any case, debate and discussion is healthy!


kmpaul commented 6 years ago

@rabernat @hot007 Thanks for the discussion about the role of commercial vs govt-funded cloud prospects. From my perspective at NCAR, I care very deeply about what NCAR's role should be and how best we can support the entire research community (and not simply serve our own interests). To some extent, I am more interested in hearing what other researchers think NCAR's role should be, rather than sharing my own opinion about it.

mrocklin commented 6 years ago

@kmpaul I proposed the following list:

We believe that an ideal data analytics system for these problems has the following properties:

  1. Low administrative hurdles to sign up and log in, even for new, junior, or industry users
  2. Easy web access for popular interactive environments like Jupyter notebooks
  3. Easy web access on the open internet for automated web services and mobile apps
  4. Dynamic and immediate allocation of interactive compute resources at modest sizes (hundreds rather than millions of cores) even if those sessions may have to grow or shrink during the allocation, depending on external use
  5. Cheap costs, sacrificing the high performance network and rich CPU/Memory ratio of super-computing centers, and replacing them with commodity networking and locally attached storage
  6. Co-location with the relevant datasets

Data analytics clusters that have some (but rarely all) of the properties above are growing within existing computing facilities today.
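Property 4 (dynamic allocation that grows or shrinks with demand) is the one that differs most from traditional batch queues. A toy autoscaler illustrates the idea; the function name, thresholds, and task-per-worker ratio here are all hypothetical, not any real scheduler's API.

```python
# Toy adaptive-scaling policy: size an interactive worker pool
# from the current queue depth, within fixed bounds. Illustrative
# only; real systems (e.g. dask's adaptive mode) are more subtle.
def target_workers(queued_tasks, current, tasks_per_worker=10,
                   min_workers=1, max_workers=500):
    """Return the worker count an adaptive pool would scale to."""
    desired = -(-queued_tasks // tasks_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, desired))

print(target_workers(0, 8))      # shrink to the floor when idle
print(target_workers(2500, 8))   # burst to 250 for interactive load
print(target_workers(10**6, 8))  # capped at the pool maximum, 500
```

The contrast with a batch queue is that the decision runs continuously during a session rather than once at job submission, which is what makes "hundreds of cores for minutes at a time" economical.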

I would be curious to hear your thoughts on if achieving something like this within NCAR is feasible. I'm also curious what constraints are important from your perspective.

rabernat commented 6 years ago

I made some final edits and will submit this very soon. Thanks again for your contributions.

We will definitely recycle some of this content for our position paper.

kmpaul commented 6 years ago

@mrocklin Excellent questions! I will be honest with you that our view on these questions is evolving, but I'll tell you my thoughts on this from my perspective (which I believe is shared by a growing number of people here at NCAR).

Just to start, I'll say that most, if not all, constraints that we have will be placed on us by NSF (and NSF's budget for us, which has been fixed for the past 10 years or more).

Regarding your points individually:

  1. I think this is one of the hardest to provide. I believe that people who want access to NCAR computing/analysis resources will always be limited to educational institutions, or those sponsored by educational institutions (e.g., if you are an industry partner). This is a really important issue for NSF. I believe that the application process could be streamlined, but I believe even if it is streamlined, it will look like applications for free research compute on AWS or GCP. I do not believe that NSF will ever allow us to charge users for their time on our systems, so the services would always be provided for free (which then implies the educational or NSF-grant-receiving institution limitation). I think it would be nice to move away from things like Yubikeys and other authentication devices (e.g., in favor of 2FA), but the decisions for that might be above my pay grade. If there are other hurdles that you experience, I'd be very interested in hearing them.
  2. We are working on this. NCAR doesn't have a lot of capacity to support our own gateways, it turns out, but we are working on providing web access to a JupyterHub that has direct access to Cheyenne and our GLADE storage system (for example). A Cheyenne-pointing JupyterHub might be foreseeable in the next month or two.
  3. Hmm. I'm not sure about this one. What automated web services and mobile apps would you like to see for NCAR solutions?
  4. I read this to mean "on-prem cloud," which is something we are working on. I think in the next month or two we will be deploying an experimental on-prem cloud by repurposing one of NCAR's clusters. The configuration for this experimental cloud is still undecided, so I'm soliciting suggestions for what it should look like (e.g., bare-metal Kubernetes?). Since it is purely experimental, we may have limited access, but for some of you, I think I could get you access. Based on our experiences with this experimental machine, we will probably stand up a "full service" on-prem cloud later this year or early next year. That would be open to the community.
  5. Personally, I think this is something we need to be considering. At the moment, we are considering simply repurposing older clusters. Moving into the future, I foresee NCAR seriously considering this approach for updated versions of our on-prem cloud. I believe it would be much cheaper for NCAR (which makes NSF happy).
  6. I believe this is the goal for our on-prem cloud. We have our GLADE parallel filesystem, which is our primary data-production location. In the future, I foresee some of our datasets being produced directly on GLADE, and then "published" to commercial cloud. However, some (probably most) datasets will be experimental and never published. I believe (and I think I have backing from those above me) that the Pangeo platform should be the tool used for diagnostics of these experimental datasets, too, and not just for published datasets. That requires making GLADE available to (co-located with) our on-prem cloud, or, at least, making data on GLADE easily ported to our on-prem cloud storage.

Did that answer your questions sufficiently? If you want further details on anything, I can try to get that for you. We (i.e., NCAR) are really trying to move in the cloud-direction, and I believe that the successes of all of you in the Pangeo community are the primary reason for this. So, please, continue to send me your questions and tell us what we should be doing better. (At the very least, I care.)

mrocklin commented 6 years ago

What automated web services and mobile apps would you like to see for NCAR solutions?

Lets say that I do a bunch of analysis on a bunch of satellite data, and turn that into some application that shows people expected flooding patterns based on recent imagery and atmospheric analysis. I turn that into a small web application that I want to expose to the world. I turn my giant model into a few lines of javascript that run on browsers around the world. That javascript needs to read specific bits of the satellite imagery and atmospheric simulations from where my data was stored.

If I did my analysis on the cloud and the data is stored on S3, then this is probably feasible. If I did my analysis at NCAR, then I need to find some way to get the relevant datasets outside of NCAR before they can be used publicly. Or I can probably work with NCAR people to make a new product to expose to the world, but that probably doesn't scale from a personnel perspective and introduces a non-trivial administrative burden.

I believe that people who want access to NCAR computing/analysis resources will always be limited to educational institutions

Interestingly a lot of other conversations I've had with similar institutions have been the opposite. Some people have been under strong pressure to make their services accessible to industry so that companies can do a lot of the last-mile work to reach citizens and derive value from what would otherwise be bottled up resources.

This may not be NCAR's role though.

kmpaul commented 6 years ago

What automated web services and mobile apps would you like to see for NCAR solutions?

I see what you are getting at. I know of a project at NCAR that has been trying to address this kind of need, but I feel it is a long way off with our existing infrastructure. If we were to deploy an on-prem cloud, and set it up much like a commercial cloud, we might be able to speed it along. I could not give you a time-frame for this, but I know it is a desired feature that some people here at NCAR are working on. There is currently not enough money for this project, I feel, to move it along quickly...and, like too many NCAR projects, it is in-house. I think that opening that project up to an open development community could speed it up, but I don't know what the likelihood of that might be.

I believe that our dataset "publication" workflow, in the future, will involve automated data movement to the cloud (i.e., outside of the hands of the user). The dataset would be given a DOI, and it could be easily attached to anyone's cloud instance. ...At least, that is one of the goals of our new publication services.

I believe that people who want access to NCAR computing/analysis resources will always be limited to educational institutions

I could be wrong in my assessment of the situation here at NCAR, but the educational institution limitation has always been a mandate for us. I think, in the past, they have seen that as an implementation of NSF's mission to support the university research community.

That said, NSF sounds like they have been relaxing some of those constraints, at least in the context of open development. I know that we at NCAR are trying to build helpful connections with industry partners, but it's a new space for us, and I'm not sure we have a lab-wide policy on it, yet. I could look into this for you, though, and give you a perspective on this issue from those above my station.

rabernat commented 6 years ago

Comment posted!

https://www.geosci-model-dev-discuss.net/gmd-2018-52/#discussion

Thanks for everyone for your enthusiastic participation. Please keep these ideas fresh, as they will be needed for our full-length position paper (#77).

niallrobinson commented 6 years ago

omg there has been a lot of chat! I haven't read it all but...I was interested in what @hot007 and @kmpaul were mentioning with

toning down emphasis on commercial cloud, as the same objectives may be achieved with govt-funded cloud adjacent to data repositories

I don't think this is quite true, to my mind.

Firstly, I'd like to invoke the "horses for courses" clause of saying that govt cloud adjacent to data is a super useful thing to have. However, here's my thinking on why the commercial cloud is fundamentally useful:

An analyst's ideal workflow is inherently volatile, i.e. they spend most of their time thinking and want computations returned instantaneously. Therefore the resources they need are volatile too. Commercial cloud providers have so many users that all this volatility gets averaged out, so they can provide a cost-effective service from their huge compute farms.

If we have a govt cloud, then we have a choice to make: (1) it is so highly specced that everyone can get quick answers to their computations, but there are a lot of idling resources, or (2) people's jobs get queued (booo) but resources are used efficiently. Unless you have very, very many users on a very, very big compute farm, I think this trade-off is inevitable.

(of course, your government is much bigger than ours, so it may be that you are talking about something which more fits into (1))
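The averaging-out argument above is just statistical multiplexing, and the effect of pool size can be made quantitative with a toy model: if each of n analysts independently needs a burst of compute with probability p at any moment, the relative variability of total demand falls like 1/sqrt(n). The model and numbers here are illustrative assumptions, not measurements of any real facility.

```python
import math

# Toy statistical-multiplexing model: n independent users, each
# active (demanding one unit of compute) with probability p.
def relative_std(n_users, p_active=0.1):
    """Std of total demand divided by its mean (Bernoulli model)."""
    mean = n_users * p_active
    std = math.sqrt(n_users * p_active * (1 - p_active))
    return std / mean

# A small shared pool sees demand swing by nearly its own mean;
# a commercial-scale pool sees demand that is almost flat.
print(round(relative_std(10), 3))       # ~0.949
print(round(relative_std(100_000), 3))  # ~0.009
```

This is why option (1) is expensive at small scale: to keep answers instantaneous, a ten-user pool must be provisioned far above its average load, while a provider with a hundred thousand users can run close to capacity.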

I'm doing a poster that talks about this at EGU if anyone's there! X3.48 EGU2018-9208

kmpaul commented 6 years ago

@niallrobinson

Good point! That is actually the issue that makes Cheyenne and GLADE here at NCAR so affordable: the resources have an extremely high duty factor. If the duty factor were to drop to the point that queue wait times ceased to be an issue, it would cost the NSF too much to justify.

Personally, I think that on-prem/govt cloud is useful primarily for "unpublished" datasets, and that one of the critical steps in publication is moving them to the cloud. I believe I mentioned to some of you in a comment in the Google doc that @rabernat put together for our response that the cost of running GLADE is about $0.0067/GB/month. AWS Glacier, for example, costs $0.004/GB/month, which might make Glacier an affordable option for long-term storage of our "published" datasets (ignoring transfer and access charges, etc.).
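A back-of-envelope comparison using the per-GB/month figures above makes the gap concrete at archive scale. The 1 PB volume is an arbitrary round number for illustration, and as noted, transfer and retrieval charges are ignored.

```python
# Monthly storage cost per petabyte, using the quoted rates.
GLADE_PER_GB = 0.0067    # $/GB/month (on-prem parallel filesystem)
GLACIER_PER_GB = 0.004   # $/GB/month (AWS Glacier, storage only)

petabyte_gb = 1_000_000  # treating 1 PB as 10^6 GB for simplicity
glade_monthly = GLADE_PER_GB * petabyte_gb
glacier_monthly = GLACIER_PER_GB * petabyte_gb

print(f"GLADE:   ${glade_monthly:,.0f}/month per PB")   # $6,700
print(f"Glacier: ${glacier_monthly:,.0f}/month per PB") # $4,000
```

So on storage alone, Glacier comes in around 40% cheaper per petabyte-month, though retrieval pricing could easily change the picture for frequently accessed data.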

I just want scientists to see the exact same interface, and have the exact same experience, whether they are analyzing published or unpublished datasets. No special knowledge needed!