Closed rabernat closed 6 years ago
Hi @rabernat - I think that's a great idea.
First off, I've just read the abstract, so this is subject to reading the actual paper :D
I think something advocating for what we've been over-dramatically calling !!THE PANGEO HYPOTHESIS!! would be useful, i.e. thin client, very elastic, lazy evaluation. My feeling is that this kind of functionality then moves us on to the next batch of challenges, which would be good to raise in such a response, i.e. data discovery, automatic hypothesis generation, etc. We have a book chapter under review currently which explains a lot of this. If it's out in time it would be useful to point to.
Basically - we'd be very much up for being involved in writing something like that.
That's a great idea, @rabernat. I think the paper is quite nicely written and broaches many key issues; hopefully many members of our community can carve out some time to review it in detail and share notes.
Two angles come to mind for an official comment:
Broad community accessibility to the ESGF replicas; is this development pattern consistent with our view of how our respective communities will move towards accessing and analyzing data? What is the gap - or rather, the opportunity - to be bridged here?
"Future science" is alluded to in the final section (p. 25, lines 5-7) in the couched terms of "data analytics at large scale." I think this drives most of our motivation with Pangeo (certainly my efforts!) but perhaps a critical review from the early practitioners of this discipline within our midsts would be of value to the broader community. Can we identify, in a preliminary sense, where the data infrastructure plan for CMIP6 falls short of empowering those new scientific analyses and approaches? This paper is clearly set up to be revisited a few years down the line to vet predictions - we should aim for our own predictions, too.
Looking forward to hearing others' comments and ideas! I would be enthusiastic about contributing to a draft response, even with my limited spare time at the moment.
Thanks for sharing the paper! Some very specific thoughts:
1. About computational platforms
CMIP6 will have a serious data transfer issue, as discussed by Dart et al. (2017). The Balaji et al. (2018) paper doesn't say much about this issue, but it does mention "secondary repositories" (p22, line 5):
"Based on experience in CMIP5, it is expected that a number of “special interest” secondary repositories will hold selected subsets of CMIP6 data outside of the ESGF federation. This will have the effect of widening data accessibility geographically, and by user communities, with obvious benefit to the CMIP6 program. The WIP encourages the support of these secondary repositories where it does not undermine CMIP6 data management and integrity objectives."
I expect some public clouds will be used to increase data accessibility, considering that part of CMIP5 data have been hosted on AWS since 2013. If so, the Pangeo project will be very relevant since it is pretty cloud-centric.
2. About software tools
ESMValTool seems to be the official tool for analyzing CMIP data, but it might not integrate very well with cloud platforms.
Take regridding as an example (since I am most familiar with it). Regridding is an important pre-processing step especially for ocean data, and the paper says a lot about it (p11 line 10):
"Regridding remains a contentious topic, and owing to a lack of consensus, the WIP recommendations on regridding remain in flux. The CMIP6 Output Grid Guidance document outlines a number of possible recommendations, including the provision of “weights” to a target grid. Many of the considerations around regridding, particularly for ocean data in CMIP6, are discussed at length in Griffies et al. (2016).:"
ESMValTool contains many regridding functionalities. From skimming its source code, I believe it relies on Iris, which calls ESMPy internally. However, ESMPy's native parallelism is MPI + horizontal decomposition (JiaweiZhuang/xESMF#3), which does not play well with the distributed computing environment on the cloud. xESMF will be able to use dask pretty natively (not yet, but close...) and should work better with the cloud.
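To illustrate why weight-based regridding tends to fit chunked, dask-style execution: once the weights exist, regridding each time step is an independent sparse matrix-vector product, so chunks can be processed separately. The following is only a minimal pure-Python sketch of that idea. The function names and toy weights are hypothetical and are not the xESMF or ESMPy API.

```python
# Sketch: apply precomputed regridding weights, one time step at a time.
# The weights form a sparse matrix stored as (dst_index, src_index, weight)
# triplets. All names and numbers are illustrative assumptions, not the
# xESMF or ESMPy API.

def apply_weights(weights, field, n_dst):
    """Regrid one flattened source field onto n_dst destination cells."""
    out = [0.0] * n_dst
    for dst, src, w in weights:
        out[dst] += w * field[src]
    return out

def regrid_chunk(weights, chunk, n_dst):
    """Regrid every time step in a chunk; steps do not depend on each other."""
    return [apply_weights(weights, field, n_dst) for field in chunk]

# Toy 1-D example: 4 source cells averaged pairwise onto 2 destination cells.
weights = [(0, 0, 0.5), (0, 1, 0.5), (1, 2, 0.5), (1, 3, 0.5)]
chunk = [[1.0, 3.0, 5.0, 7.0], [2.0, 2.0, 4.0, 8.0]]  # two time steps
regridded = regrid_chunk(weights, chunk, n_dst=2)
print(regridded)  # [[2.0, 6.0], [2.0, 6.0]]
```

Because `regrid_chunk` only needs the weights and its own chunk, different chunks could in principle be handed to different workers without any communication between them.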
I have drafted a comment on this paper https://docs.google.com/document/d/1UagKt9RedKSkJkRseOc84XKj5QmFlnl27opdCyRaKJ8/edit?usp=sharing
I welcome any edits anyone wants to make. If you agree with what I wrote, don't hesitate to add your name (even if you don't make any changes). The comments are technically just single author, but we can sign the comment collectively.
There are a lot of good ideas raised in this thread that might be beyond the scope of this brief comment (which really should be about the paper at hand). I recommend we take these up in a longer publication via #77.
Beyond those on this thread, I would love to get some input from @naomi-henderson, who has extensive experience processing CMIP datasets.
Hi Ryan, Fantastic! I love your response and would like to add my name. Cheers, Naomi
Yeh - I think this is spot on, so have taken the liberty of adding my name! I've also made a couple of comments which you can take or leave - I'd be very happy for it to go in as is.
I have also made comments and added my name. This is hugely representative of the mindset we are trying to foster!
Thanks for doing this, Ryan! I think this is fantastic!
@rabernat - nicely done. I made some minor changes and added a few comments for discussion.
Looks great! I'm going to add some comments as well this evening, will ping back here once they're in. Thanks for taking the initiative on this!
Thanks for your comments everyone!
Please don't hesitate to just directly edit (or "suggest edit") the document to insert your changes. I will leave this up for the night and then merge and submit it tomorrow.
We clearly have enough ideas for a full-length position paper (#77). I'll get that started asap in order to keep the momentum going.
Comments added; really fantastic work and discussion from everyone.
Also added comments. I haven't added my name because I'm not active in Pangeo and I used to be an ESGF member, so it might look a bit odd, but I definitely endorse this document. My main comment would just be in support of toning down emphasis on commercial cloud, as the same objectives may be achieved with govt-funded cloud adjacent to data repositories (as with NCAR), which in my mind is an optimal solution. For me, where the Pangeo community can add value is not in putting the data in the cloud, but in leveraging the fact that data is published online (e.g. remotely accessible via OPeNDAP) to enable 'big data' analysis workflows in scalable cloud architectures.
Claire and others: do not hesitate to add your name to the document if you wish. There is no “official” membership in Pangeo.
I appreciate the diversity of opinions on the commercial cloud. My own opinion is this: based on my recent experience, I am extremely excited about its potential. We have already achieved things that would have been impossible on smaller academic clouds. I agree that a more agnostic tone is appropriate for this document. But for the future, my bet is on commercial cloud over government / academic solutions. In any case, debate and discussion are healthy!
@rabernat @hot007 Thanks for the discussion about the role of commercial vs govt-funded cloud prospects. From my perspective at NCAR, I care very deeply about what NCAR's role should be and how best we can support the entire research community (and not simply serve our own interests). To some extent, I am more interested in hearing what other researchers think NCAR's role should be than in sharing my own opinion about it.
@kmpaul I proposed the following list:
We believe that an ideal data analytics system for these problems has the following properties:
Data analytics clusters are growing within existing computing facilities today that have some (but rarely all) of the properties above
I would be curious to hear your thoughts on if achieving something like this within NCAR is feasible. I'm also curious what constraints are important from your perspective.
I made some final edits and will submit this very soon. Thanks again for your contributions.
We will definitely recycle some of this content for our position paper.
@mrocklin Excellent questions! I will be honest with you that our view on these questions is evolving, but I'll tell you my thoughts on this from my perspective (which I believe is shared by a growing number of people here at NCAR).
Just to start, I'll say that most of the constraints we have, if any, will be placed on us by NSF (and NSF's budget for us, which has been fixed for the past 10 years or more).
Regarding your points individually:
Did that answer your questions sufficiently? If you want further details on anything, I can try to get that for you. We (i.e., NCAR) are really trying to move in the cloud-direction, and I believe that the successes of all of you in the Pangeo community are the primary reason for this. So, please, continue to send me your questions and tell us what we should be doing better. (At the very least, I care.)
"What automated web services and mobile apps would you like to see for NCAR solutions?"
Let's say that I do a bunch of analysis on a bunch of satellite data, and turn that into some application that shows people expected flooding patterns based on recent imagery and atmospheric analysis. I turn that into a small web application that I want to expose to the world. I turn my giant model into a few lines of JavaScript that run on browsers around the world. That JavaScript needs to read specific bits of the satellite imagery and atmospheric simulations from where my data was stored.
If I did my analysis on the cloud and the data is stored on S3 then this is probably feasible. If I did my analysis at NCAR then I need to find some way to get the relevant datasets outside of NCAR before they have public use. Or I can probably work with NCAR people to make a new product to expose out to the world, but that probably doesn't scale from a personnel perspective and introduces a non-trivial administrative burden.
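One way to picture "reading specific bits" of a remotely stored dataset is an HTTP range request: the client works out which bytes hold the piece it needs and asks the object store for only those. The sketch below is illustrative only. It assumes a hypothetical layout in which fixed-size, uncompressed tiles are stored back to back in a single object, and it only builds the request header rather than contacting any real service.

```python
# Sketch: find the bytes that hold one tile of a remote object, then build
# the HTTP Range header a client would send to fetch only those bytes.
# Assumes fixed-size, uncompressed tiles stored back to back in one object;
# this layout and all names are hypothetical.

def tile_byte_range(tile_index, tile_shape, itemsize, header_bytes=0):
    """Return the inclusive (start, end) byte offsets of one tile."""
    rows, cols = tile_shape
    tile_bytes = rows * cols * itemsize
    start = header_bytes + tile_index * tile_bytes
    end = start + tile_bytes - 1  # HTTP byte ranges are inclusive
    return start, end

def range_header(start, end):
    """Build the Range header for an inclusive byte range."""
    return {"Range": f"bytes={start}-{end}"}

# Fourth tile (index 3) of 256 x 256 four-byte values.
start, end = tile_byte_range(tile_index=3, tile_shape=(256, 256), itemsize=4)
headers = range_header(start, end)
print(headers)  # {'Range': 'bytes=786432-1048575'}
```

Object stores such as S3 accept requests of this form over plain HTTP, which is part of why a browser-side application can read a small piece of a large dataset without a custom server in between.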
"I believe that people who want access to NCAR computing/analysis resources will always be limited to educational institutions"
Interestingly a lot of other conversations I've had with similar institutions have been the opposite. Some people have been under strong pressure to make their services accessible to industry so that companies can do a lot of the last-mile work to reach citizens and derive value from what would otherwise be bottled up resources.
This may not be NCAR's role though.
"What automated web services and mobile apps would you like to see for NCAR solutions?"
I see what you are getting at. I know of a project at NCAR that has been trying to address this kind of need, but I feel it is a long way off with our existing infrastructure. If we were to deploy an on-prem cloud, and set it up much like a commercial cloud, we might be able to speed it along. I could not give you a time-frame for this, but I know it is a desired feature that some people here at NCAR are working on. There is currently not enough money for this project, I feel, to move it along quickly...and, like too many NCAR projects, it is in-house. I think that opening that project up to an open development community could speed it up, but I don't know what the likelihood of that might be.
I believe that our dataset "publication" workflow, in the future, will involve automated data movement to the cloud (i.e., outside of the hands of the user). The dataset would be given a DOI, and it could be easily attached to anyone's cloud instance. ...At least, that is one of the goals of our new publication services.
"I believe that people who want access to NCAR computing/analysis resources will always be limited to educational institutions"
I could be wrong in my assessment of the situation here at NCAR, but the educational institution limitation has always been a mandate for us. I think, in the past, they have seen that as an implementation of NSF's mission to support the university research community.
That said, NSF sounds like they have been relaxing some of those constraints, at least in the context of open development. I know that we at NCAR are trying to build helpful connections with industry partners, but it's a new space for us, and I'm not sure we have a lab-wide policy on it, yet. I could look into this for you, though, and give you a perspective on this issue from those above my station.
Comment posted!
https://www.geosci-model-dev-discuss.net/gmd-2018-52/#discussion
Thanks to everyone for your enthusiastic participation. Please keep these ideas fresh, as they will be needed for our full-length position paper (#77).
omg there has been a lot of chat! I haven't read it all but...I was interested in what @hot007 and @kmpaul were mentioning with
"toning down emphasis on commercial cloud, as the same objectives may be achieved with govt-funded cloud adjacent to data repositories"
I don't think this is quite true, to my mind.
Firstly, I'd like to invoke the "horses for courses" clause and say that govt cloud adjacent to data is a super useful thing to have. However, here's my thinking on why the commercial cloud is fundamentally useful:
An analyst's ideal workflow is inherently volatile, i.e. they spend most of their time thinking and want their computations returned instantaneously. Therefore the ideal resources they need are volatile. Commercial cloud providers have so many users that all this volatility gets averaged out, so they can provide a cost-effective service from their huge compute farms.
If we have a govt cloud, then we have a choice to make. (1) It is so highly specced that everyone can get quick answers to their computations, but there are a lot of idling resources, or (2) people's jobs get queued (booo) but resources are used efficiently. Unless you have very very many users on a very very big compute farm, I think this is inevitable.
(of course, your government is much bigger than ours, so it may be that you are talking about something which more fits into (1))
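The averaging-out argument can be made concrete with a toy model. The sketch below assumes, purely for illustration, that each user is independently active with probability `p_active` at any moment, and that the facility is sized at mean demand plus three standard deviations so that queueing is rare. Neither assumption comes from any real facility.

```python
import math

# Toy model of pooled, bursty demand. Assumptions (illustrative only):
# users are independent, each is active with probability p_active, and
# capacity is sized at mean demand plus a few standard deviations.

def utilization(n_users, p_active, headroom_sigmas=3.0):
    """Mean demand divided by the capacity needed to rarely queue."""
    mean = n_users * p_active
    sigma = math.sqrt(n_users * p_active * (1.0 - p_active))
    capacity = mean + headroom_sigmas * sigma
    return mean / capacity

for n in (10, 1_000, 100_000):
    print(n, round(utilization(n, p_active=0.05), 2))
```

Under these assumptions the fraction of capacity actually in use rises from roughly 0.2 with ten users to above 0.9 with a hundred thousand, because the spread of total demand grows only with the square root of the number of users. That is options (1) and (2) above in numbers: a small facility must either idle or queue, while a very large pool can largely avoid both.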
I'm doing a poster that talks about this at EGU if anyone's there! X3.48 EGU2018-9208
@niallrobinson
Good point! That is actually the issue that makes Cheyenne and GLADE here at NCAR so affordable: the resources have an extremely high duty factor. If the duty factor were lowered to the point that queue wait times ceased to be an issue, it would cost the NSF too much to justify.
Personally, I think that on-prem/govt cloud is useful primarily for "unpublished" datasets, and that one of the critical steps in publication is moving them to the cloud. I believe I mentioned to some of you, in a comment in the Google doc that @rabernat put together for our response, that the cost of running GLADE is about $0.0067/GB/month. For comparison, AWS Glacier costs $0.004/GB/month, which might make long-term storage of our "published" datasets affordable on AWS Glacier (ignoring transfer and access charges, etc.).
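As a rough illustration of the two storage rates quoted above, the arithmetic for a given archive size looks like this. The 1 PB size is a hypothetical example, not a figure from this thread, and transfer and access charges are ignored as noted.

```python
# Storage rates quoted in the comment above (USD per GB per month).
GLADE_USD_PER_GB_MONTH = 0.0067
GLACIER_USD_PER_GB_MONTH = 0.004

def monthly_cost_usd(size_gb, rate_usd_per_gb_month):
    """Storage-only monthly cost; ignores transfer and access charges."""
    return size_gb * rate_usd_per_gb_month

size_gb = 1_000_000  # hypothetical 1 PB archive (decimal units)
glade = monthly_cost_usd(size_gb, GLADE_USD_PER_GB_MONTH)
glacier = monthly_cost_usd(size_gb, GLACIER_USD_PER_GB_MONTH)

print(f"GLADE:   ${glade:,.0f}/month")    # GLADE:   $6,700/month
print(f"Glacier: ${glacier:,.0f}/month")  # Glacier: $4,000/month
print(f"Ratio:   {glacier / glade:.2f}")  # Ratio:   0.60
```

On storage alone, the quoted Glacier rate comes to about 60% of the quoted GLADE rate. Retrieval and egress fees, which this sketch leaves out, could change the comparison considerably for frequently accessed data.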
I just want scientists to see the exact same interface, and have the exact same experience, whether they are analyzing published or unpublished datasets. No special knowledge needed!
A new discussion paper has appeared on Geoscientific Model Development entitled Requirements for a global data infrastructure in support of CMIP6
https://www.geosci-model-dev-discuss.net/gmd-2018-52/
The format of this journal allows for public comment on these papers. It would be great for us to draft an official response from the pangeo perspective that summarizes our wishes and recommendations regarding CMIP6 data.
Thoughts?