pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

Position paper for QJRMS #77

Closed rabernat closed 6 years ago

rabernat commented 6 years ago

@alberto-arribas proposed we write a position paper for QJRMS explaining the need for a cloud-based platform for Big Data climate science. While those of us involved in Pangeo are already convinced about this need, such a paper would help us communicate our efforts to the broader community.

pwolfram commented 6 years ago

@rabernat, I am interested in participating and I have a summer student who I plan to have work with the pangeo stack this summer-- this may be a great chance to collaborate!

jgerardsimcock commented 6 years ago

@rabernat When you say cloud-based platform, are you making a distinction between HPC compute resources with cloud-based storage and a full cloud-compute and cloud-storage system? It seems like much of the work done by pangeo users is in the former and not the latter, although it looks like we may have some of the latter soon (#71).

rabernat commented 6 years ago

@pwolfram: great! Let's discuss more at ocean sciences!

@jgerardsimcock: Our goal is to support both traditional HPC and cloud. We already have a working cloud-based environment at http://pangeo.pydata.org/ (this is on Google Cloud Platform)--I encourage you to give it a spin! #71 is about porting that setup to AWS.

fmaussion commented 6 years ago

(deleted my previous post, sorry for the noise)

alberto-arribas commented 6 years ago

Great. I genuinely think that a position paper in which we present evidence of the advantages of a pangeo-like platform for doing scientific work will be extremely useful for our community.

The reason for targeting QJRMS as opposed to BAMS is to make it less an "opinion" paper and more a "fact based research" paper. QJRMS papers have a good track record of becoming technical reference papers in weather/climate so it is a good shout ... Full disclosure: I happen to be an ex-assoc. editor for QJ so have good contacts with the current editorial team. They are keen on the idea :-)

jhamman commented 6 years ago

@rabernat - in your mind, is there anything that is keeping us from writing this paper now?

JiaweiZhuang commented 6 years ago

This is highly relevant to our group's project so I am very interested. I am not sure how I can directly contribute to this since I am not a "core Pangeo developer", but you might find our project a useful reference/case-study. I put some essential information here for reference.

As background, I've been developing the GEOS-Chem chemical transport model for 3+ years. The model is used by 100+ research groups worldwide and contributes to thousands of publications each year. We have been increasing the model resolution, adding chemical mechanisms, and using better HPC software to keep the model at the state of the science and technology, but this also makes the model much more complicated and harder to use. For an average user, even setting up the software environment and downloading input data can take many weeks.

To address those challenges, we are trying to build operational support for our model on AWS. AWS has been pretty supportive and agreed to host all of our model's input data for free, as part of their public datasets. Our project is fully documented here: http://cloud-gc.readthedocs.io. You can also find it on our model's main website. The doc contains detailed AWS tutorials, written for atmospheric scientists (not technical developers!). It can be a reference for using other models on the cloud, since all Earth science models are highly similar from a software perspective.

I feel that Pangeo can really help our model users, especially for the data analysis step. People tend to spend much more time analyzing data than running simulations. Once the simulation is done and the data are uploaded to S3, users no longer need to launch VMs and can simply log into Pangeo. GEOS-Chem data are fully compatible with the xarray ecosystem. Even the legacy "BPCH" format works with xarray thanks to the great xbpch tool written by @darothen.

When our project gets more mature, it might bring a large flux of new Pangeo users :)

----- update ----- By the way, our model's input data are the MERRA-2 reanalysis and the GEOS-FP near-real-time product of NASA GMAO. Besides using them to drive GEOS-Chem simulations, people can also analyze the data directly (as they already have in research papers). The data bucket can be found here.

JiaweiZhuang commented 6 years ago

PS: Should the thread title be "QJRMS" not "QGRMS"?

niallrobinson commented 6 years ago

A couple of people I've spoken to at EGU have said that the tech is all well and good, but that they think the real challenges are what you might call the "business processes". Specifically, how do you make this secure and, related, how do you stop people spending too much. Additionally, they were of the view that using cloud is not cost effective.

I had no arithmetic to refute this at the time (although in the pub @jhamman seemed to have some good stats), but my gut feeling is that it will be more cost effective. Especially if we evaluate the true cost of on-prem resources. I think there might be a significant audience who would like to see that discussion (albeit briefly) as part of this paper.

Another thing that comes to mind which people ask about is the use of TCP instead of MPI - "If you are using TCP on your big posh HPC, isn't that a waste of a very fast interconnect?" Again, I hear that the answer is "not really", but that will surprise a lot of onlookers.

mrocklin commented 6 years ago

> Another thing that comes to mind which people ask about is the use of TCP instead of MPI - "If you are using TCP on your big posh HPC, isn't that a waste of a very fast interconnect?" Again, I hear that the answer is "not really", but that will surprise a lot of onlookers.

For simulation this would probably be true, especially if you need very low-latency connections. TCP on InfiniBand is a pretty well-known technology and decent at high bandwidth. Fairly simple experiments show that we're able to saturate Cheyenne's InfiniBand (3 GB/s) with standard dask array computations.

Alternatively, if you're looking to be a bit snarky, one answer to this question is "yes, for data analysis the very fast interconnect was probably a waste, we didn't really need it in the first place. We can use it just fine though if that was your concern."

rabernat commented 6 years ago

We began to address the cost issue in our short comment.

The cost of cloud compute is very low. It's the storage that kills us. However, to calculate the total cost of the alternatives to cloud, you have to figure out how many groups have created their own "dark repositories" of CMIP-type data. Since the current ESGF infrastructure doesn't allow users to compute next to the data, they have to download it to local computers. At LDEO, we have over 1 PB of data under people's desks. How many times is this pattern repeated across the whole field? What is the total sum of that cost, including facilities, personnel, and capital depreciation? Even if a single dark repository costs 1/10 as much as cloud, there might be 100 dark repositories or more.
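The back-of-envelope argument above can be written out as a short calculation. This is only a sketch: the function names are hypothetical, costs are expressed relative to one shared cloud-hosted copy, and the 1/10 and 100 figures are the illustrative numbers from this comment, not measured costs.

```python
from fractions import Fraction


def aggregate_cost_ratio(n_repositories: int, relative_cost_per_repo: Fraction) -> Fraction:
    """Total cost of all dark repositories, in units of one shared cloud copy."""
    return n_repositories * relative_cost_per_repo


def break_even_repositories(relative_cost_per_repo: Fraction) -> Fraction:
    """Number of dark repositories at which their total cost equals the cloud copy."""
    return 1 / relative_cost_per_repo


# Illustrative numbers from the comment: each dark repository costs 1/10 as
# much as the cloud copy, and there are 100 of them across the field.
per_repo = Fraction(1, 10)
ratio = aggregate_cost_ratio(100, per_repo)
break_even = break_even_repositories(per_repo)

print(f"100 dark repositories cost {ratio}x one cloud copy")
print(f"break-even at {break_even} dark repositories")
```

Under these assumed numbers the field as a whole already pays several times the cloud price once there are more than a handful of duplicated archives, which is the point of the argument.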

The challenge as I see it is to trigger some sort of phase transition for the community, whereby people abandon their dark repositories in favor of cloud. That is a social problem, one that Pangeo is poised to lead on.

rabernat commented 6 years ago

Also, the game changer could be if Amazon or Google decided to provide a significant amount of hosting for free. They have already done this for other scientific data sets, and they would do it for us if they thought it would bring them significant business...which it would if the whole community suddenly switched to cloud. Another chicken vs. egg problem...

niallrobinson commented 6 years ago

> Also, the game changer could be if Amazon or Google decided to provide a significant amount of hosting for free.

on it ;-) I said something along the lines of "CMIPs going to be 150PBs" at the poster session last night to some relevant, nameless people, and they sort of turned white and said the conversation needed to go up a pay grade 😄

> The challenge as I see it is to trigger some sort of phase transition for the community, whereby people abandon their dark repositories in favor of cloud. That is a social problem, one that Pangeo is poised to lead on.

totally agree. One technical thing this raises is the concept of lazy products, which we were talking about on our call - then we can make the argument that storage is expensive, but because of that, you lean towards storing only the original canonical datasets and recalculating a lot of products as and when they are needed.
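A minimal sketch of what a lazy product could look like, assuming nothing about how Pangeo would actually implement it: the `CANONICAL` dictionary stands in for a canonical dataset in object storage, and the `LazyProduct` class and the example values are hypothetical. Only the canonical data is stored; each derived product is just a recipe that is re-run on request.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for a canonical dataset held in cloud storage.
CANONICAL: Dict[str, List[float]] = {
    "example_series": [10.0, 12.0, 14.0, 16.0],
}


class LazyProduct:
    """A derived product stored as a recipe, not as data."""

    def __init__(self, source_key: str, recipe: Callable[[List[float]], object]):
        self.source_key = source_key  # which canonical dataset to read
        self.recipe = recipe          # how to derive the product from it

    def compute(self) -> object:
        # Recalculated from the canonical data every time it is requested.
        return self.recipe(CANONICAL[self.source_key])


def anomaly(values: List[float]) -> List[float]:
    """Departure of each value from the series mean."""
    centre = mean(values)
    return [v - centre for v in values]


time_mean = LazyProduct("example_series", mean)
anomalies = LazyProduct("example_series", anomaly)

print(time_mean.compute())
print(anomalies.compute())
```

The trade-off is compute time on every request in exchange for not paying to store each derived product, which fits the observation earlier in the thread that cloud compute is cheap and storage is not.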

So to be clear - we've got compelling answers to all these things, but sometimes I think we are so excited about how profound the tech solution can be, that people think that we might be overlooking this other stuff, so we should head them off.

rabernat commented 6 years ago

"CMIPs going to be 150PBs"

This is no longer accurate. According to the new CMIP6 infrastructure paper, it's more like 18 PB. The core dataset which will be replicated at all ESGF nodes is ~2 PB. These are much more manageable numbers.

niallrobinson commented 6 years ago

thank $*%& for that! Yeh - that's much more like it.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rsignell-usgs commented 6 years ago

Is there in fact any pangeo position paper in the works?

niallrobinson commented 6 years ago

I'm still up for this if others are (be gone, Stale Bot)

niallrobinson commented 6 years ago

Maybe this is something to thrash out as part of !!PANGEOCON2018!!

stale[bot] commented 6 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.