pangeo-data / pangeo-datastore

Pangeo Cloud Datastore
https://catalog.pangeo.io

Adding SubX retrospective forecasts for verification with climpred #121

Closed · aaronspring closed this 3 years ago

aaronspring commented 3 years ago

Forecast data: SubX (subseasonal predictions) is hosted at IRIDL via OPeNDAP and is not in the cloud yet; this is a proposal to put SubX into the Pangeo cloud.

Verification data: ERA5 daily, resampled from the ERA5 hourly data available in the Pangeo cloud

Result: prediction skill over 45 days

Software: climpred https://github.com/pangeo-data/climpred

To be put in catalog: https://github.com/pangeo-data/pangeo-datastore/blob/master/intake-catalogs/atmosphere.yaml

This works locally on subsetted data, but the workflow with downloading is slow and annoying.

I would suggest starting with one or two variables, like tas and precip, from the four models initialised on Wednesdays (checking their counterparts in ERA5 beforehand).
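For illustration, a minimal sketch of that subset, assuming a hypothetical IRIDL OPeNDAP endpoint (real paths live under http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/) and that S decodes to a datetime coordinate:

```python
# Sketch only: the endpoint path is hypothetical; look up real paths in the
# IRIDL .SubX catalog. Opening via OPeNDAP is lazy, so nothing downloads yet.
import xarray as xr

url = ("http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/"
       ".GMAO/.GEOS_V2p1/.hindcast/.tas/dods")
ds = xr.open_dataset(url)

# Keep only initializations that fall on a Wednesday (dayofweek: Monday=0).
wednesdays = ds.sel(S=ds["S"].dt.dayofweek == 2)
```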

@kpegion Are we allowed to put SubX data from IRI to the pangeo cloud?

@pangeo is this a dataset you would consider hosting?

kpegion commented 3 years ago

The SubX Team would love to have SubX data in the cloud. We have not had the resources to do this ourselves. If there are resources available (human and computing) or we can work together to obtain resources, then we are agreeable and supportive.

The data itself is public and can be put in the Pangeo Cloud (or anywhere) as long as appropriate reference is made to the data and its original source as indicated here: http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/

One thing I have yet to understand about data in the cloud is when it costs the user to get the data. For example, will users have to pay to get and process SubX data in the Pangeo cloud? If not initially, will there eventually be a cost?

Another question, SubX data is constantly growing with new forecasts and new models added. Will this be kept up to date in the Pangeo cloud?

aaronspring commented 3 years ago

I have a limited understanding of the costs. Users need to provide a billing account: https://github.com/pangeo-data/pangeo-datastore/blob/master/README.md

But apparently there are no costs within the us-central region. https://github.com/pangeo-data/pangeo-datastore/issues/78#issuecomment-580751011

charlesbluca commented 3 years ago

On the topic of billing: most of the data in Pangeo's Google Cloud project is in requester pays buckets, which means that the user must specify a Google Cloud project (either their own or Pangeo's) to be billed for egress when accessing the data. This was done to cut down on egress fees, which are zero when the data is accessed through another Google Cloud product in the same region, but much more expensive when it is accessed from an outside network (e.g. your own personal instance of JupyterLab). Typically, we start data off in requester pays buckets and turn that off only for datasets where egress isn't going to be a problem.
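To make this concrete, here is a minimal sketch of accessing a requester pays bucket with gcsfs and xarray, assuming a gcsfs version that supports the requester_pays flag; the bucket path is hypothetical, and your-gcp-project stands in for whatever project gets billed:

```python
# Sketch only: the Zarr path is hypothetical; "your-gcp-project" is the
# Google Cloud project that gets billed for any egress.
import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem(project="your-gcp-project", requester_pays=True)
ds = xr.open_zarr(fs.get_mapper("gs://pangeo-subx/tas.zarr"), consolidated=True)
```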

As for cataloging the data, I am interested: can these datasets be opened as plain xarray objects, or do they require climpred specifically? If the former, we can easily add them to the atmosphere catalog; if the latter, we will need to add them using a climpred class constructor as the driver (an example of this here, with ESM data collections), and we would also have some motivation to eventually write an Intake plugin supporting climpred datasets.
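For context, entries in that catalog are typically opened along these lines; the SubX entry name below is hypothetical, since nothing has been added yet:

```python
# Sketch: opening a cataloged dataset with intake; "subx_gmao_geos_tas" is a
# hypothetical entry name. See atmosphere.yaml for the real entries.
import intake

cat = intake.open_catalog(
    "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/"
    "master/intake-catalogs/atmosphere.yaml"
)
ds = cat["subx_gmao_geos_tas"].to_dask()  # returns an xarray.Dataset
```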

kpegion commented 3 years ago

Thanks @charlesbluca. This is very helpful. Does this mean that if I access the data via Google Colaboratory rather than my own Jupyter instance, it would be free?

SubX is its own project, separate from climpred, consisting of hindcast and forecast data for a collection of models. The data are stored in the IRI Data Library and accessible via OPeNDAP using xarray. climpred is a package that calculates skill scores on the hindcasts; it is not required for accessing SubX data.
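To make that division of labor concrete, here is a hedged sketch of climpred-style verification; exact call signatures vary between climpred versions, and `hind`/`obs` are placeholders for a hindcast with init/lead/member dimensions and matching observations:

```python
# Sketch only: assumes climpred >= 2.x conventions; `hind` and `obs` are
# placeholder datasets already aligned to climpred's expected dimensions.
from climpred import HindcastEnsemble

hindcast = HindcastEnsemble(hind)          # hind: SubX-like hindcast
hindcast = hindcast.add_observations(obs)  # obs: e.g. ERA5-derived verification
skill = hindcast.verify(
    metric="acc",             # anomaly correlation coefficient
    comparison="e2o",         # ensemble mean vs. observations
    dim="init",               # verify across initializations
    alignment="same_verifs",  # align inits to common verification dates
)
```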

charlesbluca commented 3 years ago

With Colab I think it depends - I couldn't find too much info, but this Stack Overflow post seems to imply that a Colab instance's region is somewhat random, and could potentially even be a region outside of the United States, which would still incur some amount of egress fees. You could potentially terminate and restart an instance until it was in your desired region, but that's not ideal.

Typically, users are able to access the data from a regional Google compute instance through Pangeo's JupyterHub on Google Cloud. There's more info here, but essentially we would give you authentication through GitHub to access the JupyterHub, and then you could interact with the data from there without incurring any egress fees. It would, however, still require some additional authentication on Pangeo's Google Cloud project to access the bucket containing the data if it is requester pays.

Good to know that it is accessible using xarray - in that case, the process of adding the data should be:

1. Convert the data to a cloud-optimized format (Zarr),
2. Upload it to a Google Cloud Storage bucket (see the sketch below), and
3. Add an entry for it to the atmosphere intake catalog.
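A hedged sketch of steps 1 and 2, where `ds` is the dataset opened via OPeNDAP and the bucket path and project are placeholders:

```python
# Sketch only: bucket path and project are placeholders; real values depend
# on where the data ends up. `ds` is the xarray Dataset to be uploaded.
import gcsfs

fs = gcsfs.GCSFileSystem(project="your-gcp-project")
ds.chunk({"S": 1}).to_zarr(  # illustrative chunking: one chunk per init
    fs.get_mapper("gs://pangeo-subx/GMAO-GEOS_V2p1/tas.zarr"),
    consolidated=True,
)
```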

I'd be happy to help with any steps of this if you'd still like to upload the data!

rabernat commented 3 years ago

Hi All--I'm sorry it has taken me so long to reply to this thread. I support 💯 the idea of bringing the SubX data into the cloud and would love to collaborate around this.

Let me try to clear up a few points of confusion. Let me first say that it is very reasonable to be confused because our documentation and instructions about data are currently quite inconsistent and somewhat out of date. Everything about how we store and catalog data in the cloud is under heavy churn right now.

1. There is no such thing as "in Pangeo cloud" when it comes to data

There is just "in the cloud." The term "Pangeo Cloud" refers primarily to our cloud-based JupyterHub deployments in Google Cloud and AWS.

Putting data into the cloud means placing the data in a cloud storage service, such as Amazon S3, Google Cloud Storage, or Wasabi. If the permissions are sufficiently open (e.g. the data are public), anyone on the internet can access the data, but access will be faster from within the cloud region where the data are stored. Most of the datasets in our legacy cloud data repository (this repo) are in the Google Cloud region us-central1, the same region where our JupyterHub https://us-central1-b.gcp.pangeo.io/ runs.

2. There are three different possibilities when it comes to paying.

Cloud storage is billed by the GB-month: at roughly $0.02 per GB-month, one TB costs approximately $250 per year in Google Cloud. Additionally, there are egress costs when data moves out of the cloud.

There are three different options we can consider for SubX.

  1. Your project pays for everything. Based on what you said above, it sounds like we are talking about possibly up to $1,000 per year in storage costs.
  2. Pangeo bears the costs. We are supported to do this via EarthCube and can definitely provide hosting support if needed.
  3. The cloud provider bears the costs. Both AWS and Google have Public Data programs. You have to make an application to get your dataset added, and the data have to be public / free to use.

We are happy to help with either 1 or 2. But 3 is really the best option. We could support you if you want to make an application to one of the public dataset programs.

3. This repo is basically deprecated. Going forward, all our efforts on cloud data are happening via the Pangeo Forge project

The approach described in our current docs and on this website has proven not to be very scalable / sustainable. It is too much work to manually create these Zarr stores and upload them. Plus there is this issue:

Will this be kept up to date in the Pangeo cloud?

This is one of the problems that Pangeo Forge aims to solve. You can read more about Pangeo Forge in our Roadmap.

I would much prefer to ingest the SubX data via Pangeo Forge, rather than create another one-off dataset that we can't update. The first step would be for you to fill out this questionnaire, essentially with the exact same info @aaronspring provided in https://github.com/pangeo-data/pangeo-datastore/issues/121#issue-757908225

4. I'm unclear about what the compute needs are for this project

Why do you want the data in the cloud? Who is going to be accessing the cloud data? How will they be working with it?

aaronspring commented 3 years ago

Regarding 4.:

Why the cloud? Data access: the data is so large that computation is far from interactive. The computation itself works on HPC with ~1M tasks (I'm working on better chunking), but downloading the data beforehand takes ages. The computation of probabilistic metrics also seems to create too many tasks. And even data subsetted to just Europe takes a day for one variable from four models.

Who will access it? Kathy's group? The S2S summer school that got cancelled? And the subseasonal-to-seasonal prediction community. I'd give it a try for European predictability.

Forecast verification is not limited to climpred, but my biased view proposes using climpred.

rabernat commented 3 years ago

Are Aneesh Subramanian and Judith Berner involved in this project? I am having déjà vu. We are having a parallel conversation with them over email. I think they are applying to the AWS public data program for what they call the "S2S database". Is that the same thing as SubX?

kpegion commented 3 years ago

Regarding 4.:

Why the cloud? Data access: the data is so large that computation is far from interactive. The computation itself works on HPC with ~1M tasks (I'm working on better chunking), but downloading the data beforehand takes ages. The computation of probabilistic metrics also seems to create too many tasks. And even data subsetted to just Europe takes a day for one variable from four models.

The ability to do remote calculation and only bring back what you need or want to keep is of great value. One basic example is that many people only want ensemble mean anomalies for a specific region. Right now, using the IRIDL, one would have to download the entire hindcast for a model, subset the region, compute the ensemble mean, and calculate the climatology and anomalies. Most people don't have the disk space for the entire dataset.
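A hedged sketch of that workflow in xarray, assuming SubX-style dimension names (S: initialization, M: ensemble member, L: lead, Y/X: latitude/longitude), a datetime S coordinate, and an illustrative Europe box:

```python
# Sketch only: dimension names follow common SubX/IRIDL conventions; `ds` is
# a hindcast dataset and the regional bounds are illustrative.
europe = ds.sel(Y=slice(35, 70), X=slice(-10, 40))    # subset the region
ens_mean = europe.mean(dim="M")                       # ensemble mean
clim = ens_mean.groupby("S.dayofyear").mean(dim="S")  # daily climatology
anom = ens_mean.groupby("S.dayofyear") - clim         # anomalies
```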

Who will access it? Kathy's group? The S2S summer school that got cancelled? And the subseasonal-to-seasonal prediction community. I'd give it a try for European predictability.

The S2S summer school is not cancelled, but postponed to summer 2021. There is interest from the summer school in accessing data in the cloud for hands-on activities. SubX also has a large user community doing research and applications-based work with it. The dataset is rather large and challenging to deal with; putting it in the cloud and providing example notebooks that make it easier to access would be of great benefit.

Forecast verification is not limited to climpred, but my biased view proposes using climpred.

Lots of people will use various forecast verification and/or diagnostic codes of their own. But it will help to showcase climpred and what it can do.

Are Aneesh Subramanian and Judith Berner involved in this project? I am having déjà vu. We are having a parallel conversation with them over email. I think they are applying to the AWS public data program for what they call the "S2S database". Is that the same thing as SubX?

Yes, Aneesh and Judith are organizing the S2S summer school and want to put the NCAR hindcast data (some of which is already part of SubX and some of which is not) in the cloud.

I realize this is all very confusing. Hopefully I can clarify:

SubX is a multi-model subseasonal hindcast project with a real-time forecast component informing US week 3-4 forecasts (http://cola.gmu.edu/subx/). It consists of US and Canadian models. Some are research models, some are operational models. It is used for both informing operational forecasts and subseasonal predictability, variability, and process research.

The S2S database is part of an international project (http://s2sprediction.net/) containing subseasonal hindcasts from many operational centers around the world. It contains operational models (no research models) and does not provide data in real time. It is primarily used for research on subseasonal predictability, variability, and processes.

Lots of the people involved in SubX are also involved in and using the S2S database as well.

One more thing to note is that there are currently restrictions on redistribution of data from the S2S database. SubX has no such restrictions as the data are already entirely in the public domain.

rabernat commented 3 years ago

Ok, this all sounds great.

This is an operational dataset of commercial value. Plus, it looks like SubX is supported by NOAA. So it really should be a no-brainer to add it to the AWS NOAA Big Data Project S3 archive. That is definitely the best path forward IMO.

@zflamig of AWS might be able to give us an idea who is the right point of contact for this.

kpegion commented 3 years ago

@rabernat Thanks for your help. I've been wanting to figure out how to get SubX data in the cloud for some time, but it's all confusing and not clear where to go for answers.

rabernat commented 3 years ago

it's all confusing and not clear where to go for answers.

TRUE. Cloud data is very powerful and has so much potential. But the processes for managing it are--how to say this nicely?--still evolving! 😆

rabernat commented 3 years ago

Also, regardless of who pays and which cloud it goes in, Pangeo can definitely help on the technical side with producing and uploading the cloud optimized data. Hopefully Pangeo Forge will make it very easy.

What's your timeline?

aaronspring commented 3 years ago

I had no clue about Judith's initiative, but I can support on the software verification and examples side. Chunking with -1 along the initialization dimension S (or at least few chunks in S) is important.
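For clarity, "chunking -1" is the dask/xarray convention for putting an entire dimension into a single chunk; a minimal sketch with illustrative sizes:

```python
# Sketch: keep the initialization dimension S contiguous (a single chunk) so
# verification across inits doesn't stitch together many tiny chunks.
ds = ds.chunk({"S": -1, "L": 1, "Y": 181, "X": 360})  # sizes are illustrative
```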

kpegion commented 3 years ago

Also, regardless of who pays and which cloud it goes in, Pangeo can definitely help on the technical side with producing and uploading the cloud optimized data. Hopefully Pangeo Forge will make it very easy.

What's your timeline?

SubX is somewhat resource-limited right now as our current funding is coming to an end, so we have no specific timeline for the SubX project as a whole. I will reach out to my program manager about resources through the NOAA Big Data Project to help support this effort.

I believe, however, that @bradyrx and @aaronspring wanted to get a subset in sooner to demonstrate climpred's capabilities. This subset would be a good demonstration for getting the entire SubX database into the cloud and would support the S2S summer school, so a spring to early summer time frame is probably best for this?

I also have a parallel conversation going with Aneesh and Judith, so we should probably find a way to merge the conversations. Everyone should just use Github :-)

zflamig commented 3 years ago

You should absolutely submit this to the AWS Open Data Program. There is a dataset application form at https://application.opendata.aws that you can fill out to get the process rolling. Typically, only data that NOAA is directly providing comes in through the NOAA Big Data Program; however, the AWS Open Data Program (formerly the AWS Public Dataset Program; we recently renamed it because everyone called it the open data program anyway) is the equivalent that's open to everyone.

judithberner commented 3 years ago

Finally connected with this thread. Hi all! We are using climpred for S2S verification of simulations with CESM with and without stochastic parameterizations. We would like to extend these scripts to read in SubX and WMO/S2S data and use them for the 2021 NCAR ASP summer colloquium on S2S science. We will have the data for the summer colloquium on a spinning disk, but are also looking to put it in the cloud. I have been talking to the WMO/S2S co-chairs and SubX co-chairs, and it seems the Amazon data policy is compatible with NOAA, but not with WMO/S2S requirements. The recently updated workflow from @bradyrx creates Zarr files, so a conversion of the SubX data to Zarr format is streamlined. For us, there are two ways forward: a) temporary cloud space for the summer school, b) contributing to making SubX an 'open data' dataset.

zflamig commented 3 years ago

Hi @judithberner can you send me an email zlf@amazon and we can discuss? There are a few options available and we can probably find something that will work for your needs.

judithberner commented 3 years ago

Hi @zflamig! Sorry for the delay. I had to secure resources for being able to convert ~100 TB of netCDF data to Zarr, and to find disk space, expertise, and human resources to do so. I just submitted the dataset application form (attached: AWS_summary). @kpegion and @aaronspring, please comment on the application. @zflamig, please let us know the next steps.

aaronspring commented 3 years ago

looks good to me.

kpegion commented 3 years ago

Looks good to me too. Thanks Judith!


kpegion commented 3 years ago

One more piece of clarification:

The license information is here (scroll to the bottom): http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/ There is also a dataset reference along with the BAMS paper reference, and both should be cited. You already listed the BAMS paper reference. The dataset reference is: https://search.datacite.org/works/10.7916/d8pg249h


judithberner commented 3 years ago

Our request was granted, and we are in the process of determining whether we should maintain the data through NCAR's existing open data sponsorship program, NOAA Big Data, or IRI.

rabernat commented 3 years ago

This is great! In the meantime, Pangeo Forge has been progressing. Pangeo Forge is a cloud-based tool that helps with the production and ingestion of cloud-optimized data. It could be used, for example, to populate the cloud bucket of your choice with cloud-optimized Zarr data.

If you'd like to read more about Pangeo Forge, you can do so here: https://pangeo-forge.readthedocs.io/en/latest/

If you're interested in using it for ingesting your data, please open an issue here: https://github.com/pangeo-forge/staged-recipes/issues

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

aaronspring commented 3 years ago

Issue is closed. S2S data is in the ECMWF cloud; access via https://github.com/ecmwf-lab/climetlab-s2s-ai-competition

SubX data is in the making: for further preparatory details, see https://github.com/judithberner/climpred_CESM1_S2S. The plan is to upload SubX to AWS.