We are planning to use Cheyenne for most of this tutorial but have also discussed using Pangeo on GCP. We should mirror any tutorial datasets that we plan to use on Cheyenne to GCP (as zarr datasets).

First step: identify tutorial datasets. @kmpaul, thoughts?
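A minimal sketch of what that mirroring might look like, assuming xarray + gcsfs (the bucket and project names here are placeholders, not anything we have set up):

```python
import gcsfs
import xarray as xr

# Open a candidate dataset lazily as a dask-backed xarray Dataset.
ds = xr.open_mfdataset("/path/to/tutorial/dataset/*.nc")

# "my-gcp-project" and "pangeo-tutorial-data" are hypothetical; we would
# need to create a bucket and sort out write credentials first.
fs = gcsfs.GCSFileSystem(project="my-gcp-project")
store = fs.get_mapper("pangeo-tutorial-data/example.zarr")

# Write the dataset to the bucket in zarr format.
ds.to_zarr(store)
```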
Good question. Ideally, I would choose datasets (and tutorial exercises) that focus on what attendees want. However, that information is hard to gather, since it seems we will not know who has (or has not) signed up for the tutorial beforehand. So it could be a very mixed crowd.
Given that, I think we should choose datasets with which we are familiar. And since I'm not a domain scientist, I'm not that familiar with any datasets. Maybe you know of some that we can use, but I think it's more important that we can get them to GCP. And maybe getting them to GCP is harder than pulling down from GCP to Cheyenne? I'm not sure; I've never pushed a dataset to the cloud.
The datasets should be large enough to need 100 cores, but small enough that operations on them are feasible within the tutorial. I'm not sure I have a feel for that yet. Do you have suggestions on that front? For example, if we took a high-resolution dataset but pulled out only some of the "important" variables (e.g., for atmospheric data, maybe temperature, pressure, precip, wind speeds...?), we could reduce the dataset size to something transferable to/from the cloud (instead of many TB).
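As a sketch of the kind of subsetting I mean (the variable names are made up for illustration):

```python
import xarray as xr

# Open the full high-resolution dataset lazily.
ds = xr.open_mfdataset("/path/to/high_res_dataset/*.nc")

# Keep only a handful of "important" variables; these names are
# hypothetical and would depend on the dataset we pick.
keep = ["temperature", "pressure", "precip", "u_wind", "v_wind"]
subset = ds[keep]

# Check how much smaller the subset is before transferring anything.
print(f"full: {ds.nbytes / 1e12:.2f} TB, subset: {subset.nbytes / 1e12:.2f} TB")
```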
Is that reasonable? Do you have any suggestions? (We can also pull anything from the RDA: https://rda.ucar.edu/, which is currently down due to problems with Glade, but hopefully will be up again soon.)
@jhamman The RDA is back up, if you want to look for particular datasets that might be useful. The RDA can also point you to paths on Glade where these collections exist.
I have some ideas; the global hourly 0.5-degree land surface air temperature dataset is one of them, and there are others. Whatever we choose, we should identify their locations on Glade (and if we need them on the cloud, we will need to transfer them).
I think we want to land in the 1TB range. The global hourly 0.5 deg temperature dataset looks good. Let's focus our examples on that (unless there are other suggestions).
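As a back-of-envelope check on that (assuming a single float32 variable on a 720 x 360 global grid at 0.5 degrees, with hourly output):

```python
# Rough, uncompressed size estimate for a global hourly 0.5-degree dataset.
nlat, nlon = 360, 720        # 0.5-degree global grid
steps_per_year = 24 * 365    # hourly output
bytes_per_value = 4          # float32

gb_per_year = nlat * nlon * steps_per_year * bytes_per_value / 1e9
print(f"~{gb_per_year:.1f} GB/year")  # ~9.1 GB/year

# A few decades of data per product therefore lands in the hundreds of GB,
# so a collection with several products plausibly sits near the 1 TB target.
```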
Sounds good. I'll get pointers to the files on Glade.
Ok. Fortunately, the datasets can be found on Glade at:
/glade/p/rda/data/
followed by the dataset reference number (e.g., ds193.0 for the Global Hourly 0.5-degree land surface air temperature). Under that directory, there will be "product" directories (each dataset can hold multiple products). So, for example, if you want the NCAR/NCEP product (NRA) in the ds193.0 dataset, then the netCDF files will be found here:
/glade/p/rda/data/ds193.0/NRA/*.nc
I didn't realize that there were multiple products in each dataset before I went searching, so apologies for that. The ds193.0 dataset has 4 similar products (spanning different lengths of time); the NCAR/NCEP (NRA) product above is one of them.
There is only one variable in this dataset (T2M, "2 meter air temperature"), and it is 3-dimensional (lat, lon, time).
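To confirm the layout, something like this should open the NRA product as one dask-backed dataset (the chunk size is a guess, and I'm assuming the time dimension is named "time"):

```python
import xarray as xr

# Open all NRA netCDF files for ds193.0 lazily as a single dataset.
ds = xr.open_mfdataset("/glade/p/rda/data/ds193.0/NRA/*.nc",
                       chunks={"time": 8760})

# The single variable: 2 meter air temperature over (lat, lon, time).
t2m = ds["T2M"]
print(t2m.dims, t2m.shape)
```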
done (for now) in 19cb09e291fc7f78dd3e493383fc43b68bd6ef4a