judithberner / climpred_CESM1_S2S

Zarrifying S2S data with CESM and SubX #1

Open judithberner opened 3 years ago

judithberner commented 3 years ago

Hi, I'm opening this issue so that we can discuss the next steps of zarrifying the S2S (SubX and CESM) data in a small group.

Here is a recap of what I understand. Please correct me if I get things wrong. Aaron S. stated that some of the concatenation of the forecasts could be done via intake-esm, as was done for CMIP6 (and, I believe, for the LENS data as well).

However, our data is more coherent, so we might not need this feature and could instead put different forecasts into a single zarr object. We might have to keep some models separate because they can have different start/initialization times (e.g., weekly starts on Mondays vs. Wednesdays).

Aaron S. suggests two options: (1) all variables of one model in a single zarr object (TBs), or (2) one variable of one model (~100 GB) per zarr object.

There is a related issue (zarrifying of GEFS data) at https://github.com/pangeo-forge/staged-recipes/issues/17, where Ryan Abernathey suggests: "IMO, if all the variables have the exact same dimensions and coordinates, they should be combined into a single zarr group. In addition to being more convenient for the user, it's more compact, since you don't have to duplicate all the coordinates."

Aaron K. states that they don't plan on using intake-esm, but possibly STAC (based on discussions with Ryan Abernathey).

So, in summary, what I am hearing is that, e.g., the three CESM simulations (all with the same initialization times) could be put into a single zarr object. All concatenation would happen during creation of the zarr store, and a single object would be uploaded. Alternatively, we could put each model (all variables and initializations) into a single zarr object. The latter might be preferable for maximal similarity between the CESM and SubX simulations.
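For concreteness, here is a minimal sketch of the first option, assuming hypothetical per-simulation input files and an illustrative "model" dimension name; this is not an agreed-upon pipeline:

```python
# Sketch: stack the three CESM simulations (same initialization times)
# along a new "model" dimension and write a single zarr store.
# File names, labels, and the output path are placeholders.
import xarray as xr

paths = ["sim1.nc", "sim2.nc", "sim3.nc"]        # hypothetical inputs
names = ["cesm-sim1", "cesm-sim2", "cesm-sim3"]  # hypothetical labels

datasets = [xr.open_dataset(p, chunks={}) for p in paths]
model = xr.DataArray(names, dims="model", name="model")

# All three share initialization times, so they concatenate cleanly.
combined = xr.concat(datasets, dim=model)
combined.to_zarr("cesm_s2s.zarr", consolidated=True)
```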

Please let me know your thoughts.

Climpred team: @aaronspring; CESM team: @abjaye; IRI team: @ikhomyakov, @awrobertson, @aaron-kaplan

ikhomyakov commented 3 years ago

Hi All, we also need to decide on chunking strategy (chunk dimensions and approximate chunk sizes).

judithberner commented 3 years ago

@aaronspring To get to the CESM2 data, ssh into casper.ucar.edu with your username and Duo password (let me know if you have problems getting in). The CESM2 data is at /glade/campaign/cesm/development/cross-wg/S2S/CESM2/S2SHINDCASTS. SubX classes the variables into three priorities, p1, p2, and p3, which are stored in separate directories:

- p1 (10 variables): pr rlut tas_2m ts ua_200 ua_850 va_200 va_850 zg_200 zg_500
- p2 (19 variables): cape hfss_sfc mrro psl rzsm snc sty_sfc tasmin_2m uvas wap_500 hfls_sfc huss_850 mrso rad_sfc sic stx_sfc tasmax_2m ua_100 va_100
- p3 (14 variables): ta_10 ta_100 ta_30 ta_50 ua_10 ua_30 ua_50 va_10 va_30 va_50 zg_10 zg_30 zg_50 zg_850

The NCAR JupyterHub (https://jupyterhub.ucar.edu/) should be able to see this directory. Note that you have to sign into casper, not cheyenne (you can't see the campaign storage from cheyenne). Scratch is here: /glade/scratch/aspring, but I have also asked for write permissions to /glade/campaign/cesm/development/cross-wg/S2S/aspring. Abby has a lot of the intermediate zarr files at /glade/campaign/cesm/development/cross-wg/S2S/jaye.

judithberner commented 3 years ago

@aaronspring Let me and @abjaye know how to help. Presumably, in the end we want a script that reads in all the data (variables p1-p3, members, inits) for CESM2 and converts it to zarr. We should probably start with a single variable to figure out the chunking.
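To make the starting point concrete, here is a rough sketch of such a script for a single p1 variable; the glob pattern, dimension names, and chunk sizes are assumptions to be refined:

```python
# Sketch: read all initializations of one p1 variable and write a zarr store.
# The file layout (one file per init inside the p1 directory), dimension
# names, and chunks are illustrative guesses, not the actual layout.
import xarray as xr

var = "zg_500"
root = "/glade/campaign/cesm/development/cross-wg/S2S/CESM2/S2SHINDCASTS"
files = f"{root}/p1/*{var}*.nc"  # assumed pattern

ds = xr.open_mfdataset(files, combine="nested", concat_dim="init", parallel=True)

# Chunk lead time down, per the climpred discussion below.
ds = ds.chunk({"init": 1, "lead": 1})
ds.to_zarr(f"/glade/scratch/aspring/{var}.zarr", consolidated=True)
```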

judithberner commented 3 years ago

> Hi All, we also need to decide on chunking strategy (chunk dimensions and approximate chunk sizes).

Yes, @ikhomyakov! @aaronspring and I discussed this a little. Here is one of his statements: "From climpred point of view, we loop internally over lead time, so lead time is anyways chunked down by climpred." I assume this would generalize to typical use of S2S data. Plus, we will use climpred in the tutorial scripts. Also see the script we are using (from Riley Brady), which has chunking information: https://github.com/judithberner/climpred_CESM1_S2S/blob/main/0.01_concatenate_S2S.ipynb
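Some back-of-the-envelope arithmetic for the chunk sizes, with hypothetical dimension lengths just to illustrate the calculation:

```python
# Rough chunk-size estimate; the grid and ensemble sizes here are
# assumptions, not the actual CESM2/SubX values.
nlat, nlon = 181, 360    # assumed 1-degree global grid
nmember = 11             # hypothetical ensemble size
bytes_per_value = 4      # float32

# One chunk per (init, lead), keeping all members and the full grid together:
chunk_bytes = nmember * nlat * nlon * bytes_per_value
print(f"{chunk_bytes / 1e6:.2f} MB per chunk")  # ~2.87 MB
```

Common zarr/dask guidance targets chunks of tens of MB, so it may be worth grouping several initializations into one chunk even if lead stays chunked to 1.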

aaronspring commented 3 years ago

@aaron-kaplan how are you using STAC? As a way to index all your IRI data in the cloud? Or specifically for SubX from IRI?

What we need for the summer school is an easy and convenient way to access SubX zarr stores in the cloud. intake-esm has been convenient for concatenating individual simulation outputs into higher dimensionality, because they just took CMIP6 netCDF output and converted each file to a zarr. We could do the same and use intake-esm to concat files together, but this would lead to fine chunking, which we don't want.
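For reference, the intake-esm pattern in question looks roughly like this; the catalog file and search column are placeholders, not a real SubX/CESM catalog:

```python
# Sketch of the intake-esm workflow used for CMIP6-style catalogs.
import intake

cat = intake.open_esm_datastore("catalog.json")  # hypothetical catalog
subset = cat.search(variable_id="zg_500")        # hypothetical column name
dsets = subset.to_dataset_dict()                 # dict of xarray Datasets
```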

However, with SubX we can upload a preprocessed multi-dim zarr. That would be much better: https://github.com/pangeo-forge/staged-recipes/issues/17#issuecomment-766872196. The end result will look like the IRI's OPeNDAP for SubX, with dimensions S for initialization, M for member, and L for lead, i.e. much more user-friendly.

On Monday I will take a look at the CESM1 S2S data on glade and test how large multi-dim zarrs per model would work with intake-esm. Maybe intake-esm wouldn't be needed then and could be replaced by a very simple open_subx function?
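A minimal sketch of what such an open_subx helper could look like, assuming one consolidated zarr store per model in a hypothetical cloud bucket:

```python
# Sketch: open a preprocessed multi-dim SubX zarr store for one model.
# The bucket path and store naming scheme are assumptions.
import xarray as xr

def open_subx(model, bucket="gs://hypothetical-bucket/subx"):
    """Return a Dataset with SubX-style dims S (init), M (member), L (lead)."""
    return xr.open_zarr(f"{bucket}/{model}.zarr", consolidated=True)
```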

aaron-kaplan commented 3 years ago

@aaronspring To be clear, we are currently not using STAC at all. But our tentative plan is, as you said, to use it to index all of the IRI Data Library in the cloud.

Thanks for explaining how you've been using intake-esm to combine multiple simulation outputs into a single entity. I agree that merging multiple simulation outputs into a single zarr store seems like a better approach.