leap-stc / ClimSim

An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators.
https://leap-stc.github.io/ClimSim/
Apache License 2.0

High-res zarr products - build tracking thread #38

Open cisaacstern opened 1 year ago

cisaacstern commented 1 year ago

I am currently building zarr products for the high-res data. Opening this thread so we have a public place to track progress on these efforts. By way of background:

My second full-scale attempt at running these jobs has now been running for a little over 2 days:

[image]

The first attempt crashed after 3 days, and I think I've fixed the bug that caused that crash. So if this attempt just works, the jobs should be done by early next week, I'd guess. If they crash, I'll restart them early next week, and the next shot we'd have would be the end of next week (budgeting a couple of days per attempt).

cisaacstern commented 1 year ago

Monday update: of the two jobs left running over the weekend, the mlo job apparently succeeded, whereas the mli job failed:

[Screenshot, 2023-08-21 4:01:17 PM]

Still working on debugging the cause of the mli failure. As for mlo, the output dataset can be opened as shown below. A few caveats:

And a few notes on things that seem to have worked (please correct me if anything here seems inaccurate):

The mlo (prelim/preview only, no guarantees yet! 😄 ) dataset can be loaded as follows:

```python
import xarray as xr

path = "gs://leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5882522942-1/climsim-highres-mlo.zarr"
ds = xr.open_dataset(path, engine="zarr", chunks={})  # requires `gcsfs`, takes ~4 mins on my laptop
ds.nbytes / 1e12  # -> 13.36924905984 TB
len(ds.time)  # -> 210240
ds.state_t.attrs  # -> {'long_name': 'Air temperature', 'units': 'K'}
ds
```

```
Dimensions:         (time: 210240, ncol: 21600, lev: 60)
Coordinates:
  * time            (time) object 0001-02-01 00:00:00 ... 0009-01-31 23:40:00
Dimensions without coordinates: ncol, lev
Data variables: (12/16)
    cam_out_FLWDS   (time, ncol) float64 dask.array
    cam_out_NETSW   (time, ncol) float64 dask.array
    cam_out_PRECC   (time, ncol) float64 dask.array
    cam_out_PRECSC  (time, ncol) float64 dask.array
    cam_out_SOLL    (time, ncol) float64 dask.array
    cam_out_SOLLD   (time, ncol) float64 dask.array
    ...              ...
    state_q0003     (time, lev, ncol) float64 dask.array
    state_t         (time, lev, ncol) float64 dask.array
    state_u         (time, lev, ncol) float64 dask.array
    state_v         (time, lev, ncol) float64 dask.array
    tod             (time) int32 dask.array
    ymd             (time) int32 dask.array
Attributes:
    calendar:  NO_LEAP
    fv_nphys:  2
    ne:        30
```

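
As a sanity check, the reported length of `time` and the total size can be reproduced with plain arithmetic from the dimensions in the repr. This is only a sketch: the 20-minute step is read off the last timestamp (`23:40:00`), the split of the four hidden variables into 2-D and 3-D fields is an inference (it is not visible in the truncated repr), and the object-dtype `time` coordinate is assumed to count 8 bytes per element.

```python
# Time axis: 0001-02-01 00:00 through 0009-01-31 23:40 on a NO_LEAP calendar,
# i.e. 8 model years of 365 days each at a 20-minute step.
steps_per_day = 24 * 60 // 20
n_time = 8 * 365 * steps_per_day  # should equal len(ds.time)
n_col, n_lev = 21600, 60

bytes_2d = n_time * n_col * 8     # one float64 (time, ncol) variable
bytes_3d = bytes_2d * n_lev       # one float64 (time, lev, ncol) variable
bytes_int32_1d = n_time * 4       # tod or ymd
bytes_time = n_time * 8           # ASSUMPTION: 8 bytes per object-dtype element

# ASSUMPTION: the 16 data variables are 8 two-dimensional fields,
# 6 three-dimensional fields, and the 2 int32 fields (tod, ymd).
# Only 12 of the 16 are visible in the repr; the rest is inferred.
n_2d, n_3d = 8, 6
total_bytes = n_2d * bytes_2d + n_3d * bytes_3d + 2 * bytes_int32_1d + bytes_time

print(n_time)
print(total_bytes / 1e12)  # compare with ds.nbytes / 1e12
```

Under those assumptions the total lands on the same figure as `ds.nbytes`, which suggests the store covers the full 8 years without gaps along `time`.
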
duncanwp commented 2 months ago

Hey @cisaacstern - I'd love to use this version of ClimSim so I can just grab a spatial slice of the data, but I can't seem to access the above URL. Did you resolve the issue in the end? Is there a new zarr url I can use (hopefully for both mlo and mli)?
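
For reference, a spatial slice of a dataset laid out like the preview above might be taken along `ncol` with `isel`, since `ncol` has no coordinate values to select by label. The sketch below runs on a tiny synthetic stand-in (made-up sizes and random values, only the dimension and variable names are borrowed from the repr), so it does not touch the bucket.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in with the same dimension names as the preview store.
# Sizes and values are invented to keep the example fast and offline.
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {
        "state_t": (("time", "lev", "ncol"), rng.normal(280.0, 10.0, (6, 4, 12))),
        "cam_out_PRECC": (("time", "ncol"), rng.random((6, 12))),
    }
)

# Positional selection: the first 3 columns and the first 2 time steps.
subset = ds.isel(ncol=slice(0, 3), time=slice(0, 2))

print(subset.state_t.shape)
print(subset.cam_out_PRECC.shape)
```

On a store opened with `chunks={}` the same `isel` call should stay lazy, so only the chunks overlapping the requested columns would be read when the values are actually computed.
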

cisaacstern commented 2 months ago

Hi @duncanwp! I haven't been keeping up with this particular issue lately, but @jbusecke may have some insight!

jbusecke commented 1 month ago

Howdy @duncanwp. I have moved all ClimSim-related ingestion stuff to https://github.com/leap-stc/climsim_feedstock

As you can tell from https://github.com/leap-stc/climsim_feedstock/pull/7, I am still struggling with ingesting the lowres data! I am hesitant to even try the highres data until then.

There is some ClimSim data in gs://leap-persistent-ro/sungdukyu, but I am unsure if it is the lowres or highres (maybe @sungdukyu or @SammyAgrawal can provide clarity).

Please let me know if this is urgent to you and I can shift priorities to try to get this to work.

I also opened a PR to add ClimSim to our catalog (https://catalog.leap.columbia.edu). We are not able to share links to specific datasets quite yet (tracking that in https://github.com/leap-stc/data-management/issues/129), so for any future updates I recommend checking the catalog periodically!

duncanwp commented 1 month ago

Brilliant, thanks @jbusecke. I'll keep an eye on that repo, but it's not urgent as I can work around it for now.

SammyAgrawal commented 1 month ago

The gs://leap-persistent-ro/sungdukyu cloud bucket contains the low resolution data, specifically the first 8 years.