Adding ClimSim recipe - Githubissues

jbusecke commented 1 year ago

This is a handoff PR for @cisaacstern, probably messy.

Aiming to close #28

cisaacstern commented 1 year ago

pre-commit.ci autofix

cisaacstern commented 1 year ago

👋 @sungdukyu I'm currently working on this Pangeo Forge recipe to build a Zarr copy of ClimSim.

I have a question for you regarding the number of time steps. The dataset README indicates that simulations were saved at 20 minute intervals for 10 simulated years. Browsing the ClimSim_high-res repo, I see directories for months 0001-02 through 0009-01, which AFAICT represent 8, not 10, simulated years.

I believe the following code accurately captures the date range present on Hugging Face (1200 second, i.e. 20 minute, samples for months 0001-02 through 0009-01). This gives me 420,768 files per dataset, not 525,600 as indicated in the README.

In [1]: import pandas as pd

In [2]: times = pd.date_range('0001-02-01', '0009-02-01', freq='1200S', unit='s', inclusive='left')

In [3]: times[:3]
Out[3]: 
DatetimeIndex(['0001-02-01 00:00:00', '0001-02-01 00:20:00',
               '0001-02-01 00:40:00'],
              dtype='datetime64[s]', freq='1200S')

In [4]: times[-3:]
Out[4]: 
DatetimeIndex(['0009-01-31 23:00:00', '0009-01-31 23:20:00',
               '0009-01-31 23:40:00'],
              dtype='datetime64[s]', freq='1200S')

In [5]: len(times) * 2
Out[5]: 420768

Am I correctly understanding what files exist on Hugging Face? If so, is the intention to add an additional two years to Hugging Face, or is the README perhaps inaccurate?

Apologies if I am missing something obvious here, which is entirely possible!

sungdukyu commented 1 year ago

@cisaacstern You are correct, we only released the dataset for the first 8 years. The following 2 years (which are missing from the Hugging Face repos) are reserved for the test sets. The test sets will be released only after a planned Kaggle competition. Sorry for the inconvenience, we will make sure to update the README soon!

cisaacstern commented 1 year ago

@sungdukyu awesome! Good to know I wasn't missing anything.

Just friended you on LinkedIn and according to your page there you live in Santa Monica? I live in Santa Monica as well, we should grab coffee sometime 😃 . What's the best way to be in touch offline? I'm cstern@ldeo.columbia.edu.

cisaacstern commented 1 year ago

As of the latest commit, I can run the recipe in climsim.py locally via

with beam.Pipeline() as p:
    p | climsim

And generate a zarr store that is openable, like so

# here, `td` is the `tempdir.TemporaryDirectory` defined in `climsim.py`
ds = xr.open_dataset(td.name + "/climsim-highres-train.zarr", engine="zarr", consolidated=True)

Full xarray.Dataset repr

``` Dimensions: (time: 2, ncol: 21600, lev: 60) Coordinates: * time (time) int64 102010 102011200 Dimensions without coordinates: ncol, lev Data variables: cam_in_ALDIF (time, ncol) float64 ... cam_in_ALDIR (time, ncol) float64 ... cam_in_ASDIF (time, ncol) float64 ... cam_in_ASDIR (time, ncol) float64 ... cam_in_ICEFRAC (time, ncol) float64 ... cam_in_LANDFRAC (time, ncol) float64 ... cam_in_LWUP (time, ncol) float64 ... cam_in_OCNFRAC (time, ncol) float64 ... cam_in_SNOWHICE (time, ncol) float64 ... cam_in_SNOWHLAND (time, ncol) float64 ... cam_out_FLWDS (time, ncol) float64 ... cam_out_NETSW (time, ncol) float64 ... cam_out_PRECC (time, ncol) float64 ... cam_out_PRECSC (time, ncol) float64 ... cam_out_SOLL (time, ncol) float64 ... cam_out_SOLLD (time, ncol) float64 ... cam_out_SOLS (time, ncol) float64 ... cam_out_SOLSD (time, ncol) float64 ... mli_state_q0001 (time, lev, ncol) float64 ... mli_state_q0002 (time, lev, ncol) float64 ... mli_state_q0003 (time, lev, ncol) float64 ... mli_state_t (time, lev, ncol) float64 ... mli_state_u (time, lev, ncol) float64 ... mli_state_v (time, lev, ncol) float64 ... mli_tod (time) int32 ... mli_ymd (time) int32 ... mlo_state_q0001 (time, lev, ncol) float64 ... mlo_state_q0002 (time, lev, ncol) float64 ... mlo_state_q0003 (time, lev, ncol) float64 ... mlo_state_t (time, lev, ncol) float64 ... mlo_state_u (time, lev, ncol) float64 ... mlo_state_v (time, lev, ncol) float64 ... mlo_tod (time) int32 ... mlo_ymd (time) int32 ... pbuf_CH4 (time, lev, ncol) float64 ... pbuf_COSZRS (time, ncol) float64 ... pbuf_LHFLX (time, ncol) float64 ... pbuf_N2O (time, lev, ncol) float64 ... pbuf_SHFLX (time, ncol) float64 ... pbuf_SOLIN (time, ncol) float64 ... pbuf_TAUX (time, ncol) float64 ... pbuf_TAUY (time, ncol) float64 ... pbuf_ozone (time, lev, ncol) float64 ... state_pmid (time, lev, ncol) float64 ... state_ps (time, ncol) float64 ... Attributes: calendar: NO_LEAP fv_nphys: 2 ne: 30 ```

Next steps:

[ ] Remove local testing code from PR and run on Dataflow via GitHub Action
[ ] Make sure generated dates conform to noleap calendar
[ ] Figure out what xarray backend to use for opening directly from a cloud cache (without using copy_to_local=True + netcdf4); see inline comment.

Items for discussion:

For variable names shared between mli and mlo, I've prepended the dataset type (either mli_ or mlo_) to the variable name in the output zarr. This way all of the data can live in a single zarr namespace. Do we think this is preferable to the alternative of two separate zarr stores for mli and mlo?
The source files appear to encode time as data variables ymd + tod. I've parsed those values into a time dimension here, but the format of the values is crude. What is the best format to parse them into, np.datetime64, something else?

rabernat commented 1 year ago

Rather than mlo_state_t, I would prefer state_t_out

sungdukyu commented 1 year ago

Variable names are already published in the ClimSim paper, so I am against changing the variable names.
For time, np.datetime64 is fine.

rabernat commented 1 year ago

Two points:

we can't keep the same names and combine the two datasets, since they use the same variable names. So we would need to have two separate datasets
CF conventions recommend not relying on variable names for anything. Quantities should be identified using metadata, e.g. standard_name.

sungdukyu commented 1 year ago

I am unsure if combining mli and mlo into a single zarr is necessary. It is customary to provide input and output of ML datasets in separate files.

cisaacstern commented 1 year ago

Seems like the best path forward is:

mli and mlo as separate zarr stores, without changing variable names
additional metadata (standard_name, etc.) added to the attributes of all variables in each store, using this google sheet as reference.

?

sungdukyu commented 1 year ago

@cisaacstern If there's no particular reason (for performance's sake) to keep mli and mlo into a single zarr, I'd suggest two separate zarrs. I think it's less confusing to keep input and output variables in separate files. And also yes for metadata!

sungdukyu commented 1 year ago

Maybe one useful thing (for mlo zarr) would be converting the output state variables to tendency variables (Check "Preprocessing workflow" on p5 of Climsim paper). But optional.

cisaacstern commented 1 year ago

Maybe one useful thing (for mlo zarr) would be converting the output state variables to tendency variables (Check "Preprocessing workflow" on p5 of Climsim paper). But optional.

Like so:

https://github.com/leap-stc/ClimSim/blob/8c19fa08f51a59c11122ed1d873e641556708b7e/preprocessing/data_utils.py#L168-L169

?

cisaacstern commented 1 year ago

Maybe one useful thing (for mlo zarr) would be converting the output state variables to tendency variables (Check "Preprocessing workflow" on p5 of Climsim paper). But optional.

Like so:

https://github.com/leap-stc/ClimSim/blob/8c19fa08f51a59c11122ed1d873e641556708b7e/preprocessing/data_utils.py#L168-L169

?

I like the idea of taking care of any/all generic preprocessing that all (or most) data users would otherwise have to do on their own. This fulfills the spirit of "analysis ready (AR)" part of "ARCO".

In terms of implementation, I now see that this requires access to both mli & mlo for a given timestep to calculate, which means that I don't think we can include it in the primary Zarr creation pipelines: since we're going to make separate Zarr stores for mli & mlo, we won't have access to both datasets at once during the primary pipelines.

This could be appended as a secondary pipeline, however, to be run after both the mli & mlo primary pipelines have completed, and would perhaps being a nice use case for https://github.com/google/xarray-beam.

sungdukyu commented 1 year ago

@cisaacstern Great!

For variable names for preprocessed tendencies, I suggest the prefix "ptend", e.g., state{q0001,q0002,q0003,t,u,v} to ptend_{q0001,q0002,q0003,t,u,v}.

cisaacstern commented 1 year ago

Thanks @jbusecke for setting me up with access to LEAP Dataflow. I have the pruned mlo recipe running successfully on Dataflow from this PR:

import xarray as xr
mlo = "gs://leap-persistent/data-library/climsim-highres-mlo-595733423-5649002558-1/climsim-highres-mlo.zarr"
ds = xr.open_dataset(mlo, engine="zarr", consolidated=True)

``` Dimensions: (time: 2, ncol: 21600, lev: 60) Coordinates: * time (time) object 0001-02-01 00:00:00 0001-02-01 00:20:00 Dimensions without coordinates: ncol, lev Data variables: (12/16) cam_out_FLWDS (time, ncol) float64 ... cam_out_NETSW (time, ncol) float64 ... cam_out_PRECC (time, ncol) float64 ... cam_out_PRECSC (time, ncol) float64 ... cam_out_SOLL (time, ncol) float64 ... cam_out_SOLLD (time, ncol) float64 ... ... ... state_q0003 (time, lev, ncol) float64 ... state_t (time, lev, ncol) float64 ... state_u (time, lev, ncol) float64 ... state_v (time, lev, ncol) float64 ... tod (time) int32 ... ymd (time) int32 ... Attributes: calendar: NO_LEAP fv_nphys: 2 ne: 30 ```

For the moment, I've left ymd and tod as data variables, which I thought might be useful for quality control of the time dimension preprocessing logic (to make sure the time dimension is being added correctly). Maybe we'll drop those data vars before running the full recipe.

I'm hitting an strange error when running mli which I'll look into debugging now...

File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/rechunking.py",
   line 162, in combine_fragments fragments.sort(key=_sort_index_key) # this should sort by index
   AttributeError: '_ConcatSequence' object has no attribute 'sort'
   [while running 'Create|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr/StoreToZarr/Rechunk/MapTuple(combine_fragments)-ptransform-31']

cisaacstern commented 1 year ago

I'm hitting an strange error when running mli which I'll look into debugging now...

File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/rechunking.py",
   line 162, in combine_fragments fragments.sort(key=_sort_index_key) # this should sort by index
   AttributeError: '_ConcatSequence' object has no attribute 'sort'
   [while running 'Create|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr/StoreToZarr/Rechunk/MapTuple(combine_fragments)-ptransform-31']

Thanks so much @alxmrs for an in-depth synchronous debugging session on this just now. In general, the hypothesis Alex introduced was (and Alex, correct me if I'm missing something):

This is presumably a serialization bug (it doesn't appear in local execution, and also because it involves beam coders)
Something about defining two pipelines in the same python module might be causing the serialization issue. This is based on the observation that we are defining two nearly identical pipelines which share code. Perhaps something about the shared references in particular was resulting in one of the two pipelines being serialized improperly.
We tried a few things to possibly resolve this on the call, including wrapping the shared transforms in a PTransform subclass, so that the two pipelines would not share a single instance of OpenAndPreprocess. This did not fix the issue.
Next steps we discussed for debugging included isolating the two pipeline objects from one-another somehow, and seeing if the issue would be resolved that way.

The latest commit comments-out the mlo pipeline from this PR, and interestingly (amazingly?) the mli pipeline is still failing with the same error as reported above! This is really something, because AFAICT, the mli and mlo pipelines are nearly identical, and while I can certainly imagine that something specific to the source data for each pipeline could differentiate them somehow, the nature of the error (something to do with coders/type casting) doesn't feel like it should be affected by the contents of the underlying data...

Ok, will continue chipping away at this... planning to pair with @yuvipanda on this on Thursday if I haven't solved it on my own by then...

cisaacstern commented 1 year ago

Given that this appears to have something to do with Beam coders, and IIUC coders are inferred based on type hints (?), I wonder if this comment right above combine_fragments may be a clue:

# TODO: figure out a type hint that beam likes

I'm not sure what about the current hints:

def combine_fragments(
    group: GroupKey, fragments: List[Tuple[Index, xr.Dataset]]
) -> Tuple[Index, xr.Dataset]

beam dislikes, or how to get beam to raise warnings about this. But this may be worth exploring further. Still doesn't answer why mli would encounter this issue where mlo does not...

alxmrs commented 1 year ago

Here is a good Beam-specific type hinting guide.

https://beam.apache.org/documentation/sdks/python-type-safety/

TIL: PCollection is a generic type. E.g. you can write beam.PCollection[int].

If that doesn't help, I think with_input_types or with_output_types inside the composite transform may do the trick.

cisaacstern commented 1 year ago

Here is a good Beam-specific type hinting guide.

https://beam.apache.org/documentation/sdks/python-type-safety/

@alxmrs this doc is incredibly informative! Thank you so much for sharing. I feel like there is a path forward with that I am learning from reading that... will follow up here shortly once I've tried a few more things.

cisaacstern commented 1 year ago

TIL: PCollection is a generic type. E.g. you can write beam.PCollection[int].

This did not work for me.

If that doesn't help, I think with_input_types or with_output_types inside the composite transform may do the trick.

Nor did this.

But I did get mli to run using something else you suggested on our call, @alxmrs, which was manually casting the (improperly decoded) value to a list, like so: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/548/commits/cd2f0f8a4160714c094395bdfaf883c927a99935.

This does unblock the mli recipe here, but it feels like a bit of a code smell to me, insofar as it's handling decoding at the wrong layer of the stack. This does make me curious to try using a custom beam Coder as described in the Python type safety doc linked above. Perhaps I'll unblock this with the trick here, and take the following two actions as well:

open an issue on -recipes re: possibility of a custom coder
try to create an MRE for this issue, to make it easier to work on a "real"/better fix

Thanks again Alex for all your help on this!

cisaacstern commented 1 year ago

Both recipes are now working in pruned form:

```python In [1]: import xarray as xr In [2]: mli = "gs://leap-persistent-ro/data-library/climsim-highres-mli-595733423-5686203778-1/climsim-highres-mli.zarr" In [3]: mlo = "gs://leap-persistent-ro/data-library/climsim-highres-mlo-595733423-5686102662-1/climsim-highres-mlo.zarr" In [4]: ds_mli = xr.open_dataset(mli, engine="zarr", consolidated=True) In [5]: ds_mlo = xr.open_dataset(mlo, engine="zarr", consolidated=True) In [6]: ds_mli Out[6]: Dimensions: (time: 2, ncol: 21600, lev: 60) Coordinates: * time (time) object 0001-02-01 00:00:00 0001-02-01 00:20:00 Dimensions without coordinates: ncol, lev Data variables: (12/29) cam_in_ALDIF (time, ncol) float64 ... cam_in_ALDIR (time, ncol) float64 ... cam_in_ASDIF (time, ncol) float64 ... cam_in_ASDIR (time, ncol) float64 ... cam_in_ICEFRAC (time, ncol) float64 ... cam_in_LANDFRAC (time, ncol) float64 ... ... ... state_q0003 (time, lev, ncol) float64 ... state_t (time, lev, ncol) float64 ... state_u (time, lev, ncol) float64 ... state_v (time, lev, ncol) float64 ... tod (time) int32 ... ymd (time) int32 ... Attributes: calendar: NO_LEAP fv_nphys: 2 ne: 30 In [7]: ds_mlo Out[7]: Dimensions: (time: 2, ncol: 21600, lev: 60) Coordinates: * time (time) object 0001-02-01 00:00:00 0001-02-01 00:20:00 Dimensions without coordinates: ncol, lev Data variables: (12/16) cam_out_FLWDS (time, ncol) float64 ... cam_out_NETSW (time, ncol) float64 ... cam_out_PRECC (time, ncol) float64 ... cam_out_PRECSC (time, ncol) float64 ... cam_out_SOLL (time, ncol) float64 ... cam_out_SOLLD (time, ncol) float64 ... ... ... state_q0003 (time, lev, ncol) float64 ... state_t (time, lev, ncol) float64 ... state_u (time, lev, ncol) float64 ... state_v (time, lev, ncol) float64 ... tod (time) int32 ... ymd (time) int32 ... Attributes: calendar: NO_LEAP fv_nphys: 2 ne: 30 ```

Just going to take a last look at how the production deployment workflow works, and then merge.

jbusecke commented 1 year ago

Amazing work @cisaacstern. Thank you!

leap-stc / data-management

Adding ClimSim recipe #33