marbl-ecosys / HiRes-CESM-analysis

Notebooks and tools for validating the 0.1 degree POP / CICE run with ocean BGC
http://hires-cesm-analysis.dokku.projectpythia.org/Interactive_Dashboard

Expand checks in compare_ts_and_hist notebooks #33

Open mnlevy1981 opened 4 years ago

mnlevy1981 commented 4 years ago

These notebooks are designed to verify that the time series files we generate are bit-for-bit identical to the history files produced by the model. Right now, the notebooks rely on diag_metadata.yaml to determine which variables are compared, which means:

  1. Only a subset of variables from pop.h are checked
  2. For the 3D fields listed in the YAML file, we only check a subset of the vertical levels
  3. The other streams (pop.h.nday1, pop.h.nyear1, cice.h, cice.h1) are not checked at all

Perhaps a smart parallelization technique would make it feasible to check all variables across all streams?
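One way to picture the parallelization asked about above is to treat each (stream, variable) pair as an independent task and hand the tasks to a worker pool. The sketch below uses Python's standard `concurrent.futures` module. `check_variable` is a hypothetical placeholder for the real comparison, and the per-stream variable lists are illustrative only; none of this is taken from the notebooks.

```python
from concurrent.futures import ThreadPoolExecutor

# Streams named in this issue; the variable lists are illustrative only.
VARS_BY_STREAM = {
    "pop.h": ["TEMP", "SALT"],
    "pop.h.nday1": ["SST"],
    "cice.h": ["aice"],
}


def check_variable(stream, var):
    """Hypothetical placeholder: return True if history and time series match."""
    # A real check would open both sets of files and compare the data.
    return True


def check_all(vars_by_stream, check=check_variable, max_workers=4):
    """Run one check per (stream, variable) pair and collect the results."""
    tasks = [(s, v) for s, vs in vars_by_stream.items() for v in vs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves task order, so zip pairs each task with its result
        results = pool.map(lambda task: check(*task), tasks)
        return dict(zip(tasks, results))


if __name__ == "__main__":
    for (stream, var), ok in check_all(VARS_BY_STREAM).items():
        print(f"{stream}:{var} -> {'same' if ok else 'DIFFERENT'}")
```

Threads are used here only to keep the sketch self-contained; for comparisons that are heavy on I/O and memory, a dask cluster like the one mentioned in this thread may be the better fit.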

mnlevy1981 commented 4 years ago

As of d604c92 in #29, I no longer run da.identical() to compare data; instead, I verify that time series files exist for every variable in the CESM history files. This is done for all five streams: pop.h, pop.h.nday1, pop.h.nyear1, cice.h, and cice.h1.
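A minimal sketch of what such an existence check could look like is below. It is not the code from d604c92: `missing_timeseries` and its `find_files` argument are hypothetical names, and `find_files(var)` is assumed to return a list of paths for that variable (possibly empty), much like the list that `case.get_timeseries_files(year, stream, var)` hands to `xr.open_mfdataset` elsewhere in this thread.

```python
import os


def missing_timeseries(variables, find_files):
    """Return the variables that have no complete set of time series files.

    find_files(var) is assumed to return a list of paths (possibly empty).
    """
    missing = []
    for var in variables:
        paths = find_files(var)
        # Flag the variable if no files were found or any listed path is absent
        if not paths or not all(os.path.exists(p) for p in paths):
            missing.append(var)
    return missing
```

With the real case object, the call might look like `missing_timeseries(ds_hist.data_vars, lambda v: case.get_timeseries_files(year, stream, v))`, assuming that method returns a plain list of filenames.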

I tried running

import xarray as xr

# `case`, `year`, and `stream` come from the surrounding notebook
open_mfdataset_kwargs = dict(
    data_vars="minimal", compat="override", coords="minimal", parallel=True
)

history_filenames = case.get_history_files(year, stream)
ds_hist = xr.open_mfdataset(history_filenames, **open_mfdataset_kwargs)
# To check every time-dependent variable instead of just TEMP:
# vars_to_check = [var for var in ds_hist.data_vars if "time" in ds_hist[var].coords and var != "time_bound"]
vars_to_check = ["TEMP"]
for var in vars_to_check:
    timeseries_filenames = case.get_timeseries_files(year, stream, var)
    ds_ts = xr.open_mfdataset(timeseries_filenames, **open_mfdataset_kwargs)
    # Limiting the comparison to a single level works fine:
    # da_hist = ds_hist[var].isel(z_t=0)
    # da_ts = ds_ts[var].isel(z_t=0)
    # Comparing the full 3D field blows memory, even with dask (cluster.scale(12))
    da_hist = ds_hist[var]
    da_ts = ds_ts[var]
    if da_hist.identical(da_ts):
        print(f"{var} is the same in history and time series")
    else:
        print(f"{var} is DIFFERENT in history and time series")

and, as the inline comments indicate, the full 3D comparison ran out of memory even with cluster.scale(12), whereas comparing a single level worked fine in both serial and parallel. In fact, the single-level comparison ran modestly faster in parallel (16.4 s wall time versus 25.1 s in serial):

with isel(z_t=0)
----
Parallel, cluster.scale(n=8):
CPU times: user 4.28 s, sys: 92.3 ms, total: 4.38 s
Wall time: 16.4 s

Serial:
CPU times: user 19.7 s, sys: 3.17 s, total: 22.9 s
Wall time: 25.1 s
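Since the single-level comparison fit in memory while the full 3D comparison did not, one possible workaround is to loop over the vertical levels and compare one slab at a time. The sketch below illustrates the idea with plain NumPy arrays standing in for the xarray objects; `identical_by_level` is a hypothetical helper, not something in the notebooks. With xarray, the same loop would use `.isel(z_t=k)` on both DataArrays, keeping in mind that `identical()` also compares names and attributes, which this array-only version does not.

```python
import numpy as np


def identical_by_level(hist, ts, axis=0):
    """Compare two arrays one slab at a time along `axis`.

    Returns (True, None) if every slab matches, (False, k) for the first
    mismatching index k, or (False, None) if shapes or dtypes differ.
    """
    if hist.shape != ts.shape or hist.dtype != ts.dtype:
        return False, None
    for k in range(hist.shape[axis]):
        a = np.take(hist, k, axis=axis)
        b = np.take(ts, k, axis=axis)
        # equal_nan=True treats NaNs in the same position as matching
        if not np.array_equal(a, b, equal_nan=True):
            return False, k
    return True, None
```

Only one level from each array is held for comparison at a time, so peak memory scales with the size of a single horizontal slab instead of the full 3D field. Stopping at the first mismatch also avoids reading the remaining levels.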