Open mnlevy1981 opened 4 years ago
As of d604c92 in #29 I am no longer running da.identical() to compare data, but I am verifying that time series files for every variable in the CESM history files exist. This is done for all five streams: pop.h, pop.h.nday1, pop.h.nyear1, cice.h, and cice.h1.
I tried running
history_filenames = case.get_history_files(year, stream)
# open_mfdataset_kwargs: data_vars="minimal", compat="override", coords="minimal", parallel=True
ds_hist = xr.open_mfdataset(history_filenames, **open_mfdataset_kwargs)
# vars_to_check = [var for var in ds_hist.data_vars if "time" in ds_hist[var].coords and var != "time_bound"]
vars_to_check = ["TEMP"]
for var in vars_to_check:
    timeseries_filenames = case.get_timeseries_files(year, stream, var)
    ds_ts = xr.open_mfdataset(timeseries_filenames, **open_mfdataset_kwargs)
    # limiting comparison to single level works fine
    # da_hist = ds_hist[var].isel(z_t=0)
    # da_ts = ds_ts[var].isel(z_t=0)
    # comparing full 3D field blows memory, even with dask (cluster.scale(12))
    da_hist = ds_hist[var]
    da_ts = ds_ts[var]
    if da_hist.identical(da_ts):
        print(f"{var} is the same in history and time series")
    else:
        print(f"{var} is DIFFERENT in history and time series")
and, as the inline comments indicate, was blowing memory even with cluster.scale(12), while comparing a single level was fine in serial or parallel. In fact, I saw modest performance gains from running in parallel:
with isel(z_t=0)
----
Parallel, cluster.scale(n=8):
CPU times: user 4.28 s, sys: 92.3 ms, total: 4.38 s
Wall time: 16.4 s
Serial:
CPU times: user 19.7 s, sys: 3.17 s, total: 22.9 s
Wall time: 25.1 s
These notebooks are designed to verify that the time series files we generate are bit-for-bit identical with the history files produced by the model. Right now, the notebooks rely on diag_metadata.yaml to determine which variables are compared, which means only a subset of the variables in pop.h are checked, and the other streams (pop.h.nday1, pop.h.nyear1, cice.h, cice.h1) are not checked at all.
Perhaps a smart parallelization technique would make it feasible to check all variables across all streams?
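Since the single-level comparison fit in memory while the full 3D comparison did not, one possible approach is to compare one level at a time and stop at the first mismatch. The following is only a minimal sketch of that idea, not what the notebooks do: identical_by_level is a hypothetical helper, the default dimension name z_t is taken from the snippet above, and small synthetic arrays stand in for the CESM history and time series data.

```python
import numpy as np
import xarray as xr


def identical_by_level(da_hist, da_ts, dim="z_t"):
    """Compare two DataArrays one slab at a time along `dim`.

    Returns (True, None) if every slab is identical; otherwise
    (False, k), where k is the first index along `dim` that differs
    (k is None when the mismatch is in dims/shape or the field has no `dim`).
    """
    # Cheap structural check before touching any data
    if da_hist.dims != da_ts.dims or da_hist.shape != da_ts.shape:
        return False, None
    if dim not in da_hist.dims:
        # Field has no vertical dimension: compare it in one go
        return bool(da_hist.identical(da_ts)), None
    for k in range(da_hist.sizes[dim]):
        # With dask-backed arrays, only this slab needs to be computed
        if not da_hist.isel({dim: k}).identical(da_ts.isel({dim: k})):
            return False, k
    return True, None


# Synthetic stand-ins for a history field and its time series copy
rng = np.random.default_rng(0)
data = rng.random((3, 4, 5))
dims = ("z_t", "nlat", "nlon")
da_hist = xr.DataArray(data, dims=dims, name="TEMP")
da_ts_same = xr.DataArray(data.copy(), dims=dims, name="TEMP")

# Perturb a single value on level 2 to simulate a mismatch
changed = data.copy()
changed[2, 0, 0] += 1.0
da_ts_diff = xr.DataArray(changed, dims=dims, name="TEMP")

result_same = identical_by_level(da_hist, da_ts_same)
result_diff = identical_by_level(da_hist, da_ts_diff)
print(result_same, result_diff)
```

Because each iteration only touches one level, peak memory should be closer to the isel(z_t=0) case than to the full 3D comparison. Whether this is fast enough to cover every variable in all five streams would still need to be timed on real data.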