pernak18 / g-point-reduction

Jupyter Notebook evolution of RRTMGP g-point reduction (AKA k-distribution optimization) that started with Menno's [k-distribution-opt](https://github.com/MennoVeerman/k-distribution-opt) repo

Diagnostic Output Bug #12

Closed: pernak18 closed this issue 3 years ago

pernak18 commented 3 years ago

When working in the LW, FL says:

I got this error at iteration 150 (the run was doing 8 iterations, from 144 to 152, and had gotten through 149 OK):

```
--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-0c12c3ce6fa2> in <module>
     87     coObj.findOptimal()
     88     if coObj.optimized: break
---> 89     if DIAGNOSTICS: coObj.costDiagnostics()
     90     coObj.setupNextIter()
     91     with open(pickleCost, 'wb') as fp: pickle.dump(coObj, fp)
/global/u1/e/emlawer/emlawer-g-point-reduction/by_band_lib.py in costDiagnostics(self)
   1168 
   1169         outDS['trial_total_cost'] = \
-> 1170             xa.DataArray(self.totalCost, dims=('trial'))
   1171         outNC = '{}/cost_components_iter{:03d}.nc'.format(
   1172             diagDir, self.iCombine)
~/.local/cori/3.8-anaconda-2020.11/lib/python3.8/site-packages/xarray/core/dataset.py in __setitem__(self, key, value)
   1377             )
   1378 
-> 1379         self.update({key: value})
   1380 
   1381     def __delitem__(self, key: Hashable) -> None:
~/.local/cori/3.8-anaconda-2020.11/lib/python3.8/site-packages/xarray/core/dataset.py in update(self, other, inplace)
   3785         """
   3786         _check_inplace(inplace)
-> 3787         merge_result = dataset_update_method(self, other)
   3788         return self._replace(inplace=True, **merge_result._asdict())
   3789 
~/.local/cori/3.8-anaconda-2020.11/lib/python3.8/site-packages/xarray/core/merge.py in dataset_update_method(dataset, other)
    935         priority_arg=1,
    936         indexes=dataset.indexes,
--> 937         combine_attrs="override",
    938     )
~/.local/cori/3.8-anaconda-2020.11/lib/python3.8/site-packages/xarray/core/merge.py in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value)
    590     coerced = coerce_pandas_values(objects)
    591     aligned = deep_align(
--> 592         coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
    593     )
    594     collected = collect_variables_and_indexes(aligned)
~/.local/cori/3.8-anaconda-2020.11/lib/python3.8/site-packages/xarray/core/alignment.py in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
    425         indexes=indexes,
    426         exclude=exclude,
--> 427         fill_value=fill_value,
    428     )
    429 
~/.local/cori/3.8-anaconda-2020.11/lib/python3.8/site-packages/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
    341                     "arguments without labels along dimension %r cannot be "
    342                     "aligned because they have different dimension sizes: %r"
--> 343                     % (dim, sizes)
    344                 )
    345 
ValueError: arguments without labels along dimension 'trial' cannot be aligned because they have different dimension sizes: {91, 92}
```
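The error can be reproduced in isolation: assigning a `DataArray` to a `Dataset` fails when the new variable's size along an unlabeled dimension disagrees with a variable already in the dataset. A minimal sketch (variable names invented for illustration, not taken from `by_band_lib.py`):

```python
import numpy as np
import xarray as xa

# Dataset whose existing per-trial variable spans 92 trials.
ds = xa.Dataset({"dCost_flux": (("trial",), np.zeros(92))})

# Assigning a 91-element array along the same unlabeled 'trial'
# dimension triggers the alignment ValueError seen in the traceback.
error_message = ""
try:
    ds["trial_total_cost"] = xa.DataArray(np.zeros(91), dims=("trial",))
except ValueError as err:
    error_message = str(err)

print(error_message)
```

Because `trial` carries no coordinate labels, xarray cannot align the {91, 92} sizes and raises instead of broadcasting or filling.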

The way the code in the notebook works is: cost computation, optimization determination, diagnostics, write the pickle file for the iteration, then write the flux and reduced k-distribution. In this case, the cost and optimization were done for iteration 149, but the failure is in the diagnostics, so no diagnostic, flux, or k-distribution netCDFs were written.

pernak18 commented 3 years ago

In the output diagnostics netCDF, the arrays with the trial dimension are dCost_* and trial_total_cost, with a dCost_* array existing for every component of the cost function. At least one of these arrays must not have had all of the trials appended to it.
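One quick way to confirm this hypothesis is to compare per-trial lengths before building the dataset. A sketch with a hypothetical helper and mock data shaped like the failure (the array and component names here are invented, not the actual attributes):

```python
def find_short_arrays(cost_arrays):
    """Return names of per-trial arrays that are shorter than the longest one."""
    sizes = {name: len(vals) for name, vals in cost_arrays.items()}
    n_max = max(sizes.values())
    return sorted(name for name, n in sizes.items() if n < n_max)

# Mock data mirroring the {91, 92} mismatch from the traceback:
components = {
    "trial_total_cost": [0.0] * 91,
    "dCost_flux_net": [0.0] * 92,
    "dCost_heating_rate": [0.0] * 92,
}
print(find_short_arrays(components))
```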

pernak18 commented 3 years ago

This looks to be a problem for all subsequent iterations, but only in the diagnostics. Using Eli's data, I can try to reproduce the bug:

```
% pwd
/global/u1/p/pernak18/RRTMGP/g-point-reduction
% rm -rf xsecs-test/ workdir_band_*
% for WD in `ls -d ~emlawer/emlawer-g-point-reduction/workdir_band_*/`; do ln -s $WD; done
% cp $SCRATCH/RRTMGP/LW_cost-optimize-iter148.pickle LW_cost-optimize.pickle 
% ln -s ~emlawer/emlawer-g-point-reduction/fullCF_top-layer/
```

Then run the notebook with FL's LW cost function, with DIAGNOSTICS = True and NITER = 149, and the error can be reproduced.

pernak18 commented 3 years ago

Fixed with c823b550d14004fab0e59fcbb8b21a5429b1e32d.

If we reach full reduction in a given band (nGpt = 1), we pop the associated trial out of the cost lists and re-evaluate the optimization (otherwise nothing would happen and we would end up in an infinite loop). The bug was that we popped the trial out of totalCost but not out of the cost-component and delta-cost-component arrays, so the trial dimension became inconsistent.
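The shape of the fix is to remove the trial from every per-trial container in one place, so no array can fall out of sync. A minimal sketch with illustrative names (not the actual by_band_lib attributes):

```python
def pop_trial(i_trial, total_cost, cost_comps, d_cost_comps):
    """Remove trial i_trial from *every* per-trial container, not just
    the total cost, so the 'trial' dimension stays consistent."""
    total_cost.pop(i_trial)
    for comp in cost_comps.values():
        comp.pop(i_trial)
    for comp in d_cost_comps.values():
        comp.pop(i_trial)

# Toy data: three trials, one cost component with its delta.
total = [10.0, 20.0, 30.0]
comps = {"flux_net": [1.0, 2.0, 3.0]}
d_comps = {"flux_net": [0.1, 0.2, 0.3]}

pop_trial(1, total, comps, d_comps)
print(total, comps, d_comps)
```

Centralizing the removal this way makes the invariant (all per-trial arrays have the same length) hard to violate from any single call site.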

This is the first time we have gotten to the point of full-band reduction, so I would not be surprised if other similar bugs manifest.