markvdw / gpflow-monitor

Methods to help with logging GPflow optimisation.
Apache License 2.0

Usage of hist path and restore path #7

Open imsrgadich opened 6 years ago

imsrgadich commented 6 years ago
class StoreSession(Task):
    def __init__(self, sequence, trigger: Trigger, session: tf.Session, hist_path, saver=None,
                 restore_path=None):

I am trying to understand how hist_path and restore_path work. The first time I run the notebook it works, but the second time it fails with a checkpoint error. Can you please explain what is happening here? Thank you.

Restoring session from `../results/test/checkpoint-8000`.
2018-04-16 13:52:47.266559: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key RBF-f918534f-0/lengthscales/unconstrained/Adam not found in checkpoint
2018-04-16 13:52:47.266638: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key Gaussian-337de6c9-3/variance/unconstrained/Adam_1 not found in checkpoint
2018-04-16 13:52:47.266696: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key Gaussian-337de6c9-3/variance/unconstrained not found in checkpoint
2018-04-16 13:52:47.266818: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key RBF-f918534f-0/variance/unconstrained not found in checkpoint
2018-04-16 13:52:47.266883: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key RBF-f918534f-0/lengthscales/unconstrained not found in checkpoint
2018-04-16 13:52:47.267309: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key RBF-f918534f-0/lengthscales/unconstrained/Adam_1 not found in checkpoint
2018-04-16 13:52:47.267402: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key Gaussian-337de6c9-3/variance/unconstrained/Adam not found in checkpoint
2018-04-16 13:52:47.268096: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key RBF-f918534f-0/variance/unconstrained/Adam not found in checkpoint
2018-04-16 13:52:47.268142: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key RBF-f918534f-0/variance/unconstrained/Adam_1 not found in checkpoint
2018-04-16 13:52:47.268284: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key SVGP-af68c594-7/feature/Z/unconstrained/Adam_1 not found in checkpoint
2018-04-16 13:52:47.268383: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key SVGP-af68c594-7/feature/Z/unconstrained not found in checkpoint
2018-04-16 13:52:47.268432: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key SVGP-af68c594-7/feature/Z/unconstrained/Adam not found in checkpoint
2018-04-16 13:52:47.268922: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key SVGP-af68c594-7/q_mu/unconstrained not found in checkpoint
2018-04-16 13:52:47.269221: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key SVGP-af68c594-7/q_mu/unconstrained/Adam not found in checkpoint
2018-04-16 13:52:47.269716: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key SVGP-af68c594-7/q_sqrt/unconstrained/Adam_1 not found in checkpoint
2018-04-16 13:52:47.269765: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key SVGP-af68c594-7/q_sqrt/unconstrained/Adam not found in checkpoint
2018-04-16 13:52:47.269804: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key SVGP-af68c594-7/q_sqrt/unconstrained not found in checkpoint
2018-04-16 13:52:47.270633: W tensorflow/core/framework/op_kernel.cc:1192] Not found: Key SVGP-af68c594-7/q_mu/unconstrained/Adam_1 not found in checkpoint
Traceback (most recent call last):
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Key RBF-f918534f-0/lengthscales/unconstrained/Adam not found in checkpoint
     [[Node: save/RestoreV2_4 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_4/tensor_names, save/RestoreV2_4/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/l/gadichs1/gitrepos/aalto/Deep_Spectral_Kernels/demo/test.py", line 51, in <module>
    StoreSession((x * 1000 for x in itertools.count()), Trigger.ITER, m.enquire_session(), "../results/test/checkpoint")
  File "/m/home/home1/18/gadichs1/data/Downloads/temp/gpflow-monitor/gpflow_monitor/opt_tools.py", line 88, in __init__
    self.saver.restore(session, restore_path)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1666, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key RBF-f918534f-0/lengthscales/unconstrained/Adam not found in checkpoint
     [[Node: save/RestoreV2_4 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_4/tensor_names, save/RestoreV2_4/shape_and_slices)]]

Caused by op 'save/RestoreV2_4', defined at:
  File "/l/gadichs1/gitrepos/aalto/Deep_Spectral_Kernels/demo/test.py", line 51, in <module>
    StoreSession((x * 1000 for x in itertools.count()), Trigger.ITER, m.enquire_session(), "../results/test/checkpoint")
  File "/m/home/home1/18/gadichs1/data/Downloads/temp/gpflow-monitor/gpflow_monitor/opt_tools.py", line 75, in __init__
    self.saver = tf.train.Saver(max_to_keep=3) if saver is None else saver
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1218, in __init__
    self.build()
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1227, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1263, in _build
    build_save=build_save, build_restore=build_restore)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 751, in _build_internal
    restore_sequentially, reshape)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 427, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 267, in restore_op
    [spec.tensor.dtype])[0])
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1021, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/l/gadichs1/conda_envs/deepgps_2_3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key RBF-f918534f-0/lengthscales/unconstrained/Adam not found in checkpoint
     [[Node: save/RestoreV2_4 = RestoreV2[dtypes=[DT_DOUBLE], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_4/tensor_names, save/RestoreV2_4/shape_and_slices)]]

Process finished with exit code 1

imsrgadich commented 6 years ago

Is it because the checkpoint file is not saved properly when the code exits with an error, so that the parameters cannot be found when the session is restored?

If I delete the previously created files in ./results/temp/ and ./results/temp/tensorboard/, it runs fine! (The reason might be obvious, but I can't figure it out.)

wil-j-wil commented 6 years ago

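@imsrgadich In case you or anyone else is still stuck on this: you need to use gpflow.defer_build() and explicitly name your model so that the new session uses the same variable names. The keys in the log above (for example RBF-f918534f-0 and SVGP-af68c594-7) contain what looks like an auto-generated identifier, so a fresh run most likely creates variables under different names than the ones stored in the checkpoint. See the "Creating a GPFlow model" section of the example in the GPflow repo: https://github.com/GPflow/GPflow/blob/master/doc/source/notebooks/monitor-tensorboard.ipynb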
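
To illustrate why the variable names matter, here is a toy sketch in plain Python. It does not use TensorFlow or GPflow: the checkpoint is modelled as a set of key strings, and `auto_name`, `variable_keys` and `missing_keys` are hypothetical helpers invented for this illustration. It assumes that an unnamed model gets a fresh random identifier on every run, which is what the keys in the log above suggest.

```python
import uuid


def auto_name(cls_name):
    # Hypothetical stand-in for an auto-generated name such as "SVGP-af68c594":
    # a new random suffix is drawn each time the model is created.
    return "{}-{}".format(cls_name, uuid.uuid4().hex[:8])


def variable_keys(model_name):
    # Toy version of the variable names a model would register in the graph.
    return {
        model_name + "/q_mu/unconstrained",
        model_name + "/q_sqrt/unconstrained",
    }


def missing_keys(graph_keys, checkpoint_keys):
    # Keys the restore step would ask for but not find in the checkpoint.
    return sorted(graph_keys - checkpoint_keys)


# Run 1 saves a checkpoint, run 2 tries to restore it, both with auto names.
saved_auto = variable_keys(auto_name("SVGP"))
graph_auto = variable_keys(auto_name("SVGP"))
auto_missing = missing_keys(graph_auto, saved_auto)

# Same two runs, but the model is given a fixed, explicit name.
saved_fixed = variable_keys("SVGP")
graph_fixed = variable_keys("SVGP")
fixed_missing = missing_keys(graph_fixed, saved_fixed)

for key in auto_missing:
    print("Key {} not found in checkpoint".format(key))
print("Missing with a fixed name:", fixed_missing)
```

With auto-generated names every key requested by the second run is absent from the first run's checkpoint, which mirrors the "Key ... not found in checkpoint" warnings. With a fixed name the two sets match and nothing is missing. This would also explain why deleting the old checkpoint directory makes the error go away: there is then nothing to restore from.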