Open philippGraf opened 3 years ago
Yes, I would also appreciate an example of the saving/loading workflow. I also often run into
NotFoundError: Unsuccessful TensorSliceReader constructor
when trying to load a checkpoint.
Hello! Did anybody manage to find a tutorial or work out what is happening, and would you be interested in sharing? I ask because checkpoints don't work very well for me either, and I don't know how to use them :)
Hello!
A tutorial on the proper setup of experiments, saving, logging and loading would be much appreciated! I run into problems when restoring checkpoints.
Currently I am using the following setup:
# absl provides `app` and `flags`, which the code below uses.
from absl import app
from absl import flags
import acme
from acme import wrappers
from acme.agents.tf import dqn
import tensorflow as tf
from acme import specs
import sonnet as snt
from acme.testing import fakes
import numpy as np
import acme.tf.networks as networks
import acme.agents.tf.r2d2 as r2d2
flags.DEFINE_integer('n_episodes', 1000, 'number of games')
FLAGS = flags.FLAGS
def main(_):
if __name__ == '__main__':
    app.run(main)
cd ~/acme/fun
python main.py -acme_id=fun
python main.py -acme_id=rnn-buffer
2021-03-22 15:17:13.564136: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0322 15:17:19.799459 139818441774912 csv.py:45] Logging to learner/rnn-buffer/logs/logs.csv
2021-03-22 15:17:19.802444: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-22 15:17:19.803318: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-03-22 15:17:19.845075: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-03-22 15:17:19.845119: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (philipp-HP-ZBook-x2-G4): /proc/driver/nvidia/version does not exist
2021-03-22 15:17:19.845558: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-22 15:17:19.846011: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[reverb/cc/platform/tfrecord_checkpointer.cc:144] Initializing TFRecordCheckpointer in /tmp/tmpovile0aq
[reverb/cc/platform/tfrecord_checkpointer.cc:338] Loading latest checkpoint from /tmp/tmpovile0aq
[reverb/cc/platform/default/server.cc:55] Started replay server on port 19581
WARNING:tensorflow:Entity <function _yield_value at 0x7f29b91c0510> appears to be a generator function. It will not be converted by AutoGraph.
W0322 15:17:20.917869 139818441774912 ag_logging.py:146] Entity <function _yield_value at 0x7f29b91c0510> appears to be a generator function. It will not be converted by AutoGraph.
2021-03-22 15:17:21.381026: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-03-22 15:17:21.399016: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1999965000 Hz
2021-03-22 15:17:21.423113: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
I0322 15:17:21.447039 139818441774912 savers.py:166] Attempting to restore checkpoint: /home/philipp/acme/rnn-buffer/checkpoints/r2d2_learner/ckpt-4
I0322 15:17:22.869575 139818441774912 csv.py:45] Logging to environment_loop/rnn-buffer/logs/logs.csv
INFO:tensorflow:Assets written to: /home/philipp/acme/rnn-buffer/snapshots/network/assets
I0322 15:17:23.409248 139818441774912 builder_impl.py:775] Assets written to: /home/philipp/acme/rnn-buffer/snapshots/network/assets
I0322 15:17:23.414844 139818441774912 savers.py:156] Saving checkpoint: /home/philipp/acme/rnn-buffer/checkpoints/r2d2_learner
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 168 | Steps = 170 | Steps Per Second = 409.001
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 556 | Steps = 561 | Steps Per Second = 371.802
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 956 | Steps = 961 | Steps Per Second = 438.964
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 1387 | Steps = 1392 | Steps Per Second = 419.263
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 1814 | Steps = 1824 | Steps Per Second = 396.063
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 2244 | Steps = 2256 | Steps Per Second = 472.704
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 2675 | Steps = 2691 | Steps Per Second = 428.340
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 3100 | Steps = 3118 | Steps Per Second = 458.795
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 3530 | Steps = 3550 | Steps Per Second = 453.733
[Environment Loop] Episode Length = 1 | Episode Return = 0.0 | Episodes = 3942 | Steps = 3963 | Steps Per Second = 477.548
2021-03-22 15:17:36.672807: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_tensor.cc:175 : Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/philipp/acme/rnn-buffer/checkpoints/r2d2_learner/ckpt-4
Traceback (most recent call last):
File "main.py", line 119, in
app.run(main)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "main.py", line 102, in main
loop.run(num_episodes=FLAGS.n_episodes)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/environment_loop.py", line 153, in run
result = self.run_episode()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/environment_loop.py", line 101, in run_episode
self._actor.update()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/agents/tf/r2d2/agent.py", line 148, in update
super().update()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/agents/agent.py", line 87, in update
self._learner.step()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/acme/agents/tf/r2d2/learning.py", line 205, in step
results = self._step()
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 871, in _call
self._initialize(args, kwds, add_initializers_to=initializers)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
*args, **kwds))
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3887, in bound_method_wrapper
return wrapped_fn(*args, **kwargs)
File "/home/philipp/acme/venv/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.NotFoundError: in user code:
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.1
W0322 15:17:36.994191 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.1
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.2
W0322 15:17:36.994407 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.2
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.3
W0322 15:17:36.994531 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.3
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.4
W0322 15:17:36.994619 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.4
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.5
W0322 15:17:36.994699 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.5
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.6
W0322 15:17:36.994777 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.6
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.7
W0322 15:17:36.994873 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.7
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.m.8
W0322 15:17:36.994951 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.m.8
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.0
W0322 15:17:36.995054 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.0
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.1
W0322 15:17:36.995142 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.1
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.2
W0322 15:17:36.995218 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.2
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.3
W0322 15:17:36.995293 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.3
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.4
W0322 15:17:36.995398 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.4
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.5
W0322 15:17:36.995479 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.5
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.6
W0322 15:17:36.995564 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.6
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.7
W0322 15:17:36.995656 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.7
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.v.8
W0322 15:17:36.995730 139818441774912 util.py:161] Unresolved object in checkpoint: (root).optimizer.v.8
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0322 15:17:36.995830 139818441774912 util.py:169] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
[reverb/cc/platform/default/server.cc:64] Shutting down replay server
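For anyone hitting the same "Failed to find any matching files for .../ckpt-4" message: a quick way to narrow it down is to check whether the files for that checkpoint prefix are still on disk when the error fires. A TF2 checkpoint prefix such as `ckpt-4` is not a file itself; it is stored as `ckpt-4.index` plus one or more `ckpt-4.data-XXXXX-of-YYYYY` shards. The sketch below uses only the standard library, and the helper names `checkpoint_files` and `checkpoint_is_complete` are made up for illustration; they are not part of Acme or TensorFlow. One possible, unconfirmed reading of the log above is that the restore of `ckpt-4` was only completed lazily at the first learner step, after a newer checkpoint had already been saved to the same directory, but the log alone does not prove this.

```python
import glob
import os
import tempfile


def checkpoint_files(prefix):
    """Return the on-disk files belonging to a TF2 checkpoint prefix.

    Hypothetical helper for debugging; it only looks for `<prefix>.index`
    and `<prefix>.data-*`, which is how TF2 lays out a checkpoint on disk.
    """
    escaped = glob.escape(prefix)  # guard against glob characters in the path
    index = glob.glob(escaped + ".index")
    shards = glob.glob(escaped + ".data-*")
    return sorted(index + shards)


def checkpoint_is_complete(prefix):
    """True if the index file and at least one data shard both exist."""
    files = checkpoint_files(prefix)
    has_index = any(f.endswith(".index") for f in files)
    has_data = any(".data-" in os.path.basename(f) for f in files)
    return has_index and has_data


if __name__ == "__main__":
    # Demo on a throwaway directory rather than a real checkpoint directory.
    with tempfile.TemporaryDirectory() as d:
        prefix = os.path.join(d, "ckpt-4")
        print(checkpoint_is_complete(prefix))  # nothing written yet
        for suffix in (".index", ".data-00000-of-00001"):
            open(prefix + suffix, "w").close()
        print(checkpoint_is_complete(prefix))  # both pieces now present
```

Calling `checkpoint_files` on the prefix from the log (the path printed after "Attempting to restore checkpoint:") right before the learner's first step would show whether the files were removed or replaced between the restore attempt and the failure.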