google / dopamine

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.
https://github.com/google/dopamine
Apache License 2.0
10.42k stars 1.36k forks

Visualizing Error - Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator? #165

Closed. RylanSchaeffer closed this issue 3 years ago

RylanSchaeffer commented 3 years ago

I'm trying to visualize a Dopamine C51 TF agent I trained. Following the tutorial, I'm trying:

# @title Generate the video
from dopamine.utils import example_viz_lib
num_steps = 1000  # @param {type:"number"}
example_viz_lib.run(agent='rainbow', game='SpaceInvaders', num_steps=num_steps,
                    root_dir='/tmp/agent_viz', restore_ckpt='/tmp/tf_ckpt-199',
                    use_legacy_checkpoint=True)

I pass restore_ckpt=/home/rylan/Documents/PehlevanLab-Dopamine/tmp/dopamine_run/c51/Pong/1/checkpoints/ckpt.199 and I get the following error:

2021-01-06 16:05:31.710619: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /home/rylan/Documents/PehlevanLab-Dopamine/tmp/dopamine_run/c51/Pong/1/checkpoints/ckpt.199: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
2021-01-06 16:05:31.710669: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open /home/rylan/Documents/PehlevanLab-Dopamine/tmp/dopamine_run/c51/Pong/1/checkpoints/ckpt.199: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
2021-01-06 16:05:31.710684: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_tensor.cc:175 : Data loss: Unable to open table file /home/rylan/Documents/PehlevanLab-Dopamine/tmp/dopamine_run/c51/Pong/1/checkpoints/ckpt.199: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine_venv/lib/python3.6/site-packages/gin/config.py", line 1078, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine_venv/lib/python3.6/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine_venv/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine/discrete_domains/run_experiment.py", line 228, in __init__
    self._initialize_checkpointer_and_maybe_resume(checkpoint_file_prefix)
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine/utils/example_viz_lib.py", line 150, in _initialize_checkpointer_and_maybe_resume
    self._use_legacy_checkpoint)
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine/utils/example_viz_lib.py", line 126, in reload_checkpoint
    reloader.restore(self._sess, checkpoint_path)
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine_venv/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1298, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/home/rylan/Documents/PehlevanLab-Dopamine/dopamine_venv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /home/rylan/Documents/PehlevanLab-Dopamine/tmp/dopamine_run/c51/Pong/1/checkpoints/ckpt.199: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
     [[node save_1/RestoreV2 (defined at home/rylan/Documents/PehlevanLab-Dopamine/dopamine/utils/example_viz_lib.py:125) ]]

@psc-g do you know what causes this error? I've tried both use_legacy_checkpoint=True and use_legacy_checkpoint=False, and I get the same error either way.

RylanSchaeffer commented 3 years ago

OK, I figured out what the issue is. For anyone else who hits this, suppose your checkpoint files are:

<some dir>/checkpoints/tf_ckpt-199.data-00000-of-00001
<some dir>/checkpoints/tf_ckpt-199.index
<some dir>/checkpoints/tf_ckpt-199.meta

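Then the restore_ckpt argument should be the prefix those three files share, <some dir>/checkpoints/tf_ckpt-199, with no file extension. In other words, pass tf_ckpt-199 (with a hyphen), not ckpt.199 (with a dot), and ignore the ckpt.199 file in the same directory: it is not a TensorFlow checkpoint, which is why the restore fails with "not an sstable (bad magic number)".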
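If it helps, here is a minimal sketch of how that prefix could be derived from the checkpoints directory instead of typed by hand. The helper name tf_checkpoint_prefix is hypothetical and not part of Dopamine; it only assumes the tf_ckpt-<iteration>.index naming shown above.

```python
import glob
import os
import re


def tf_checkpoint_prefix(checkpoint_dir, iteration=None):
    """Return the extension-free TF checkpoint prefix found in checkpoint_dir.

    Hypothetical helper (not part of Dopamine). It assumes the files are
    named tf_ckpt-<iteration>.index, as in the listing above. If iteration
    is None, the highest iteration present is used.
    """
    pattern = re.compile(r"^tf_ckpt-(\d+)\.index$")
    found = {}
    for path in glob.glob(os.path.join(checkpoint_dir, "tf_ckpt-*.index")):
        match = pattern.match(os.path.basename(path))
        if match:
            # Strip ".index" so only the shared prefix remains.
            found[int(match.group(1))] = path[: -len(".index")]
    if not found:
        raise FileNotFoundError(
            "no tf_ckpt-*.index files in {}".format(checkpoint_dir))
    if iteration is None:
        iteration = max(found)
    if iteration not in found:
        raise FileNotFoundError(
            "no tf_ckpt-{}.index in {}".format(iteration, checkpoint_dir))
    return found[iteration]
```

The returned string is what would go in restore_ckpt when calling example_viz_lib.run. Files such as ckpt.199 never match the pattern, so they are skipped.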