tensorflow / models

Models and examples built with TensorFlow

Restoring from checkpoint failed #8638

Open ouisyasser opened 4 years ago

ouisyasser commented 4 years ago

When I restore the checkpoint:


with tf.Session() as sess:
    saver = tf.train.import_meta_graph('/content/drive/My Drive/cifar-10/model/model.ckpt-199.meta')
    saver.restore(sess, "/content/drive/My Drive/cifar-10/model/model.ckpt-199")


InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Cannot assign a device for operation model/grad_norm: Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=-1 requested_device_name_='/device:GPU:0' assigned_device_name_='' resource_device_name_='' supported_device_types_=[CPU] possible_devices_=[]
ScalarSummary: CPU

Colocation members, user-requested devices, and framework assigned devices, if any:
  model/grad_norm (ScalarSummary) /device:GPU:0

Op: ScalarSummary
Node attrs: T=DT_FLOAT
Registered kernels:
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_BFLOAT16]
  device='CPU'; T in [DT_HALF]
  device='CPU'; T in [DT_INT8]
  device='CPU'; T in [DT_UINT8]
  device='CPU'; T in [DT_INT16]
  device='CPU'; T in [DT_UINT16]
  device='CPU'; T in [DT_INT32]
  device='CPU'; T in [DT_INT64]

 [[node model/grad_norm (defined at /tensorflow-1.15.2/python2.7/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for u'model/grad_norm':
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 456, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 486, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 438, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python2.7/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 2, in <module>
    saver = tf.train.import_meta_graph('/content/drive/My Drive/cifar-10/model/model.ckpt-198.meta')
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/training/saver.py", line 1453, in import_meta_graph
    **kwargs)[0]
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/training/saver.py", line 1477, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/framework/meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/framework/importer.py", line 517, in _import_graph_def_internal
    _ProcessNewOps(graph)
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/framework/importer.py", line 243, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/framework/ops.py", line 3561, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/framework/ops.py", line 3451, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/tensorflow-1.15.2/python2.7/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

kyscg commented 4 years ago

Try pointing to a different model directory from the one where your old checkpoints are saved. You can also try clearing out any checkpoint files left over from training other models. Please close the issue if this solves your problem.

ouisyasser commented 4 years ago

@kyscg The directory is where my training checkpoints are, and I have not trained any other model!

kyscg commented 4 years ago

Try using sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)). This should resolve the problem if TensorFlow couldn't place an operation on the GPU, since some operations have only a CPU implementation.

Using allow_soft_placement=True will allow TensorFlow to fall back to CPU when no GPU implementation is available.
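
To make the suggestion concrete, here is a minimal sketch of restoring the checkpoint with soft placement enabled. The helper names (soft_placement_options, restore_checkpoint) are hypothetical, and the clear_devices option is an extra assumption not discussed in this thread: it is a documented argument of tf.train.import_meta_graph that strips the device assignments stored in the meta graph. The sketch uses tensorflow.compat.v1, which should be available in TensorFlow 1.15, and has not been run against the original checkpoint.

```python
def soft_placement_options(log_placement=True):
    """Return the ConfigProto keyword arguments suggested in this thread."""
    return {
        # Fall back to CPU when an op (e.g. ScalarSummary) has no GPU kernel.
        "allow_soft_placement": True,
        # Log which device each op is placed on, to help with debugging.
        "log_device_placement": log_placement,
    }


def restore_checkpoint(meta_path, checkpoint_prefix, clear_devices=False):
    """Import a TF 1.x meta graph and restore its variables into a new session.

    Hypothetical helper: returns (session, graph). The caller is responsible
    for closing the session.
    """
    # Imported lazily so soft_placement_options() works without TensorFlow.
    import tensorflow.compat.v1 as tf

    graph = tf.Graph()
    with graph.as_default():
        # clear_devices=True drops the '/device:GPU:0' pins saved in the graph
        # (an assumption beyond the thread; leave False to test soft placement alone).
        saver = tf.train.import_meta_graph(meta_path, clear_devices=clear_devices)
        config = tf.ConfigProto(**soft_placement_options())
        sess = tf.Session(graph=graph, config=config)
        saver.restore(sess, checkpoint_prefix)
    return sess, graph
```

With the paths from the original post, the call would look like restore_checkpoint('/content/drive/My Drive/cifar-10/model/model.ckpt-199.meta', '/content/drive/My Drive/cifar-10/model/model.ckpt-199'). If soft placement alone does not help, passing clear_devices=True may be worth a try.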

Also, could you share some code to reproduce the issue? At this point I'm only guessing about whether int32 support is missing. Thank you.

ouisyasser commented 4 years ago

Thank you for the help. Can you explain how to run predictions with the saver? The code is at https://colab.research.google.com/drive/1TLitkT-LghpLsZ4EJX9PRS1Q7bXOq_8I?usp=sharing

csingh27 commented 4 years ago

I am still struggling with this error. Is there a working solution?

nietzsche9088 commented 2 years ago

Me too. Did you fix it?