tobegit3hub / tensorflow_template_application

TensorFlow template application for deep learning
Apache License 2.0

InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_8' #11

Open · anseey opened this issue 7 years ago

anseey commented 7 years ago

distributed/cancer_classifier.py only works when everything runs in a single Docker container.

It works in one container:

# both in 127.17.0.3
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.3:8223 --job_name=ps --task_index=0
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.3:8223 --job_name=worker --task_index=0
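For context, these flags typically become a ClusterSpec and a gRPC server inside such template scripts. A minimal sketch of the usual pattern, with flag handling simplified (this is an assumption about the structure, not necessarily the exact code in cancer_classifier.py):

import tensorflow as tf

ps_hosts = "127.17.0.3:8222".split(",")      # from --ps_hosts
worker_hosts = "127.17.0.3:8223".split(",")  # from --worker_hosts
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name="worker", task_index=0)
# A ps process would instead block and serve variables: server.join()

Both processes must build the identical cluster spec; only --job_name and --task_index differ between them.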

But it does not work across two containers:

# ps in 127.17.0.3
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.4:8223 --job_name=ps --task_index=0
# worker in 127.17.0.4
python cancer_classifier.py --ps_hosts=127.17.0.3:8222 --worker_hosts=127.17.0.4:8223 --job_name=worker --task_index=0

The error message I get on the worker:

I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> 127.17.0.3:8222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8222}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:8222
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py:344 in __init__.: __init__ (from tensorflow.python.training.summary_io) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.FileWriter. The interface and behavior is the same; this is just a rename.
I tensorflow/core/distributed_runtime/master_session.cc:993] Start master session 91acfc1008531f4d with config:

Traceback (most recent call last):
  File "cancer_classifier_new.py", line 241, in <module>
    tf.app.run(main=main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "cancer_classifier_new.py", line 209, in main
    with sv.managed_session(server.target) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 974, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 802, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 963, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 720, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 227, in prepare_session
    config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 173, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1388, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device to node 'save/RestoreV2_8': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:worker/replica:0/task:0/cpu:0
     [[Node: save/RestoreV2_8 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_8/tensor_names, save/RestoreV2_8/shape_and_slices)]]

Caused by op u'save/RestoreV2_8', defined at:
  File "cancer_classifier_new.py", line 241, in <module>
    tf.app.run(main=main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "cancer_classifier_new.py", line 191, in main
    saver = tf.train.Saver()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1000, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1030, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 624, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 200, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_8': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:worker/replica:0/task:0/cpu:0
     [[Node: save/RestoreV2_8 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_8/tensor_names, save/RestoreV2_8/shape_and_slices)]]
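For readers hitting the same error: the traceback shows the Supervisor going through _restore_checkpoint, meaning it found an existing checkpoint (possibly left over from the earlier single-container run) and tried to restore it, and the restore ops are pinned to /job:ps/task:0. That pinning is the usual effect of creating variables under tf.train.replica_device_setter; if the session cannot see any ps device, placement fails exactly as above. A minimal sketch of how the constraint arises (variable name and shape are illustrative, not taken from the script):

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["127.17.0.3:8222"],
                                "worker": ["127.17.0.4:8223"]})

# Variables built under replica_device_setter land on /job:ps/task:0,
# and the Saver's save/restore ops for them inherit that device,
# which is the explicit constraint named in the error above.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.Variable(tf.zeros([9, 2]), name="weights")

saver = tf.train.Saver()  # its RestoreV2 ops are pinned to the ps device

Note also that the log shows the worker's server starting on grpc://localhost:8222 rather than the 8223 given in --worker_hosts, so it is worth double-checking that both processes see the same cluster spec and that the containers can reach each other's ports; clearing the checkpoint directory before the two-container run avoids the restore path entirely.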
tobegit3hub commented 7 years ago

It may be a bug in this script when doing distributed training with the latest TensorFlow.

I will refactor the distributed code soon. If you want to run a distributed TensorFlow application, please try tobegit3hub/distributed_tensorflow, which is much better now.
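For anyone bridging the gap in the meantime, the between-graph replication pattern such scripts converge on generally looks like the sketch below (one ps and one worker assumed; the logdir and the trivial train_op are placeholders, not code from either repository):

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["127.17.0.3:8222"],
                                "worker": ["127.17.0.4:8223"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.Variable(0, name="global_step", trainable=False)
    train_op = tf.assign_add(global_step, 1)  # stand-in for a real step

sv = tf.train.Supervisor(is_chief=True, logdir="/tmp/train_logs",
                         global_step=global_step)
with sv.managed_session(server.target) as sess:  # the call that fails above
    while not sv.should_stop():
        step = sess.run(train_op)
        if step >= 100:
            break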

anseey commented 7 years ago

@tobegit3hub Thank you! I have tried tobegit3hub/distributed_tensorflow and it works! But it still has the problem described in https://github.com/tensorflow/tensorflow/issues/5110

tobegit3hub commented 7 years ago

Yes, that's something we're working on now.