Open Danielyijun opened 4 years ago
Could you help me to check it out? Really appreciate it. Thank you :) @sj6077
@Danielyijun How did you install TensorFlow and Horovod? Parallax requires TensorFlow built from the submodule in this repository, and Horovod is also required. You can follow the installation guide here: https://github.com/snuspl/parallax/blob/master/doc/installation.md
Yes, I followed the whole guide to install TensorFlow and Horovod. @sj6077
What OS are you using?
Ubuntu 16.04
Did you enable nGraph when configuring the TensorFlow build from source?
No, how can I do it? @sj6077
I'm not sure about the NGraphVariable in your error message, but it seems related to the nGraph prompt in the TensorFlow configure step. If you enter "y" for it, nGraph is enabled.
Oh. If I understood correctly, I should have chosen "y" for nGraph. I will reinstall TensorFlow to make sure. I'll let you know how it goes. Thank you.
May I ask whether the Bazel version matters? Do you have a recommended version?
You must not enable nGraph, so "N" or Enter is the right choice. As for Bazel, I used an old version; I think it was 0.15.
Okay, I got it, thank you. :)
Hi, I disabled nGraph when installing TensorFlow, but the issue is still there.
2020-05-14 09:24:36.503557: E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node embeddings/encoder/embedding_encoder}} = NGraphVariable[_class=["loc:@embeddings/encoder/embedding_encoder/Assign"], container="train", dtype=DT_FLOAT, just_looking=false, shape=[7709,32], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'
I think it is closely related to nGraph, as discussed here (https://github.com/NervanaSystems/ngraph-tf/pull/479). Please check your TensorFlow version. If another TensorFlow is already installed on the machine, you have to uninstall it and then install the new version.
Can you run any code without Parallax? If the error still occurs, it is not a Parallax issue.
Hi, I tried this and found that I had another TensorFlow folder with an older version, which was interfering when running TensorFlow. I deleted the folder and the problem is gone. Thank you.
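For anyone hitting the same symptom, a stale second copy of a package can be spotted before deleting anything. The following is a minimal sketch, not part of Parallax or TensorFlow: the helper names `find_package_copies` and `resolved_location` are hypothetical, and it only looks for a `tensorflow/__init__.py` directory on each `sys.path` entry, so it assumes the stray copy is a regular package folder.

```python
import importlib.util
import os
import sys


def find_package_copies(package_name, search_paths=None):
    """Return every directory on the search path that holds the package.

    Hypothetical helper: a copy is any <path entry>/<package_name>/ folder
    that contains an __init__.py file.
    """
    paths = sys.path if search_paths is None else search_paths
    copies = []
    for entry in paths:
        # An empty sys.path entry means the current working directory.
        candidate = os.path.join(entry or os.getcwd(), package_name)
        if os.path.isfile(os.path.join(candidate, "__init__.py")):
            real = os.path.realpath(candidate)
            if real not in copies:
                copies.append(real)
    return copies


def resolved_location(package_name):
    """Return the file Python would import, or None if nothing is found."""
    spec = importlib.util.find_spec(package_name)
    return spec.origin if spec else None


if __name__ == "__main__":
    print("import resolves to:", resolved_location("tensorflow"))
    for path in find_package_copies("tensorflow"):
        print("found copy:", path)
```

If more than one copy is listed, the one reported as "import resolves to" is the one that shadows the others in that environment.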
I can run nmt.py now, but another issue remains. When I run the distributed driver for nmt (with the worker and ps on the same node), it gets stuck on this:
2020-05-14 20:36:55.357505: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:36:56.272566: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:37:05.357752: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:37:06.272764: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:37:15.358582: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:37:16.273010: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
It seems it cannot contact the ps. Could you help me figure it out? @sj6077
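One way to narrow down a "still waiting for response" hang is to test whether anything is listening on the ps address at all. The sketch below uses only the Python standard library; the function names are hypothetical and the address in the usage section is just the example value from this thread. A successful TCP connection only shows that some process accepts connections on that port, not that it is a healthy TensorFlow server.

```python
import socket


def parse_host_port(address):
    """Split 'host:port' into (host, int(port))."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def is_reachable(address, timeout=3.0):
    """Return True if a TCP connection to 'host:port' can be opened."""
    host, port = parse_host_port(address)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Example address taken from the ps_hosts value in this thread;
    # replace it with the ps address of your own cluster.
    for address in ["10.0.0.103:36311"]:
        state = "reachable" if is_reachable(address) else "NOT reachable"
        print(address, state)
```

If the ps address is not reachable, the ps process was probably never started or a firewall is blocking the port.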
I tried running another command in a terminal:
CUDA_VISIBLE_DEVICES='' python3 /tmp/parallax-jyi/launch_ps.py --job_name=ps --task_index=0 --protocol=grpc --ps_hosts=10.0.0.103:36311,10.0.0.108:45713 --worker_hosts=10.0.0.103:44472,10.0.0.103:45326,10.0.0.103:40853,10.0.0.103:45386,10.0.0.108:38017,10.0.0.108:45945,10.0.0.108:46772,10.0.0.108:38564
This makes the whole training start. I wonder why we need to run this extra command in addition to:
python3 nmt_distributed_driver.py --src=vi --tgt=en --hparams_path=nmt/standard_hparams/wmt16_gnmt_4_layer.json --out_dir=/tmp/deen_gnmt --vocab_prefix=/tmp/nmt_data/vocab --train_prefix=/tmp/nmt_data/train --dev_prefix=/tmp/nmt_data/tst2012 --test_prefix=/tmp/nmt_data/tst2013
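To make the manual launch command easier to read, the two comma-separated host lists can be split into a job-to-addresses mapping, which is the dictionary shape that `tf.train.ClusterSpec` accepts. This is only an illustrative sketch: `cluster_from_flags` and `hosts_by_machine` are hypothetical helpers, not Parallax functions, and nothing here is taken from `launch_ps.py` itself.

```python
def cluster_from_flags(ps_hosts, worker_hosts):
    """Split comma-separated host lists into a job -> addresses mapping."""
    def split(hosts):
        return [h.strip() for h in hosts.split(",") if h.strip()]

    return {"ps": split(ps_hosts), "worker": split(worker_hosts)}


def hosts_by_machine(cluster):
    """Group (job, task_index) pairs by IP to see what runs on each machine."""
    machines = {}
    for job, addresses in cluster.items():
        for index, address in enumerate(addresses):
            ip = address.rpartition(":")[0]
            machines.setdefault(ip, []).append((job, index))
    return machines


if __name__ == "__main__":
    # Values copied from the launch_ps.py command in this thread.
    cluster = cluster_from_flags(
        "10.0.0.103:36311,10.0.0.108:45713",
        "10.0.0.103:44472,10.0.0.103:45326,10.0.0.103:40853,10.0.0.103:45386,"
        "10.0.0.108:38017,10.0.0.108:45945,10.0.0.108:46772,10.0.0.108:38564",
    )
    for ip, tasks in hosts_by_machine(cluster).items():
        print(ip, tasks)
```

Grouping by machine shows at a glance that this particular command describes two ps tasks and eight workers spread over two hosts, and that the command as typed starts only ps task 0.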
Failure Logs
Hi,
Good day. I tried to run the Simple example in your code and ran into an issue.
CUDA Toolkit 9.0
CuDNN SDK v7
openmpi-3.0.0
NCCL 2.1.15 (for cuda9.0)
Below is the result from running the simple example:
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:139880522606336:PARALLAX: $ ssh -tt -p 22 10.0.0.103 'bash -c "source /home/jyi/parallax_venv/bin/activate; export PATH=/usr/local/cuda-9.0/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH; python3 -m ephemeral_port_reserve"' </dev/null
Connection to 10.0.0.103 closed.
WARNING:139880522606336:PARALLAX: $ ssh -tt -p 22 10.0.0.103 'bash -c "source /home/jyi/parallax_venv/bin/activate; export PATH=/usr/local/cuda-9.0/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH; python3 -m ephemeral_port_reserve"' </dev/null
Connection to 10.0.0.103 closed.
WARNING:139880522606336:PARALLAX: $ ssh -tt -p 22 10.0.0.103 'bash -c "source /home/jyi/parallax_venv/bin/activate; export PATH=/usr/local/cuda-9.0/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH; python3 -m ephemeral_port_reserve"' </dev/null
Connection to 10.0.0.103 closed.
WARNING:139880522606336:PARALLAX: $ ssh -p 22 10.0.0.103 "mkdir -p /tmp/parallax-jyi"
WARNING:139880522606336:PARALLAX: $ echo 'bash -c "export schroot -c jyi -u jyi;export GRPC_POLL_STRATEGY=poll; CUDA_VISIBLE_DEVICES=1; export PARALLAX_LOG_LEVEL=20; export PARALLAX_HOSTNAME=10.0.0.103; export PARALLAX_SEARCH=False; source /home/jyi/parallax_venv/bin/activate; python3 /home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py "' | ssh -p 22 10.0.0.103 'cat > /tmp/parallax-jyi/mpi_run.sh; chmod 777 /tmp/parallax-jyi/mpi_run.sh'
WARNING:139880522606336:PARALLAX: $ schroot -c jyi -u jyi;export GRPC_POLL_STRATEGY=poll; export CUDA_VISIBLE_DEVICES=1; source /home/jyi/parallax_venv/bin/activate; export PATH=/Home/.openmpi/bin:$PATH;export LD_LIBRARY_PATH=~/.openmpi/lib/:$LD_LIBRARY_PATH; mpirun -bind-to none -map-by slot --mca plm_rsh_no_tree_spawn 1 --mca orte_base_help_aggregate 0 -x NCCL_DEBUG=INFO -x PARALLAX_RUN_OPTION=PARALLAX_RUN_MPI -x PARALLAX_RESOURCE_INFO=master_10.0.0.103:40781:^ps_10.0.0.103:44002:^worker_10.0.0.103:46632:1 -np 1 -H 10.0.0.103:1 bash /tmp/parallax-jyi/mpi_run.sh 2>&1
/bin/sh: 1: schroot: not found
/bin/sh: 1: source: not found
bash: line 0: export: `-c': not a valid identifier
bash: line 0: export: `-u': not a valid identifier
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
INFO:139646709864192:PARALLAX:parallel_run(PARALLAX_RUN_MPI)
INFO:139646709864192:PARALLAX:resource master_10.0.0.103:40781:^ps_10.0.0.103:44002:^worker_10.0.0.103:46632:1
[[43684,1],0]: A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: node03
Another transport will be used instead, although this may result in lower performance.
NOTE: You can disable this warning by setting the MCA parameter btl_base_warn_component_unused to 0.
2020-05-14 05:49:50.857407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1412] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 7.93GiB freeMemory: 7.82GiB
2020-05-14 05:49:50.857465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1491] Adding visible gpu devices: 0
2020-05-14 05:49:51.340142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-14 05:49:51.340220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:978] 0
2020-05-14 05:49:51.340235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0: N
2020-05-14 05:49:51.340349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7535 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2020-05-14 05:49:51.602832: E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'
2020-05-14 05:49:51.602953: E tensorflow/core/common_runtime/executor.cc:630] Executor failed to create kernel. Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'
Traceback (most recent call last):
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in <module>
tf.app.run()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
parallax.parallel_run(single_gpu_graph, resource_info)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
return parallax_run_mpi(**kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 192, in parallax_run_mpi
config=sess_config)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
return self._sess_creator.create_session()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
init_fn=self._scaffold.init_fn)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 287, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()
. Registered: device='CPU'
Caused by op 'w', defined at:
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in <module>
tf.app.run()
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
parallax.parallel_run(single_gpu_graph, resource_info)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
return parallax_run_mpi(**kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 158, in parallax_run_mpi
tf.train.import_meta_graph(mpi_meta_graph_def)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1666, in import_meta_graph
meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1688, in _import_meta_graph_with_return_elements
**kwargs))
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
_ProcessNewOps(graph)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in <listcomp>
for c_op in c_api_util.new_tf_operations(self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3297, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[43684,1],0]
Exit code: 1