snuspl / parallax

A Tool for Automatic Parallelization of Deep Learning Training in Distributed Multi-GPU Environments.
Apache License 2.0
130 stars, 35 forks

Training Simple Example issue #33

Open Danielyijun opened 4 years ago

Danielyijun commented 4 years ago

Failure Logs

Hi,

Good day. I tried to run the "simple" example from your code and hit an issue while running it. My environment:

CUDA Toolkit 9.0, cuDNN SDK v7, openmpi-3.0.0, NCCL 2.1.15 (for CUDA 9.0)

Below is the result from running simple example:

/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:139880522606336:PARALLAX: $ ssh -tt -p 22 10.0.0.103 'bash -c "source /home/jyi/parallax_venv/bin/activate; export PATH=/usr/local/cuda-9.0/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH; python3 -m ephemeral_port_reserve"' </dev/null
Connection to 10.0.0.103 closed.
WARNING:139880522606336:PARALLAX: $ ssh -tt -p 22 10.0.0.103 'bash -c "source /home/jyi/parallax_venv/bin/activate; export PATH=/usr/local/cuda-9.0/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH; python3 -m ephemeral_port_reserve"' </dev/null
Connection to 10.0.0.103 closed.
WARNING:139880522606336:PARALLAX: $ ssh -tt -p 22 10.0.0.103 'bash -c "source /home/jyi/parallax_venv/bin/activate; export PATH=/usr/local/cuda-9.0/bin:$PATH; export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH; python3 -m ephemeral_port_reserve"' </dev/null
Connection to 10.0.0.103 closed.
WARNING:139880522606336:PARALLAX: $ ssh -p 22 10.0.0.103 "mkdir -p /tmp/parallax-jyi"
WARNING:139880522606336:PARALLAX: $ echo 'bash -c "export schroot -c jyi -u jyi;export GRPC_POLL_STRATEGY=poll; CUDA_VISIBLE_DEVICES=1; export PARALLAX_LOG_LEVEL=20; export PARALLAX_HOSTNAME=10.0.0.103; export PARALLAX_SEARCH=False; source /home/jyi/parallax_venv/bin/activate; python3 /home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py "' | ssh -p 22 10.0.0.103 'cat > /tmp/parallax-jyi/mpi_run.sh; chmod 777 /tmp/parallax-jyi/mpi_run.sh'
WARNING:139880522606336:PARALLAX: $ schroot -c jyi -u jyi;export GRPC_POLL_STRATEGY=poll; export CUDA_VISIBLE_DEVICES=1; source /home/jyi/parallax_venv/bin/activate; export PATH=/Home/.openmpi/bin:$PATH;export LD_LIBRARY_PATH=~/.openmpi/lib/:$LD_LIBRARY_PATH; mpirun -bind-to none -map-by slot --mca plm_rsh_no_tree_spawn 1 --mca orte_base_help_aggregate 0 -x NCCL_DEBUG=INFO -x PARALLAX_RUN_OPTION=PARALLAX_RUN_MPI -x PARALLAX_RESOURCE_INFO=master_10.0.0.103:40781:^ps_10.0.0.103:44002:^worker_10.0.0.103:46632:1 -np 1 -H 10.0.0.103:1 bash /tmp/parallax-jyi/mpi_run.sh 2>&1
/bin/sh: 1: schroot: not found
/bin/sh: 1: source: not found
bash: line 0: export: `-c': not a valid identifier
bash: line 0: export: `-u': not a valid identifier
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
INFO:139646709864192:PARALLAX:parallel_run(PARALLAX_RUN_MPI)
INFO:139646709864192:PARALLAX:resource master_10.0.0.103:40781:^ps_10.0.0.103:44002:^worker_10.0.0.103:46632:1

[[43684,1],0]: A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: node03

Another transport will be used instead, although this may result in lower performance.

NOTE: You can disable this warning by setting the MCA parameter btl_base_warn_component_unused to 0.

2020-05-14 05:49:50.857407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1412] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:03:00.0
totalMemory: 7.93GiB freeMemory: 7.82GiB
2020-05-14 05:49:50.857465: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1491] Adding visible gpu devices: 0
2020-05-14 05:49:51.340142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-14 05:49:51.340220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:978] 0
2020-05-14 05:49:51.340235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:991] 0: N
2020-05-14 05:49:51.340349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1104] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7535 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0, compute capability: 6.1)
2020-05-14 05:49:51.602832: E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'

2020-05-14 05:49:51.602953: E tensorflow/core/common_runtime/executor.cc:630] Executor failed to create kernel. Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'

 [[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Traceback (most recent call last):
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
    return fn(*args)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'

 [[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in <module>
    tf.app.run()
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
    parallax.parallel_run(single_gpu_graph, resource_info)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
    return parallax_run_mpi(**kwargs)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 192, in parallax_run_mpi
    config=sess_config)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 287, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
    run_metadata)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'

 [[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'w', defined at:
  File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 137, in <module>
    tf.app.run()
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/jyi/parallax/parallax/parallax/examples/simple/simple_driver.py", line 133, in main
    parallax.parallel_run(single_gpu_graph, resource_info)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/common/runner.py", line 189, in parallel_run
    return parallax_run_mpi(**kwargs)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/parallax/core/python/mpi/runner.py", line 158, in parallax_run_mpi
    tf.train.import_meta_graph(mpi_meta_graph_def)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1666, in import_meta_graph
    meta_graph_or_file, clear_devices, import_scope, **kwargs)[0]
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1688, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3438, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3297, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/home/jyi/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'

 [[{{node w}} = NGraphVariable[_class=["loc:@w/Assign"], container="", dtype=DT_FLOAT, just_looking=false, shape=[2,1], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[43684,1],0]
Exit code: 1

Danielyijun commented 4 years ago

Could you help me to check it out? Really appreciate it. Thank you :) @sj6077

sj6077 commented 4 years ago

@Danielyijun How did you install TensorFlow and Horovod? Parallax requires TensorFlow built from the submodule in this repository, and Horovod is also required. You can follow the installation guide here: https://github.com/snuspl/parallax/blob/master/doc/installation.md

Danielyijun commented 4 years ago

Yes, I followed the whole guide to install TensorFlow and Horovod. @sj6077

sj6077 commented 4 years ago

What OS are you using?

Danielyijun commented 4 years ago

Ubuntu 16.04

sj6077 commented 4 years ago

Did you enable nGraph when configuring TensorFlow build from source?

Danielyijun commented 4 years ago

No, how can I do it? @sj6077

sj6077 commented 4 years ago

I'm not sure about the NGraphVariable in your error message, but it seems related to the nGraph option shown below. If you enter "y" for it, it will be enabled.

[Screenshot, 2020-05-14 3:53:37 PM]

Danielyijun commented 4 years ago

Oh. If I got it right, I should have chosen "y" for nGraph. I will try to reinstall TensorFlow to make sure. I'll let you know how it goes. Thank you.

Danielyijun commented 4 years ago

May I ask whether the Bazel version matters? Do you have a recommended version?

sj6077 commented 4 years ago

You must not enable nGraph, so "N" or Enter is the right choice. As for Bazel, I used an old version; I think it was 0.15.

Danielyijun commented 4 years ago

Okay, I got it, thank you. :)

Danielyijun commented 4 years ago

Hi, I disabled nGraph when installing tensorflow but the issue was still there.


2020-05-14 09:24:36.503557: E tensorflow/core/framework/op_segment.cc:53] Create kernel failed: Not found: No registered 'NGraphVariable' OpKernel for GPU devices compatible with node {{node embeddings/encoder/embedding_encoder}} = NGraphVariable[_class=["loc:@embeddings/encoder/embedding_encoder/Assign"], container="train", dtype=DT_FLOAT, just_looking=false, shape=[7709,32], shared_name="", _device="/job:localhost/replica:0/task:0/device:GPU:0"]() . Registered: device='CPU'

sj6077 commented 4 years ago

I think it is closely related to nGraph, as in this pull request (https://github.com/NervanaSystems/ngraph-tf/pull/479). Please check your TensorFlow version. If you already have TensorFlow installed on the machine, you have to uninstall it and reinstall the new version.
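One way to check which TensorFlow Python would actually load is to look the package up without importing it. The following is only a minimal sketch using the standard library (it assumes Python 3.8+ for `importlib.metadata`; the helper name `describe_package` is made up for illustration and is not part of Parallax or TensorFlow). On the Python 3.6 setup in this thread, printing `tf.__version__` and `tf.__file__` after `import tensorflow as tf` gives the same information.

```python
import importlib.util
from importlib import metadata  # standard library from Python 3.8 on


def describe_package(import_name, dist_name=None):
    """Report where `import_name` would be loaded from and its installed version.

    Hypothetical helper for illustration. Returns None if the package
    cannot be found on the current import path.
    """
    spec = importlib.util.find_spec(import_name)
    if spec is None:
        return None
    try:
        version = metadata.version(dist_name or import_name)
    except metadata.PackageNotFoundError:
        # Importable, but no installed distribution metadata under that name
        # (for example a stray source folder rather than a pip install).
        version = None
    return {"origin": spec.origin, "version": version}


if __name__ == "__main__":
    print(describe_package("tensorflow"))
```

If the reported origin is not inside the environment where the source-built wheel was installed, a different copy is being picked up.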

sj6077 commented 4 years ago

Can you run any code without parallax? If the error occurs again, it's not the parallax issue.

Danielyijun commented 4 years ago

Can you run any code without parallax? If the error occurs again, it's not the parallax issue.

Hi, I tried this and found that I had another tensorflow folder with an older version, which was being picked up when running TensorFlow. I deleted the folder and the problem is gone. Thank you.
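A stray copy like this can be spotted by scanning the import path for every directory that contains the package. This is a rough sketch, not something from the Parallax docs; `find_package_copies` is a hypothetical helper name, and it only looks for plain directories (it would not notice copies inside zip files or `.egg` links).

```python
import os
import sys


def find_package_copies(package_name, search_paths=None):
    """List every directory on the search path that contains `package_name`.

    More than one hit means the first entry shadows the later ones.
    """
    paths = sys.path if search_paths is None else search_paths
    hits = []
    for entry in paths:
        # An empty sys.path entry means the current working directory.
        candidate = os.path.join(entry or os.getcwd(), package_name)
        if os.path.isdir(candidate) and candidate not in hits:
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for path in find_package_copies("tensorflow"):
        print(path)
```

The first path printed is the one Python imports; anything after it is shadowed.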

I can run nmt.py now, but another issue remains. When I run the distributed driver of nmt (with the worker and ps on the same node), it gets stuck here:

2020-05-14 20:36:55.357505: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:36:56.272566: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:37:05.357752: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:37:06.272764: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:37:15.358582: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-05-14 20:37:16.273010: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

It seems it cannot contact the ps. Could you help me figure it out? @sj6077
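A quick way to tell whether anything is listening at a ps address is a plain TCP connection attempt. The sketch below is an illustration only: `can_connect` is a hypothetical helper, and the host:port values to test have to be taken from the actual run, since the ports are reserved afresh each time. A successful connection only shows that some process is listening there, not that it is a working parameter server.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 check_ps.py HOST:PORT [HOST:PORT ...]
    for target in sys.argv[1:]:
        host, _, port = target.rpartition(":")
        print(target, "reachable" if can_connect(host, int(port)) else "NOT reachable")
```

If the ps address is not reachable while the master keeps printing "CreateSession still waiting", the ps process was most likely never started or is bound to a different port.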

Danielyijun commented 4 years ago

I tried running another command in a terminal:

CUDA_VISIBLE_DEVICES='' python3 /tmp/parallax-jyi/launch_ps.py --job_name=ps --task_index=0 --protocol=grpc --ps_hosts=10.0.0.103:36311,10.0.0.108:45713 --worker_hosts=10.0.0.103:44472,10.0.0.103:45326,10.0.0.103:40853,10.0.0.103:45386,10.0.0.108:38017,10.0.0.108:45945,10.0.0.108:46772,10.0.0.108:38564

This makes the whole training start. I wonder why we need to run that command in addition to:

python3 nmt_distributed_driver.py --src=vi --tgt=en --hparams_path=nmt/standard_hparams/wmt16_gnmt_4_layer.json --out_dir=/tmp/deen_gnmt --vocab_prefix=/tmp/nmt_data/vocab --train_prefix=/tmp/nmt_data/train --dev_prefix=/tmp/nmt_data/tst2012 --test_prefix=/tmp/nmt_data/tst2013
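For reference, the `--ps_hosts` and `--worker_hosts` flags in the launch_ps.py command follow the usual TensorFlow 1.x cluster convention: each job is a list of host:port addresses, and `/job:ps/replica:0/task:0` in the master log is the first ps entry. The contents of launch_ps.py are not shown in this thread, so the sketch below is an assumption about how such flags are typically turned into the dict that `tf.train.ClusterSpec` accepts; `build_cluster` and `task_address` are hypothetical names.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Turn comma-separated host:port strings into a {job_name: [addresses]} dict."""
    def split(hosts):
        return [h.strip() for h in hosts.split(",") if h.strip()]

    return {"ps": split(ps_hosts), "worker": split(worker_hosts)}


def task_address(cluster, job_name, task_index):
    """Address that a task such as /job:ps/task:0 is expected to listen on."""
    return cluster[job_name][task_index]


if __name__ == "__main__":
    cluster = build_cluster(
        "10.0.0.103:36311,10.0.0.108:45713",
        "10.0.0.103:44472,10.0.0.103:45326,10.0.0.103:40853,10.0.0.103:45386,"
        "10.0.0.108:38017,10.0.0.108:45945,10.0.0.108:46772,10.0.0.108:38564",
    )
    print(task_address(cluster, "ps", 0))
```

Under that convention the workers block in CreateSession until a server process is listening on every ps address, which would be consistent with training starting only once the ps for task 0 was launched by hand.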