tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

distributed training fails #975

Open Mack-y opened 5 years ago

Mack-y commented 5 years ago

Description

Hello everyone, I'm a newbie with t2t and TensorFlow. I tried to use t2t to run the transformer_moe model on 2 machines, but it failed. Each machine has only one GPU. I hope you can help me figure out what the problem is and solve it. Thanks!

Environment information

PROBLEM=translate_ende_wmt32k
MODEL=transformer_moe
HPARAMS=transformer_moe_base
DATA_DIR=~/distributed_train_share/t2t_data
TMP_DIR=~/distributed_train_share/t2t_datagen
TRAIN_DIR=~/distributed_train_share/t2t_train/$PROBLEM/$MODEL-$HPARAMS

OS:

$ pip freeze | grep tensor
tensor2tensor==1.6.6
tensorboard==1.7.0
tensorflow-gpu==1.7.0
tensorflow-tensorboard==0.4.0

$ python -V
Python 2.7.6

For bugs: reproduction and error logs

Steps to reproduce:

//master: export TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["172.17.2.11:2222"], "master": ["172.17.2.7:2222"]}, "task": {"index": 0, "type": "master"}}'

t2t-trainer --data_dir=$DATA_DIR --problem=$PROBLEM --model=$MODEL --output_dir=$TRAIN_DIR --hparams_set=$HPARAMS --train_steps=1500 --hparams='layer_types=a/a/a-moe/a' --master=grpc://172.17.2.7:2222 --worker_replicas=1 --worker_gpu=1 --worker_id=0 --worker_job='/job:master' --ps_replicas=1 --ps_gpu=1 --schedule=train

//ps: export TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["172.17.2.11:2222"], "master": ["172.17.2.7:2222"]}, "task": {"index": 0, "type": "ps"}}'

t2t-trainer --data_dir=$DATA_DIR --problem=$PROBLEM --model=$MODEL --output_dir=$TRAIN_DIR --hparams_set=$HPARAMS --train_steps=1500 --hparams='layer_types=a/a/a-moe/a' --master=grpc://172.17.2.11:2222 --schedule=run_std_server
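
For reference, a quick sanity check of the TF_CONFIG value each process sees before launching the trainer (just a debugging sketch, using the addresses from above):

import json
import os

# Print what TensorFlow will see for this process.
cfg = json.loads(os.environ["TF_CONFIG"])
print("cluster:", cfg["cluster"])  # expect {'ps': ['172.17.2.11:2222'], 'master': ['172.17.2.7:2222']}
print("task:", cfg["task"])        # expect {'index': 0, 'type': 'master'} on the master machine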

Error logs:

//master log: WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version. Instructions for updating: Use the retry module or similar alternatives. INFO:tensorflow:Overriding hparams in transformer_moe_base with layer_types=a/a/a-moe/a [2018-08-07 14:33:38,532] Overriding hparams in transformer_moe_base with layer_types=a/a/a-moe/a WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py:165: init (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version. Instructions for updating: When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead. [2018-08-07 14:33:38,590] From /usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py:165: init (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version. Instructions for updating: When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead. INFO:tensorflow:schedule=train [2018-08-07 14:33:38,591] schedule=train INFO:tensorflow:worker_gpu=1 [2018-08-07 14:33:38,591] worker_gpu=1 INFO:tensorflow:sync=False [2018-08-07 14:33:38,591] sync=False INFO:tensorflow:datashard_devices: [<bound method _ReplicaDeviceChooser.device_function of <tensorflow.python.training.device_setter._ReplicaDeviceChooser object at 0x7f146ea3fb90>>] [2018-08-07 14:33:38,591] datashard_devices: [<bound method _ReplicaDeviceChooser.device_function of <tensorflow.python.training.device_setter._ReplicaDeviceChooser object at 0x7f146ea3fb90>>] INFO:tensorflow:caching_devices: None [2018-08-07 14:33:38,591] caching_devices: None INFO:tensorflow:ps_devices: ['/job:ps/task:0/GPU:0'] [2018-08-07 14:33:38,591] ps_devices: ['/job:ps/task:0/GPU:0'] INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 20, '_task_type': u'mater', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f146ea3f950>, '_keep_checkpoint_every_n_hours': 10000, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.95 } allow_soft_placement: true graph_options { optimizer_options { } } , 'use_tpu': False, '_tf_random_seed': None, '_num_worker_replicas': 1, '_task_id': 0, 't2t_device_info': {'num_async_replicas': 1}, '_evaluation_master': 'grpc://172.17.2.7:2222', '_log_step_count_steps': 100, '_num_ps_replicas': 1, '_is_chief': False, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1 } , '_save_checkpoints_steps': 1000, '_environment': u'cloud', '_master': 'grpc://172.17.2.7:2222', '_model_dir': '/home/yb/distributed_train_share/t2t_train/translate_ende_wmt32k/transformer_moe-transformer_moe_base', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f146ea3fc90>, '_save_summary_steps': 100} [2018-08-07 14:33:38,795] Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 20, '_task_type': u'mater', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f146ea3f950>, '_keep_checkpoint_every_n_hours': 10000, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.95 } allow_soft_placement: true graph_options { optimizer_options { } } , 'use_tpu': False, '_tf_random_seed': None, '_num_worker_replicas': 1, '_task_id': 0, 't2t_device_info': {'num_async_replicas': 
1}, '_evaluation_master': 'grpc://172.17.2.7:2222', '_log_step_count_steps': 100, '_num_ps_replicas': 1, '_is_chief': False, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1 } , '_save_checkpoints_steps': 1000, '_environment': u'cloud', '_master': 'grpc://172.17.2.7:2222', '_model_dir': '/home/yb/distributed_train_share/t2t_train/translate_ende_wmt32k/transformer_moe-transformer_moe_base', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f146ea3fc90>, '_save_summary_steps': 100} WARNING:tensorflow:Estimator's model_fn (<function wrapping_model_fn at 0x7f146e9948c0>) includes params argument, but params are not passed to Estimator. [2018-08-07 14:33:38,796] Estimator's model_fn (<function wrapping_model_fn at 0x7f146e9948c0>) includes params argument, but params are not passed to Estimator. INFO:tensorflow:Reading data files from /home/yb/distributed_train_share/t2t_data/translate_ende_wmt32k-train [2018-08-07 14:33:38,828] Reading data files from /home/yb/distributed_train_share/t2t_data/translate_ende_wmt32k-train INFO:tensorflow:partition: 0 num_data_files: 100 [2018-08-07 14:33:38,832] partition: 0 num_data_files: 100 INFO:tensorflow:Calling model_fn. [2018-08-07 14:33:39,507] Calling model_fn. INFO:tensorflow:Setting T2TModel mode to 'train' [2018-08-07 14:33:39,971] Setting T2TModel mode to 'train' INFO:tensorflow:Using variable initializer: uniform_unit_scaling [2018-08-07 14:33:39,971] Using variable initializer: uniform_unit_scaling INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_33708_512.bottom [2018-08-07 14:33:40,049] Transforming feature 'inputs' with symbol_modality_33708_512.bottom INFO:tensorflow:Transforming 'targets' with symbol_modality_33708_512.targets_bottom [2018-08-07 14:33:40,074] Transforming 'targets' with symbol_modality_33708_512.targets_bottom INFO:tensorflow:Encoder architecture: [2018-08-07 14:33:40,361] Encoder architecture: INFO:tensorflow: Layer 0: a - fc [2018-08-07 14:33:40,361] Layer 0: a - fc INFO:tensorflow: Layer 1: a - fc [2018-08-07 14:33:40,361] Layer 1: a - fc INFO:tensorflow: Layer 2: a - moe [2018-08-07 14:33:40,361] Layer 2: a - moe INFO:tensorflow: Layer 3: a - fc [2018-08-07 14:33:40,361] Layer 3: a - fc INFO:tensorflow:Decoder architecture: [2018-08-07 14:33:40,361] Decoder architecture: INFO:tensorflow: Layer 0: a - a - fc [2018-08-07 14:33:40,361] Layer 0: a - a - fc INFO:tensorflow: Layer 1: a - a - fc [2018-08-07 14:33:40,361] Layer 1: a - a - fc INFO:tensorflow: Layer 2: a - a - moe [2018-08-07 14:33:40,361] Layer 2: a - a - moe INFO:tensorflow: Layer 3: a - a - fc [2018-08-07 14:33:40,361] Layer 3: a - a - fc INFO:tensorflow:Transforming body output with symbol_modality_33708_512.top [2018-08-07 14:33:53,691] Transforming body output with symbol_modality_33708_512.top INFO:tensorflow:Base learning rate: 0.100000 [2018-08-07 14:33:53,892] Base learning rate: 0.100000 INFO:tensorflow:Trainable Variables Total size: 311045120 [2018-08-07 14:33:53,907] Trainable Variables Total size: 311045120 INFO:tensorflow:Using optimizer Adam [2018-08-07 14:33:53,907] Using optimizer Adam /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory. "Converting sparse IndexedSlices to a dense Tensor of unknown shape. " INFO:tensorflow:Done calling model_fn. [2018-08-07 14:34:18,322] Done calling model_fn. 
INFO:tensorflow:Create CheckpointSaverHook. [2018-08-07 14:34:18,323] Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized. [2018-08-07 14:34:22,997] Graph was finalized.
---------------------training process suspended here------------------------

//ps log: WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version. Instructions for updating: Use the retry module or similar alternatives. INFO:tensorflow:Overriding hparams in transformer_moe_base with layer_types=a/a/a-moe/a [2018-08-07 22:47:03,311] Overriding hparams in transformer_moe_base with layer_types=a/a/a-moe/a WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py:165: init (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version. Instructions for updating: When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead. [2018-08-07 22:47:03,370] From /usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py:165: init (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version. Instructions for updating: When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead. INFO:tensorflow:schedule=run_std_server [2018-08-07 22:47:03,371] schedule=run_std_server INFO:tensorflow:worker_gpu=1 [2018-08-07 22:47:03,371] worker_gpu=1 INFO:tensorflow:sync=False [2018-08-07 22:47:03,371] sync=False INFO:tensorflow:datashard_devices: ['/job:localhost'] [2018-08-07 22:47:03,371] datashard_devices: ['/job:localhost'] INFO:tensorflow:caching_devices: None [2018-08-07 22:47:03,371] caching_devices: None INFO:tensorflow:ps_devices: ['gpu:0'] [2018-08-07 22:47:03,371] ps_devices: ['gpu:0'] INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 20, '_task_type': u'ps', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9eb2a0b990>, '_keep_checkpoint_every_n_hours': 10000, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.95 } allow_soft_placement: true graph_options { optimizer_options { } } , 'use_tpu': False, '_tf_random_seed': None, '_num_worker_replicas': 1, '_task_id': 0, 't2t_device_info': {'num_async_replicas': 1}, '_evaluation_master': 'grpc://172.17.2.11:2222', '_log_step_count_steps': 100, '_num_ps_replicas': 1, '_is_chief': False, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1.0 } , '_save_checkpoints_steps': 1000, '_environment': u'cloud', '_master': 'grpc://172.17.2.11:2222', '_model_dir': '/home/yb/distributed_train_share/t2t_train/translate_ende_wmt32k/transformer_moe-transformer_moe_base', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f9eb2a0b950>, '_save_summary_steps': 100} [2018-08-07 22:47:03,557] Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 20, '_task_type': u'ps', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9eb2a0b990>, '_keep_checkpoint_every_n_hours': 10000, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.95 } allow_soft_placement: true graph_options { optimizer_options { } } , 'use_tpu': False, '_tf_random_seed': None, '_num_worker_replicas': 1, '_task_id': 0, 't2t_device_info': {'num_async_replicas': 1}, '_evaluation_master': 'grpc://172.17.2.11:2222', '_log_step_count_steps': 100, '_num_ps_replicas': 1, '_is_chief': False, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1.0 } , '_save_checkpoints_steps': 1000, '_environment': u'cloud', '_master': 
'grpc://172.17.2.11:2222', '_model_dir': '/home/yb/distributed_train_share/t2t_train/translate_ende_wmt32k/transformer_moe-transformer_moe_base', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f9eb2a0b950>, '_save_summary_steps': 100} WARNING:tensorflow:Estimator's model_fn (<function wrapping_model_fn at 0x7f9eb299f140>) includes params argument, but params are not passed to Estimator. [2018-08-07 22:47:03,557] Estimator's model_fn (<function wrapping_model_fn at 0x7f9eb299f140>) includes params argument, but params are not passed to Estimator. 2018-08-07 22:47:03.558818: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2018-08-07 22:47:04.754092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531 pciBusID: 0000:83:00.0 totalMemory: 22.38GiB freeMemory: 22.21GiB 2018-08-07 22:47:04.754138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0 2018-08-07 22:47:05.048669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix: 2018-08-07 22:47:05.048720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0 2018-08-07 22:47:05.048730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N 2018-08-07 22:47:05.049243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:ps/replica:0/task:0/device:GPU:0 with 22912 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:83:00.0, compute capability: 6.1) 2018-08-07 22:47:05.054553: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 22.38G (24025956352 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY 2018-08-07 22:47:05.274351: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> 172.17.2.7:2222} 2018-08-07 22:47:05.274382: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222} 2018-08-07 22:47:05.280352: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:333] Started server with target: grpc://localhost:2222

I tried to swap the gRPC addresses between ps and master, as below:

//master: t2t-trainer --data_dir=$DATA_DIR --problem=$PROBLEM --model=$MODEL --output_dir=$TRAIN_DIR --hparams_set=$HPARAMS --train_steps=1500 --hparams='layer_types=a/a/a-moe/a' --master=grpc://172.17.2.11:2222 --worker_replicas=1 --worker_gpu=1 --worker_id=0 --worker_job='/job:master' --ps_replicas=1 --ps_gpu=1 --schedule=train

//ps: t2t-trainer --data_dir=$DATA_DIR --problem=$PROBLEM --model=$MODEL --output_dir=$TRAIN_DIR --hparams_set=$HPARAMS --train_steps=1500 --hparams='layer_types=a/a/a-moe/a' --master=grpc://172.17.2.7:2222 --schedule=run_std_server

Running the scripts gives:

//ps:
2018-08-07 19:57:44.289785: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 22.38G (24025956352 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-08-07 19:57:44.507771: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> 172.17.2.7:2222}
2018-08-07 19:57:44.507792: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-08-07 19:57:44.514275: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:333] Started server with target: grpc://localhost:2222

2018-08-07 22:23:47.763034: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-08-07 22:23:59.024864: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
2018-08-07 22:24:10.142981: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

//master:
[2018-08-07 14:10:28,964] Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook. [2018-08-07 14:10:28,965] Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized. [2018-08-07 14:10:33,679] Graph was finalized.

INFO:tensorflow:An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. Error: OS Error
[2018-08-07 14:10:46,759] An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. Error: OS Error
INFO:tensorflow:Graph was finalized.
[2018-08-07 14:10:46,760] Graph was finalized.
INFO:tensorflow:An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. Error: OS Error
[2018-08-07 14:10:57,997] An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. Error: OS Error
INFO:tensorflow:Graph was finalized.
[2018-08-07 14:10:57,997] Graph was finalized.
INFO:tensorflow:An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. Error: OS Error
[2018-08-07 14:11:09,195] An error was raised while a session was being created. This may be due to a preemption of a connected worker or parameter server. A new session will be created. Error: OS Error

The OS error shows up on both sides.

Questions: 1. Does the gRPC address refer to the remote address or the local address? 2. Is my configuration correct?

Thanks!

Mack-y commented 5 years ago

We modified trainer_lib.py to create a server for the worker, so it works well on one machine with multiple GPUs. But this issue is still unsolved. I hope someone can help me deal with it. Many thanks!

Mack-y commented 5 years ago

@rsepassi Could you help me solve the problem?

1nsunym commented 5 years ago

I have exactly the same problem.

I thought it might be a network issue, so I tried a simple distributed TensorFlow example, and it works fine.
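
For reference, a minimal distributed TF check along these lines is enough to exercise gRPC between the two machines (addresses taken from this issue; just a sketch, not the exact script):

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["172.17.2.11:2222"],
    "master": ["172.17.2.7:2222"],
})

# On the ps machine: start a server and block.
#   tf.train.Server(cluster, job_name="ps", task_index=0).join()

# On the master machine: start its own server, place a variable on the ps,
# and run a trivial op over gRPC.
server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.device("/job:ps/task:0"):
    v = tf.Variable(1.0)
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(v + 1.0))  # prints 2.0 if cross-machine gRPC works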

Mack-y commented 5 years ago

@1nsunym Could you show me how you did it? I think I should also run a simple distributed TensorFlow example to figure out how t2t works. Thank you!

Mack-y commented 5 years ago

@1nsunym Also, what was your network problem? I could check whether the same problem exists in my environment.

1nsunym commented 5 years ago

@Mack-y Hi, sorry for the confusion. I meant that a simple distributed TF example works fine in my environment, whereas distributed T2T doesn't. I'm still waiting on some kind of solution, just like you, but I'm thinking of trying different versions of T2T.

vsuthichai commented 5 years ago

@1nsunym I'm experiencing what seems to be the same issue as you. The master seems to hang right when the graph finalizes, and no master listen socket ever opens. Additionally, the PS fails with CUDA_ERROR_OUT_OF_MEMORY (8 GPUs, so 8 error messages) and then hangs as well. This is really strange. I haven't done anything configuration-wise out of the ordinary and have pretty much followed the instructions here: https://github.com/tensorflow/tensor2tensor/blob/master/docs/distributed_training.md

I've tried setting my batch size to 1024 and even 1 just to see if that was the issue. There is a part in the configuration where per_process_gpu_memory_fraction is set to 1 and I'm wondering if that's the cause. Have you had any luck?
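
For reference, my understanding is that the fraction in that config corresponds to plain TF session options roughly like this (a sketch, not t2t's actual code):

import tensorflow as tf

# per_process_gpu_memory_fraction=1.0 lets the process pre-allocate essentially
# the whole GPU; a smaller value caps the pre-allocation.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)
# sess = tf.Session(config=config)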

1nsunym commented 5 years ago

@vsuthichai My command-line flags are pretty much the same as @Mack-y's, and I haven't had any luck yet. I'm not working on the issue any more because I have no access to a multi-node environment at the moment. I'll post an update once I get back to working on distributed training.

rsepassi commented 5 years ago

Please see the updated documentation on distributed training. I tested training sync and async and things seem to work fine now. Let me know how it goes.

vsuthichai commented 5 years ago

@rsepassi Appreciate the quick response and fix. Unfortunately, the problem hasn't been fully resolved for me. The parameter server error goes away; I'm no longer receiving the CUDA out-of-memory error. However, the master still sits at "Graph was finalized", and I can see no listen socket open up on port 8000. The parameter server does open its listen port on 8001.

On my master stdout:

+ PROBLEM=translate_ende_wmt32k
+ MODEL=transformer
+ HPARAMS=transformer_big
+ NUM_GPUS=8
+ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+ DATA_DIR=/home/ubuntu/benchmarks/scripts/t2t_data
+ TMP_DIR=/tmp/t2t_datagen
+ TRAIN_DIR=/home/ubuntu/benchmarks/scripts/t2t_train/translate_ende_wmt32k/transformer-transformer_big
+ export 'TF_CONFIG={"cluster": {"ps": ["10.0.1.201:8001"], "master": ["10.0.0.205:8000"]}, "task": {"type": "master", "index": 0}, "environment": "cloud"}'
+ TF_CONFIG='{"cluster": {"ps": ["10.0.1.201:8001"], "master": ["10.0.0.205:8000"]}, "task": {"type": "master", "index": 0}, "environment": "cloud"}'
+ rm -rf /home/ubuntu/benchmarks/scripts/t2t_train
+ tensor2tensor/bin/t2t-trainer --data_dir=/home/ubuntu/benchmarks/scripts/t2t_data --problem=translate_ende_wmt32k --model=transformer --hparams_set=transformer_big --output_dir=/home/ubuntu/benchmarks/scripts/t2t_train/translate_ende_wmt32k/transformer-transformer_big --train_steps=4732 --worker_gpu_memory_fraction=0.8 --master=grpc://10.0.0.205:8000 --ps_replicas=1 --worker_replicas=1 --worker_gpu=0 --worker_id=0 --ps_gpu=1 --sync --schedule=train --worker_job=/job:master
/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
WARNING:tensorflow:From /home/ubuntu/benchmarks/scripts/tensor2tensor/tensor2tensor/utils/trainer_lib.py:199: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
INFO:tensorflow:schedule=train
INFO:tensorflow:worker_gpu=0
INFO:tensorflow:sync=True
INFO:tensorflow:datashard_devices: [<bound method _ReplicaDeviceChooser.device_function of <tensorflow.python.training.device_setter._ReplicaDeviceChooser object at 0x7fbcefcdc978>>]
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['/job:ps/task:0/GPU:0']
INFO:tensorflow:Using config: {'_task_type': 'master', '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fbcefcdc9e8>, '_master': 'grpc://10.0.0.205:8000', '_num_ps_replicas': 1, '_num_worker_replicas': 1, '_environment': 'cloud', '_is_chief': True, '_evaluation_master': 'grpc://10.0.0.205:8000', '_train_distribute': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': None, '_log_step_count_steps': 100, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.8
}
allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, '_save_checkpoints_steps': 1000, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/home/ubuntu/benchmarks/scripts/t2t_train/translate_ende_wmt32k/transformer-transformer_big', 'warm_start_from': None, 'use_tpu': False, 't2t_device_info': {'num_async_replicas': 1}, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7fbcefcdc940>}
WARNING:tensorflow:Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7fbcefcd0bf8>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Reading data files from /home/ubuntu/benchmarks/scripts/t2t_data/translate_ende_wmt32k-train*
INFO:tensorflow:partition: 0 num_data_files: 100
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_33945_1024.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_33945_1024.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_33945_1024.top
INFO:tensorflow:Base learning rate: 2.000000
INFO:tensorflow:Trainable Variables Total size: 211080192
INFO:tensorflow:Using optimizer Adam
/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.

and on PS stdout:

+ PROBLEM=translate_ende_wmt32k
+ MODEL=transformer
+ HPARAMS=transformer_big
+ NUM_GPUS=8
+ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+ DATA_DIR=/home/ubuntu/benchmarks/scripts/t2t_data
+ TMP_DIR=/tmp/t2t_datagen
+ TRAIN_DIR=/home/ubuntu/benchmarks/scripts/t2t_train/translate_ende_wmt32k/transformer-transformer_big
+ export 'TF_CONFIG={"cluster": {"ps": ["10.0.1.201:8001"], "master": ["10.0.0.205:8000"]}, "task": {"type": "ps", "index": 0}, "environment": "cloud"}'
+ TF_CONFIG='{"cluster": {"ps": ["10.0.1.201:8001"], "master": ["10.0.0.205:8000"]}, "task": {"type": "ps", "index": 0}, "environment": "cloud"}'
+ rm -rf /home/ubuntu/benchmarks/scripts/t2t_train
+ tensor2tensor/bin/t2t-trainer --schedule=run_std_server
/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'ps': ['10.0.1.201:8001'], 'master': ['10.0.0.205:8000']}, 'task': {'type': 'ps', 'index': 0}, 'environment': 'cloud'}

rsepassi commented 5 years ago

Try using tensor2tensor at HEAD. (clone and then pip install .)

vsuthichai commented 5 years ago

@rsepassi Thanks, yeah, I'm still encountering the same issue. Both just hang there; however, the PS does open a listen socket and the error has gone away. If the output directory is not shared, would that cause the issue?

Also, I'm in need of additional clarification. I was under the assumption that if I only set --worker_gpu=8 and ignore TF_CONFIG and any other setup for sync or async training, it would simply default to async training with 8 GPUs. I'm assuming this because I see the configuration printout when I launch t2t-trainer and it specifically says

INFO:tensorflow:sync=False

Now I'm not so sure after having read the updated distributed training documentation, because it says that if I simply set --worker_gpu=8, it will assume synchronous training with 8 GPUs on one node.

Which is it really? sync or async? If it is synchronous, the flag there is a bit misleading.

vsuthichai commented 5 years ago

@rsepassi A bit of an update on the situation. I followed the updated instructions to no avail. Here is a thread dump of where it hangs.

Thread 0x00007f3825ffb700 (most recent call first):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/threading.py", line 295 in wait
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/queue.py", line 164 in get
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 159 in run
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f392c775700 (most recent call first):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1353 in _extend_graph
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316 in _run_fn
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334 in _do_call
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328 in _do_run
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1138 in _run
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900 in run
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 288 in prepare_session
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 479 in create_session
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 708 in create_session
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1019 in _create_session
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1014 in __init__
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 551 in __init__
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 828 in __init__
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 416 in MonitoredTrainingSession
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1341 in _train_with_estimator_spec
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1135 in _train_model_default
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1119 in _train_model
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 366 in train
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensor2tensor/utils/trainer_lib.py", line 345 in train
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py", line 326 in execute_schedule
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py", line 385 in main
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/bin/t2t-trainer", line 31 in main
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125 in run
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/bin/t2t-trainer", line 35 in <module>

It seems to be coming from a blocking queue within the event file writer. Does this ring any bells?

Update: Actually the event file writer thread makes sense. It's just waiting for events to write to disk. However, beyond _extend_graph, I'm not quite sure what's happening.
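
For reference, dumps in this format come from Python's stdlib faulthandler (enabled here via PYTHONFAULTHANDLER); a minimal way to trigger one on demand, as a sketch of the general recipe rather than exactly what I ran:

import faulthandler
import signal
import sys

# Dump every thread's stack to stderr when the process receives SIGUSR1,
# i.e. run `kill -USR1 <pid>` from another shell.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# One-off dump from code:
# faulthandler.dump_traceback(file=sys.stderr, all_threads=True)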

vsuthichai commented 5 years ago

@rsepassi An additional update:

#0  0x00007f499d16ea13 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007f48f4b75584 in pollset_work(grpc_pollset*, grpc_pollset_worker**, long) () from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007f48f4b97ebf in cq_pluck(grpc_completion_queue*, void*, gpr_timespec, void*) () from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007f48f4b9828b in grpc_completion_queue_pluck () from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007f48f4b1f058 in grpc::CoreCodegen::grpc_completion_queue_pluck(grpc_completion_queue*, void*, gpr_timespec, void*) ()
   from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007f48f4a771af in grpc::internal::BlockingUnaryCallImpl<tensorflow::CreateSessionRequest, tensorflow::CreateSessionResponse>::BlockingUnaryCallImpl(grpc::ChannelInterface*, grpc::internal::RpcMethod const&, grpc::ClientContext*, tensorflow::CreateSessionRequest const&, tensorflow::CreateSessionResponse*) () from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x00007f48f4a77850 in tensorflow::grpc::MasterService::Stub::CreateSession(grpc::ClientContext*, tensorflow::CreateSessionRequest const&, tensorflow::CreateSessionResponse*) ()
   from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x00007f48f4a66adb in tensorflow::GrpcRemoteMaster::CreateSession(tensorflow::CallOptions*, tensorflow::CreateSessionRequest const*, tensorflow::CreateSessionResponse*) ()
   from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8  0x00007f48f4a5ec5d in tensorflow::GrpcSession::CreateImpl(tensorflow::CallOptions*, tensorflow::GraphDef const&) ()
   from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9  0x00007f48f4a5f163 in tensorflow::GrpcSession::Create(tensorflow::GraphDef const&) () from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f48f4a5f42f in tensorflow::GrpcSession::ExtendImpl(tensorflow::CallOptions*, tensorflow::GraphDef const&) ()
   from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f48f4a5f5b3 in tensorflow::GrpcSession::Extend(tensorflow::GraphDef const&) () from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f48f4d3eba3 in tensorflow::ExtendSessionGraphHelper(TF_Session*, TF_Status*) () from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f48f49f5151 in tensorflow::ExtendSession(TF_Session*, TF_Status*) () from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f48f4990996 in _wrap_ExtendSession () from /home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so

A stack trace from _pywrap_tensorflow_internal. I switched to Python 2.7 and TensorFlow 1.8.0, changing versions repeatedly, to see if maybe it was my versions. The problem still happened. It looks like it's trying to create a gRPC session against the master.

vsuthichai commented 5 years ago

It appears that the master is trying to connect to itself over ipv4:10.0.0.205:8000 through grpc. However, there is no listen port up. The following is from turning the GRPC debugging verbosity up.

I0823 01:05:53.842503259   69902 subchannel.cc:425]          Failed to connect to channel, retrying
I0823 01:05:53.842733434   69792 subchannel.cc:646]          Connect failed: {"created":"@1534986353.842665545","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":201,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.0.0.205:8000"}
I0823 01:05:53.842772416   69792 subchannel.cc:470]          Subchannel 0x7f7d1082b000: Retry in 1000 milliseconds
I0823 01:05:54.842512221   69903 subchannel.cc:425]          Failed to connect to channel, retrying
I0823 01:05:54.842790941   69792 subchannel.cc:646]          Connect failed: {"created":"@1534986354.842711185","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":201,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.0.0.205:8000"}
I0823 01:05:54.842827053   69792 subchannel.cc:470]          Subchannel 0x7f7d1082b000: Retry in 1000 milliseconds
I0823 01:05:55.842467629   69902 subchannel.cc:425]          Failed to connect to channel, retrying
I0823 01:05:55.842699752   69792 subchannel.cc:646]          Connect failed: {"created":"@1534986355.842629684","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":201,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.0.0.205:8000"}
I0823 01:05:55.842738258   69792 subchannel.cc:470]          Subchannel 0x7f7d1082b000: Retry in 1000 milliseconds
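
A quick way to confirm from another shell that nothing is actually listening on the address those subchannels keep retrying (IP and port taken from the log above):

import socket

s = socket.socket()
s.settimeout(3)
# 0 means something accepted the connection; 111 (ECONNREFUSED on Linux)
# matches the "Connection refused" in the gRPC log above.
print(s.connect_ex(("10.0.0.205", 8000)))
s.close()
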
vsuthichai commented 5 years ago

@rsepassi I managed to get it moving past the hang with the fix provided by https://gist.github.com/jinliangwei/eed8fd564a35deae3b892092f3171866 and by changing the cluster configuration so that the job named master is named chief. It seems there were two problems: the master (aka chief) server was never started, and the job had to be explicitly named chief so that a chief was recognized within the cluster configuration. I see a checkpoint being created now and a loss being printed to the terminal, so I believe it's training in a distributed fashion now. nvidia-smi shows all GPUs allocated through --ps_gpu being utilized.
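
Roughly the kind of change the gist makes, as far as I understand it (a sketch, not the actual patch): start a tf.train.Server for the chief task before the Estimator tries to create a session against it.

import json
import os

import tensorflow as tf

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf.train.ClusterSpec(tf_config["cluster"])
task = tf_config["task"]

if task["type"] == "chief":
    # Without this, nothing ever listens on the chief's address and the
    # MonitoredTrainingSession hangs retrying CreateSession over gRPC.
    server = tf.train.Server(cluster, job_name="chief", task_index=task["index"])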

vsuthichai commented 5 years ago

In case anyone is still following this thread: even though I got it working, my throughput on 2 nodes with 8 GPUs each is pretty bad and degrades to roughly what I get running on a single node with 4 GPUs. A single node with 8 GPUs still performs best.

Mack-y commented 5 years ago

@vsuthichai I've tried to run distributed training across 3 machines: one as ps, one as chief (modifying MonitoredTrainingSession to set is_chief=True on that machine), and another as worker. I also left the schedule at the default (namely continuous_train_and_eval), which creates a gRPC server for the workers. The worker node managed to train, but eval fails and the following error happened:

INFO:tensorflow:loss = 6.0623927, step = 801 (360.263 sec)
INFO:tensorflow:loss = 5.889252, step = 901 (357.355 sec)
INFO:tensorflow:loss = 5.7690663, step = 1001 (358.192 sec)
INFO:tensorflow:loss = 5.6357236, step = 1101 (358.246 sec)
INFO:tensorflow:loss = 5.274427, step = 1201 (360.637 sec)
INFO:tensorflow:loss = 5.042751, step = 1301 (355.236 sec)
INFO:tensorflow:loss = 5.640498, step = 1401 (356.206 sec)
INFO:tensorflow:Loss for final step: 5.4658346.
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 32, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/usr/local/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 359, in main
    execute_schedule(exp)
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 306, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py", line 290, in continuous_train_and_eval
    return self.evaluate()
  File "/usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py", line 310, in evaluate
    name="eval")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 417, in evaluate
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 918, in _evaluate_model
    format(self._model_dir))
ValueError: Could not find trained model in model_dir: /home/yb/distributed_train_share/t2t_train/translate_ende_wmt32k/transformer_moe_base/.

Meanwhile, an error always shows up on the chief node:


NotFoundError (see above for traceback): /home/yb/distributed_train_share/t2t_train/translate_ende_wmt32k/transformer_moe_base/model.ckpt-9_temp_a5d63af23c3f4c549c19299996caf4b5; No such file or directory [[Node: save/SaveV2_1 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/replica:0/task:0/device:CPU:0"](save/ShardedFilename_1, save/SaveV2_1/tensor_names, save/SaveV2_1/shape_and_slices, training/beta1_power_G1574, training/beta2_power_G1576, training/transformer_moe/decoder/layer_0/att_ende_a/layer_prepostprocess/layer_norm/layer_norm_bias/Adam_G1578, training/transformer_moe/decoder/layer_0/att_ende_a/layer_prepostprocess/layer_norm/layer_norm_bias/Adam_1_G1580, training/transformer_moe/decoder/layer_0/att_ende_a/layer_prepostprocess/layer_norm/layer_norm_scale/Adam_G1582, training/transformer_moe/decoder/layer_0/att_ende_a/layer_prepostprocess/layer_norm/layer_norm_scale/Adam_1_G1584, training/transformer_moe/decoder/layer_0/att_ende_a/multihead_attention/k/kernel/Adam_G1586, training/transformer_moe/decoder/layer_0/att_ende_a/multihead_attention/k/kernel/Adam_1_G1588, training/transformer_moe/decoder/layer_0/att_ende_a/multihead_attention/output_transform/kernel/Adam_G1590, training/transformer_moe/decoder/layer_0/att_ende_a/multihead_attention/output_transform/kernel/Adam_1_G1592, training/transformer_moe/decoder/layer_0/att_ende_a/multihead_attention/q/kernel/Adam_G1594, training/transformer_moe/decoder/layer_0/att_ende_a/multihead_attention/q/kernel/Adam_1_G1596, training/transformer_moe/decoder/layer_0/att_ende_a/multihead_attention/v/kernel/Adam_G1598, training/transformer_moe/decoder/layer_0/att_ende_a/multihead_attention/v/kernel/Adam_1_G1600, training/transformer_moe/decoder/layer_0/ff_fc/conv_hidden_relu/conv1_single/bias/Adam_G1602, training/transformer_moe/decoder/layer_0/ff_fc/conv_hidden_relu/conv1_single/bias/Adam_1_G1604, training/transformer_moe/decoder/layer_0/ff_fc/conv_hidden_relu/conv1_single/kernel/Adam_G1606, training/transformer_moe/decoder/layer_0/ff_fc/conv_hidden_relu/conv1_single/kernel/Adam_1_G1608, training/transformer_moe/decoder/layer_0/ff_fc/conv_hidden_relu/conv2_single/bias/Adam_G1610, training/transformer_moe/decoder/layer_0/ff_fc/conv_hidden_relu/conv2_single/bias/Adam_1_G1612, training/transformer_moe/decoder/layer_0/ff_fc/conv_hidden_relu/conv2_single/kernel/Adam_G1614, training/transformer_moe/decoder/layer_0/ff_fc/conv_hidden_relu/conv2_single/kernel/Adam_1_G1616, training/transformer_moe/decoder/layer_0/ff_fc/layer_prepostprocess/layer_norm/layer_norm_bias/Adam_G1618, training/transformer_moe/decoder/layer_0/ff_fc/layer_prepostprocess/layer_norm/layer_norm_bias/Adam_1_G1620, training/transformer_moe/decoder/layer_0/ff_fc/layer_prepostprocess/layer_norm/layer_norm_scale/Adam_G1622, training/transformer_moe/decoder/layer_0/ff_fc/layer_prepostprocess/layer_norm/layer_norm_scale/Adam_1_G1624, training/transformer_moe/decoder/layer_0/self_att_a/layer_prepostprocess/layer_norm/layer_norm_bias/Adam_G1626, training/transf ... [truncated] [[Node: save/Identity_S7817 = _HostRecv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:CPU:0", send_device="/job:ps/replica:0/task:0/device:CPU:0", send_device_incarnation=2331687198925880554, tensor_name="edge_2323_save/Identity", tensor_type=DT_STRING, _device="/job:worker/replica:0/task:0/device:CPU:0"]()]]

Every time I train, I always point the output dir to the same directory. Is that the reason?

srinivas-varadharajan commented 5 years ago

I'm running into the same issue as well. Has anyone found a solution?

rsepassi commented 5 years ago

Hi all, could somebody please try exactly the commands in the documentation with the directories on GCS or NFS? It’s hard to debug any of the reported issues when exact env vars, directories, and commands are not provided. Let’s first establish whether the documentation commands work. I tried them myself on GCP and all worked fine.

vsuthichai commented 5 years ago

@rsepassi I ended up integrating the code with Horovod (a rough sketch of that setup is after the PS script below). It seems to work well on a single node, and I'm hoping the scaling efficiency is decent when I move to more nodes. However, I've abandoned the parameter-server approach for now, since it was difficult to get up and running and it didn't scale well once I had it working. I'll provide the scripts I used to start the chief and the PS; hopefully they help. I did get this to work, but only with the extra fix to start the server on the chief: https://gist.github.com/jinliangwei/eed8fd564a35deae3b892092f3171866. Without it I didn't have any luck. Additionally, if I specify "master" instead of "chief" for the job within the TF_CONFIG env var, it doesn't work for me and the master worker just hangs. This is from following the distributed training documentation exactly.

Chief

#!/bin/bash

TS=`date +%Y%m%d-%H%M%S`
PROBLEM=translate_ende_wmt32k2014
MODEL=transformer
#HPARAMS=transformer_big_single_gpu
HPARAMS=transformer_big
NUM_GPUS=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

T2T_USR_DIR=$HOME/benchmarks/scripts/register_wmt14
DATA_DIR=$HOME/benchmarks/scripts/wmt14_t2t
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=/myEFSvolume/t2t_train
#TRAIN_DIR=$HOME/benchmarks/scripts/t2t_train/$PROBLEM/$MODEL-$HPARAMS
#TRAIN_DIR=$HOME/benchmarks/scripts/t2t_train

#export GRPC_TRACE=api,channel,client_channel,server_channel
#export GRPC_VERBOSITY=DEBUG
export TF_CONFIG='{"cluster": {"ps": ["10.0.1.62:8000","10.0.1.184:8000"], "chief": ["10.0.0.205:8000"]}, "task": {"type": "chief", "index": 0}}'
rm -rf $TRAIN_DIR
mkdir -p $TRAIN_DIR

touch $TRAIN_DIR/start_time.txt

# Train
# *  If you run out of memory, add --hparams='batch_size=1024'.
PYTHONFAULTHANDLER=true t2t-trainer \
  --t2t_usr_dir=$T2T_USR_DIR \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=4732 \
  --master=grpc://10.0.0.205:8000 --ps_replicas=2 --worker_replicas=1 --worker_gpu=2 --worker_id=0 --ps_gpu=8 --sync --schedule=train --worker_job='/job:chief'

PS 1 & 2 (same script for both PS except the task index is either 0 or 1)

#!/bin/bash

#set -x

PROBLEM=translate_ende_wmt32k2014
MODEL=transformer
#HPARAMS=transformer_base_single_gpu
#HPARAMS=transformer_big_single_gpu
HPARAMS=transformer_big
#HPARAMS=transformer_base
NUM_GPUS=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

DATA_DIR=$HOME/benchmarks/scripts/wmt14_t2t
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=/myEFSvolume/t2t_train

#export GRPC_TRACE=api,channel,client_channel,server_channel
#export GRPC_VERBOSITY=DEBUG
export TF_CONFIG='{"cluster": {"ps": ["10.0.1.62:8000","10.0.1.184:8000"], "chief": ["10.0.0.205:8000"]}, "task": {"type": "ps", "index": 0}, "environment": "cloud"}'

# Train
# *  If you run out of memory, add --hparams='batch_size=1024'.
t2t-trainer \
  --schedule=run_std_server
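
And a rough shape of the Horovod path mentioned at the top of this comment (a sketch assuming Horovod's standard TF 1.x API, not my exact integration):

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Pin each process to one GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged with allreduce.
opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(1e-3 * hvd.size()))

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# ... build the model, then train with something like:
# with tf.train.MonitoredTrainingSession(
#         checkpoint_dir=train_dir if hvd.rank() == 0 else None,
#         config=config, hooks=hooks) as sess:
#     ...
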
srinivas-varadharajan commented 5 years ago

@rsepassi

I'm trying to train it on 3 CPU nodes (40 cores each).

OS:

pip freeze | grep tensor
tensor2tensor==1.8.0
tensorboard==1.9.0
tensorflow==1.9.0

Environment:

DATA_DIR=$HOME/t2t_experiments/t2t_data
TRAIN_DIR=$HOME/machine_translation/experiments/multi_node

Master:

export TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["skl044:2222", "skl042:2222"], "master": ["skl041:2222"]}, "task": {"index": 0, "type": "master"}}'

t2t-trainer \
  --data_dir=$DATA_DIR \
  --master=grpc://skl041:2222 \
  --ps_replicas=2 \
  --worker_replicas=1 \
  --worker_gpu=1 \
  --worker_id=0 \
  --ps_gpu=1 \
  --sync \
  --schedule=train \
  --worker_job='/job:master' \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k \
  --output_dir=$TRAIN_DIR \
  --train_steps=50000 \
  --hparams='batch_size=4096'

PS:

export TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["skl044:2222", "skl042:2222"], "master": ["skl041:2222"]}, "task": {"index": 0, "type": "ps"}}'

t2t-trainer \
  --data_dir=$DATA_DIR \
  --master=grpc://skl041:2222 \
  --schedule=run_std_server \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k \
  --output_dir=$TRAIN_DIR \
  --train_steps=50000 \
  --hparams='batch_size=4096'

export TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["skl044:2222", "skl042:2222"], "master": ["skl041:2222"]}, "task": {"index": 1, "type": "ps"}}'

t2t-trainer \
  --data_dir=$DATA_DIR \
  --master=grpc://skl041:2222 \
  --schedule=run_std_server \
  --model=transformer \
  --hparams_set=transformer_base \
  --problem=translate_ende_wmt32k \
  --output_dir=$TRAIN_DIR \
  --train_steps=50000 \
  --hparams='batch_size=4096'

Notes:

  1. I've mentioned node names instead of IPs here; it works the same anyway.

  2. I'm using NFS. I followed the documentation, but I'm unsure if I have to change any parameters since I'm not using the cloud.

srinivas-varadharajan commented 5 years ago

@rsepassi

I've followed the documentation exactly. Training gets stuck after "Graph was finalized". My previous comment has the environment variables, directories, and commands that were run.

I have a few questions:

  1. Since I'm using only CPUs, I'm not sure whether I should modify parameters that are autogenerated by t2t-make-tf-configs, e.g. ps_gpu and worker_gpu.
  2. Should "environment": "cloud" be changed to something else since I'm using NFS?
  3. After setting the environment variables export GRPC_TRACE=all and export GRPC_VERBOSITY=DEBUG, I got the logs below. From the logs it looks like the master is trying to connect to itself (10.144.1.41 is the IP of the master), and I didn't find any 'binding' or 'listen' when I grepped the logs.
  4. I'm stuck here. Any direction on how to proceed would be really helpful. :)

Part of the logs on Master:

D0829 16:39:32.097960309 147740 pick_first.cc:399] Pick First 0x2aabac001070 connectivity changed for subchannel 0x2aabac001bf0 (0 of 1), subchannel_list 0x2aabac000a00: state=TRANSIENTF->shutdown=0 sd->subchannel_list->shutting_down=0 error={"created":"@1535578772.097786461","description":"Connect Failed","file":"external/grpc/src/core/ext/filters/client_channel/subchannel.cc","e":641,"grpc_status":14,"referenced_errors":[{"created":"@1535578772.097710816","description":"Failed to connect to remote host: OS Error","errno":111,"file":"external/grpc/src/core/lib/iomgr/tcp_csix.cc","file_line":201,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:10.144.1.41:2222"}]}

Part of the logs from one of the workers (both workers have similar messages). The set of logs below keeps repeating:

D0829 16:39:03.511618190 381245 timer_generic.cc:665] TIMER CHECK END: r=1; next=676304
D0829 16:39:03.511629226 381245 timer_manager.cc:175] sleep for a 1001 milliseconds
D0829 16:39:04.514748296 381245 timer_manager.cc:191] wait ended: was_timed:1 kicked:0
D0829 16:39:04.514785613 381245 timer_generic.cc:647] TIMER CHECK BEGIN: now=676306 next=9223372036854775807 tls_min=6
D0829 16:39:04.514813166 381245 timer_generic.cc:566] .. shard[0]->min_deadline = 676304
D0829 16:39:04.514816408 381245 timer_generic.cc:503] .. shard[0]: heap_empty=true
D0829 16:39:04.514819375 381245 timer_generic.cc:478] .. shard[0]->queue_deadline_cap --> 677306
D0829 16:39:04.514822344 381245 timer_generic.cc:543] .. shard[0] popped 0
D0829 16:39:04.514825587 381245 timer_generic.cc:583] .. result --> 1, shard[0]->min_deadline 676304 --> 677307, now
D0829 16:39:04.514829121 381245 timer_generic.cc:665] TIMER CHECK END: r=1; next=677307
D0829 16:39:04.514832113 381245 timer_manager.cc:175] sleep for a 1001 milliseconds
D0829 16:39:05.517747827 381245 timer_manager.cc:191] wait ended: was_timed:1 kicked:0
D0829 16:39:05.517785775 381245 timer_generic.cc:647] TIMER CHECK BEGIN: now=677309 next=9223372036854775807 tls_min=6

Note: All the CPU nodes can communicate with each other on any port. I pinged from master to worker and from worker to master, and packets are received fine.

srinivas-varadharajan commented 5 years ago

@lukaszkaiser @rsepassi Did you get a chance to check this? Any direction to proceed would be helpful. I'm stuck at this.

rsepassi commented 5 years ago

We don’t currently have the bandwidth to dig into this but we highly recommend training on a single machine with 8 GPUs or a single Cloud TPU. That works well for most use cases. Hopefully this will all be simplified when we move to DistributionStrategy.

cwbeitel commented 5 years ago

@Mack-y @billy19murahmi @vsuthichai @1nsunym So I was observing the same issue, with the master hanging after initializing the graph, but only after recently updating to HEAD; the problem was resolved by reverting to 49e279eb6c871fbebc137d6f598758a275f521c3. Maybe that will be a temporary fix until this DistributionStrategy thing rolls out.

But yeah, as Ryan says, you might save a lot of effort and do just fine with 8x GPU or a TPU.

Mack-y commented 5 years ago

Before running the training across two or three machines, I had tried a single node with 4 GPUs, which worked well. So a single machine doesn't seem to have the problems that distributed training does.

etragas-fathom commented 5 years ago

Hey @rsepassi, you mentioned t2t moving towards DistributionStrategy. Are there any updates on plans for the repo? I've seen some stuff re MirroredStrategy in the code already, but haven't seen anything regarding multi-worker training.

fengrussell commented 5 years ago

I modified the tensor2tensor source so that it now supports Horovod; git link here.