tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Hanging when running transformer in distributed setting #926

Open jinliangwei opened 6 years ago

jinliangwei commented 6 years ago

Description

I am trying to run the transformer model with 1 worker and 1 ps in async mode. The program hangs after printing INFO:tensorflow:Graph was finalized.

Environment information

OS: Ubuntu 16.04

$ pip freeze | grep tensor
 -e git+https://github.com/tensorflow/tensor2tensor.git@342e214dea360a7f472fc82f3dd0775d7e224c52#egg=tensor2tensor
tensorboard==1.8.0
tensorflow==1.8.0
tensorflow-tensorboard==1.5.0
$ python -V
 Python 3.5.2

For bugs: reproduction and error logs

# Steps to reproduce:
# command to start worker:
TF_CONFIG='{"task": {"type": "master", "index": 0}, "environment": "cloud", "cluster": {"ps": ["10.117.1.30:12000"], "master": ["10.117.1.30:11000"]}}' ./tensor2tensor/bin/t2t-trainer --hparams_set=transformer_base --model=transformer --output_dir=$OUTPUT_DIR --tmp_dir=$TMP_DIR --data_dir=$DATA_DIR/translate_ende_wmt32k --problem=translate_ende_wmt32k --worker_replicas=1 --ps_gpu=0 --worker_job=/job:master --master=grpc://10.117.1.30:11000 --schedule=train --ps_replicas=1 --worker_gpu=1 --worker_id=0

# command to start ps:
TF_CONFIG='{"task": {"type": "ps", "index": 0}, "environment": "cloud", "cluster": {"ps": ["10.117.1.30:12000"], "master": ["10.117.1.30:11000"]}}' CUDA_VISIBLE_DEVICES='' ./tensor2tensor/bin/t2t-trainer --hparams_set=transformer_base --model=transformer --output_dir=$OUTPUT_DIR --tmp_dir=$TMP_DIR --data_dir=$DATA_DIR/translate_ende_wmt32k --problem=translate_ende_wmt32k --master=grpc://10.117.1.30:12000 --schedule=run_std_server
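
For reference, here is a small sketch (not part of t2t) of what the TF_CONFIG value above encodes; tf.estimator's RunConfig reads the same JSON structure from the environment at startup:

import json
import os

# Parse the TF_CONFIG environment variable set in the commands above.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf_config.get("cluster", {})  # e.g. {"ps": ["10.117.1.30:12000"], "master": ["10.117.1.30:11000"]}
task = tf_config.get("task", {})        # e.g. {"type": "master", "index": 0}
print(cluster, task.get("type"), task.get("index"))
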
# Error logs (the last couple of lines):
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_33945_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_33945_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_33945_512.top
INFO:tensorflow:Base learning rate: 2.000000
INFO:tensorflow:Trainable Variables Total size: 61499904
INFO:tensorflow:Using optimizer Adam
/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.

I think I've located where the problem happens: the worker makes an RPC call to CreateSession (in grpc_remote_master.cc), but the call is never handled by any RPC server.

Prior to that, the worker process created a gRPC channel to address 10.117.1.30:11000 (the master address for this worker process), which was used to create the master gRPC stub. However, no gRPC server was ever created to listen on this address.

So my question is: shouldn't the worker have created a gRPC master service listening on 10.117.1.30:11000? It does not appear to do so.
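
A quick way to confirm this (a diagnostic sketch, not part of t2t; the address is the master address from the commands above) is to check whether anything accepts connections on that port while the worker hangs:

import socket

# connect_ex returns 0 if something is listening on the master address,
# and an errno (e.g. connection refused) otherwise.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2)
print(sock.connect_ex(("10.117.1.30", 11000)))
sock.close()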

jinliangwei commented 6 years ago

Update:

A tensorflow.train.Server object is created only if the schedule is "run_std_server". I modified the code to create a server object for workers in distributed training. The program doesn't hang anymore and training seems normal now.

It would be great if someone could comment on whether this is a proper fix.

libeineu commented 6 years ago

Hi, I ran into a problem when using four machines to train a t2t model in a distributed setting on an MT task. There is 1 master (to update the parameters) and 3 ps (to compute the gradients), and each machine has 8 GPUs. All 4 machines use the same port 5000 to communicate, but the training speed is too slow: 200 secs per 100 steps (a single machine takes 55 secs). I think there must be something wrong, but I have no idea what. Should I give each machine a different port number?

Mack-y commented 6 years ago

@jinliangwei Hello, I met the same problem. Could you tell me how you solved it? Thanks.

jinliangwei commented 6 years ago

@Mack-y Hi, sorry for the late reply. Basically, in trainer_lib.py, I created a tf.train.Server in create_experiment() if it's distributed training and the schedule is not run_std_server. That's all.

Code to create server:

def create_tf_server(config):
  # Build and immediately start a tf.train.Server from the RunConfig's
  # cluster spec, so this process serves gRPC requests (start=True).
  server = tf.train.Server(
    config.cluster_spec,
    job_name=config.task_type,
    task_index=config.task_id,
    config=config.tf_config,
    start=True)
  return server
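
For context, a rough sketch of how this helper could be wired into create_experiment() (not the exact patch; run_config and schedule stand for the arguments that function already receives, and the full diff is linked in a later comment):

# Sketch only: start a gRPC server for this task when training is distributed
# and the schedule is anything other than run_std_server.
if run_config.cluster_spec and schedule != "run_std_server":
  create_tf_server(run_config)  # start=True above, so it begins listening immediately
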
harvey1994 commented 6 years ago

@jinliangwei Please give more details about the tensor2tensor distributed training solution involving create_experiment() and tf.train.Server.

jinliangwei commented 6 years ago

@harvey1994 See here for my patch that gets distributed tensor2tensor working: https://gist.github.com/jinliangwei/eed8fd564a35deae3b892092f3171866

jlewi commented 5 years ago

I ran into similar problems. I also found that with T2T 1.10.0, if you try to set schedule to run_std_server, it will crash; details here: https://github.com/kubeflow/examples/issues/208#issuecomment-436720653

Here's some info on how I worked around this: https://github.com/kubeflow/examples/issues/208#issuecomment-436846074

upwindflys commented 5 years ago

@jinliangwei Thanks for your solution, but I encountered an OOM error even though there is enough free memory. TensorFlow and tensor2tensor are both at 1.8. Can you offer some suggestions?

jinliangwei commented 5 years ago

@upwindflys One possible cause is that you are running on GPUs but your T2T model is using float16 instead of float32 (check hparams like activation_dtype). Not all TensorFlow operations have a float16 implementation on GPU and those operations are allocated on a special device called XLA_GPU, which has little memory.
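
For example, a quick way to check that hparam (just a sketch; it assumes the transformer_base hparams set from the original report and that this T2T version defines activation_dtype):

from tensor2tensor.models import transformer

hparams = transformer.transformer_base()
# Expect "float32" on GPU; "float16"/"bfloat16" activations can push some ops onto XLA_GPU.
print(getattr(hparams, "activation_dtype", "not defined in this T2T version"))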

upwindflys commented 5 years ago

@jinliangwei Thanks, but that doesn't seem to be the problem. Thanks anyway.

upwindflys commented 5 years ago

@jinliangwei Sorry to ask you a question again: what kind of physical environment do you use, e.g. which cuDNN and CUDA versions?

outstandingcandy commented 5 years ago

> Update:
>
> A tensorflow.train.Server object is created only if the schedule is "run_std_server". I modified the code to create a server object for workers in distributed training. The program doesn't hang anymore and training seems normal now.
>
> It would be great if someone could comment on whether this is a proper fix.

I met the same problem in the current version of tensor2tensor and fixed it with @jinliangwei's method. Could the tensor2tensor team fix it officially?

colmantse commented 5 years ago

I met the same problem as well. Interestingly though, the estimator is supposed to handle start_std_server in tf.estimator.training, but for some reason it is not able to in the current t2t.

jiahuigeng commented 4 years ago

> @harvey1994 See here for my patch that gets distributed tensor2tensor working: https://gist.github.com/jinliangwei/eed8fd564a35deae3b892092f3171866