jinliangwei opened this issue 6 years ago
Update:
Currently a tf.train.Server object is created only if the schedule is "run_std_server". I modified the code to also create a server object for workers in distributed training. The program doesn't hang anymore and training seems normal now.
It would be great if someone could comment on whether this is a proper fix.
Hi, I ran into a problem when using four machines for distributed training of a t2t model on an MT task. There is 1 master (to update the parameters) and 3 ps (to compute the gradients), and each machine has 8 GPUs. All 4 machines use the same port 5000 to communicate, but training is very slow: 200 secs per 100 steps (a single machine takes 55 secs). I think something must be wrong, but I have no idea what. Should I set a different port number on each machine?
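For reference, here is a rough sketch of the cluster layout in TF_CONFIG form (the addresses are placeholders, not my actual hosts). Reusing port 5000 on different machines gives distinct host:port pairs, so the port itself should not be the issue:

import json
import os

# Placeholder addresses; what matters is that each host:port pair is unique.
cluster = {
    "master": ["10.0.0.1:5000"],
    "ps": ["10.0.0.2:5000", "10.0.0.3:5000", "10.0.0.4:5000"],
}

# Each machine exports its own task entry before launching t2t-trainer.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "master", "index": 0},
})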
@jinliangwei Hello, I ran into the same problem. Could you tell me how you solved it? Thanks.
@Mack-y Hi, sorry for the late reply. Basically, in trainer_lib.py, I created a tf.train.Server in create_experiment() if it's distributed training and the schedule is not run_std_server. That's all.
Code to create the server:

import tensorflow as tf

def create_tf_server(config):
  # Start an in-process gRPC server for this task so that the session created
  # later has a master service to connect to.
  server = tf.train.Server(
      config.cluster_spec,
      job_name=config.task_type,
      task_index=config.task_id,
      config=config.tf_config,
      start=True)
  return server
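Roughly, the call site inside create_experiment() looks like the sketch below. This is not the literal diff (the exact patch is in the gist linked later in this thread), and it assumes the RunConfig exposes the fields used above:

def create_experiment(run_config, hparams, schedule, *args, **kwargs):
  # Start a gRPC server for the master/worker task; ps tasks keep using
  # schedule="run_std_server" and do not need this.
  if run_config.cluster_spec and schedule != "run_std_server":
    create_tf_server(run_config)
  # ... build the Estimator, input functions and train/eval specs as before ...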
@jinliangwei please give more details about the tensor2tensor distributed training solutions concerning create_experiment() and tf.train.server
@harvey1994 See here for my patch that gets distributed tensor2tensor working: https://gist.github.com/jinliangwei/eed8fd564a35deae3b892092f3171866
I ran into similar problems. I also found that with T2T 1.10.0, setting the schedule to run_std_server makes it crash; details here: https://github.com/kubeflow/examples/issues/208#issuecomment-436720653
Here's some info on how I worked around this: https://github.com/kubeflow/examples/issues/208#issuecomment-436846074
@jinliangwei Thanks for your solution, but I ran into an OOM error even though there is plenty of free memory. TensorFlow and tensor2tensor are both 1.8. Can you offer any suggestions?
@upwindflys One possible cause is that you are running on GPUs but your T2T model is using float16 instead of float32 (check hparams such as activation_dtype). Not all TensorFlow operations have a float16 implementation on GPU, and those operations get allocated on a special device called XLA_GPU, which has very little memory.
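If that is the case, forcing float32 usually helps. A minimal sketch, assuming the hparam name activation_dtype from T2T's default hparams (adjust for your hparams_set):

from tensor2tensor.models import transformer

# Check what dtype the preset uses; float16/bfloat16 activations can push
# some ops onto the memory-limited XLA_GPU device.
hparams = transformer.transformer_base()
print(hparams.activation_dtype)

# Force float32, or equivalently pass --hparams=activation_dtype=float32
# on the t2t-trainer command line.
hparams.set_hparam("activation_dtype", "float32")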
@jinliangwei Thanks, but that doesn't seem to be the problem. Thanks anyway.
@jinliangwei Sorry to bother you with another question: what kind of physical environment do you use, e.g. which cuDNN and CUDA versions?
I ran into the same problem with the current version of tensor2tensor and fixed it with @jinliangwei's method. Could the tensor2tensor team fix it officially?
I ran into the same problem as well. Interestingly though, the Estimator is supposed to handle start_std_server in tf.estimator.training, but for some reason it doesn't in current t2t.
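For comparison, on the ps side the run_std_server schedule essentially boils down to starting a gRPC server and blocking on it; a minimal sketch with hypothetical localhost addresses:

import tensorflow as tf

# Hypothetical two-task cluster; in a real setup these are separate machines.
cluster = tf.train.ClusterSpec({
    "master": ["localhost:11000"],
    "ps": ["localhost:11001"],
})

# On the ps machine: start the gRPC server and serve variables forever.
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()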
Description
I am trying to run transformer using 1 worker and 1 ps in async mode. The program hangs after printing INFO:tensorflow:Graph was finalized.
Environment information
For bugs: reproduction and error logs
I think I've located where the problem happens: the worker makes an RPC call to CreateSession (in grpc_remote_master.cc), but the call is not handled by any RPC server.
Prior to that, the worker process created a gRPC channel on address 10.117.1.30:11000 (the master address for this worker process), which was used to create the master gRPC stub. But there was no gRPC server listening on this address.
So my question is: should the worker have created a gRPC master service listening on 10.117.1.30:11000? It didn't.
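To illustrate the point, here is a small self-contained sketch (single-task cluster on localhost for simplicity; in my run the master address was 10.117.1.30:11000). Once a tf.train.Server is running in the process, the grpc:// target handles CreateSession and the session no longer hangs:

import tensorflow as tf

# Single-task cluster on localhost; the server starts a gRPC master service
# on this address.
cluster = tf.train.ClusterSpec({"master": ["localhost:11000"]})
server = tf.train.Server(cluster, job_name="master", task_index=0, start=True)

# With the in-process server running, CreateSession is handled and this
# returns immediately; without a server on the target address it hangs.
with tf.Session(server.target) as sess:
  print(sess.run(tf.constant("CreateSession handled")))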