tensorflow / lingvo

A problem of distributed training #134

Open maoxuepeng opened 5 years ago

maoxuepeng commented 5 years ago

Mode: sync

Server info:
- controller: server0:2222, 8 GPUs
- trainer_client: server1:2222
- worker0: server2:2222, 8 GPUs
- worker1: server3:2222, 8 GPUs
- worker2: server4:2222, 8 GPUs

Following the settings in "run_distributed.py", it doesn't work:

server0# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=controller --task=0 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server1# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=trainer_client --task=0 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server2# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=0 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server3# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=1 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server4# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=2 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

It seems the training job runs on worker0 only, and GPU utilization is very low. Could you please let me know how to set these arguments in my case?

--controller_gpus: Number of controller GPUs. (default: '0') (an integer)
--worker_gpus: Number of GPUs to use per replica. (default: '0') (an integer)
--worker_replicas: Number of replicas. (default: '1') (an integer)
--worker_split_size: Number of devices for one split. (default: '1') (an integer)

Thanks a billion!

jonathanasdf commented 5 years ago

Sorry, we are aware that run_distributed has some problems, but we don't have the resources to fix it at the moment. If someone is able to create a pull request, that would be greatly appreciated; otherwise we will take a look when we have time.

maoxuepeng commented 5 years ago

> Sorry, we are aware that run_distributed has some problems, but we don't have the resources to fix it at the moment. If someone is able to create a pull request, that would be greatly appreciated; otherwise we will take a look when we have time.

Thank you for your reply. Let's ignore run_distributed. Actually, this is a problem with "lingvo/trainer.py", right?
Could you please let me know how to set these arguments in my case?

--controller_gpus: Number of controller GPUs. (default: '0') (an integer)
--worker_gpus: Number of GPUs to use per replica. (default: '0') (an integer)
--worker_replicas: Number of replicas. (default: '1') (an integer)
--worker_split_size: Number of devices for one split. (default: '1') (an integer)

jonathanasdf commented 5 years ago

If you have 3 workers with 8 GPUs each, you should set worker_gpus=8 and worker_replicas=3, and leave the rest at their defaults.
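
Applied to the worker commands already posted in this issue, a sketch of the launches with those two flags added might look like the following (only an illustration of where the flags go; whether the controller and trainer_client invocations also need them is not confirmed here):

server2# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=0 --mode=sync --worker_gpus=8 --worker_replicas=3 --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log
server3# same command with --job=worker --task=1
server4# same command with --job=worker --task=2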

manish-kumar-garg commented 4 years ago

@jonathanasdf What is the meaning of worker_gpus and worker_replicas here? Can you share some references to understand this better?

jonathanasdf commented 4 years ago

It needs to match your physical cluster setup: worker_replicas is the number of training worker jobs you are running, and worker_gpus is the number of GPUs each training worker job uses.
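
As a rough illustration, assuming the cluster described in this issue (3 worker hosts with 8 GPUs each):

worker=server2:2222,server3:2222,server4:2222   -> 3 worker jobs in --cluster_spec
--worker_replicas=3                             -> one replica per worker job
--worker_gpus=8                                 -> GPUs used by each worker job
total training GPUs = 3 x 8 = 24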