[Open] maoxuepeng opened this issue 5 years ago
Sorry, we are aware that run_distributed has some problems, but we don't have the resources to fix it at the moment. If someone is able to create a pull request, that would be greatly appreciated; otherwise, we will take a look when we have time.
Thank you for your reply. I'll ignore run_distributed then. Actually, this is a problem in "lingvo/trainer.py", right?
Could you please let me know how to set the arguments below in my case?
--controller_gpus: Number of controller GPUs.
(default: '0')
(an integer)
--worker_gpus: Number of GPUs to use per replica.
(default: '0')
(an integer)
--worker_replicas: Number of replicas.
(default: '1')
(an integer)
--worker_split_size: Number of devices for one split.
(default: '1')
(an integer)
If you have 3 workers with 8 GPUs each, you should set worker_gpus=8 and worker_replicas=3 and leave the rest at their defaults.
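For concreteness, a minimal sketch of how those flag values might look on the trainer command line (this is only an illustration; the remaining flags depend on your own setup and are omitted here):

    # Sketch only: 3 training worker machines with 8 GPUs each.
    # Remaining flags (--cluster_spec, --job, --model, --logdir, etc.) as in your normal launch.
    bazel-bin/lingvo/trainer \
      --worker_replicas=3 \
      --worker_gpus=8 \
      --worker_split_size=1 \
      --controller_gpus=0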
@jonathanasdf What is the meaning of worker_gpus and worker_replicas here? Can you share some references to understand this better?
It needs to match your physical cluster setup. worker_replicas is the number of training worker jobs you are running; worker_gpus is the number of GPUs each training worker job uses.
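In other words (my reading of the answer above, not verified against trainer.py): the host list under worker= in --cluster_spec should line up with --worker_replicas, and --worker_gpus is the number of GPUs each of those hosts actually uses. A hypothetical 3-host, 8-GPU-per-host example (host names are placeholders):

    --cluster_spec=controller=ctrl:2222@trainer_client=client:2222@worker=hostA:2222,hostB:2222,hostC:2222
    --worker_replicas=3   # one per worker host/job
    --worker_gpus=8       # GPUs used by each worker job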
sync mode
Server info:
  controller: server0:2222, 8 GPUs
  trainer_client: server1:2222
  worker0: server2:2222, 8 GPUs
  worker1: server3:2222, 8 GPUs
  worker2: server4:2222, 8 GPUs
Following the settings from "run_distributed.py", it doesn't work:

server0# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=controller --task=0 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server1# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=trainer_client --task=0 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server2# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=0 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server3# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=1 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log

server4# bazel-bin/lingvo/trainer --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 --job=worker --task=2 --mode=sync --logtostderr --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log
It seems the training job runs on worker0 only, and GPU utilization is very low. Could you please let me know how to set the arguments listed above (--controller_gpus, --worker_gpus, --worker_replicas, --worker_split_size) in my case?
Thanks a billion!
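Putting the two answers together, a hedged sketch of how the worker launch above might look with the GPU/replica flags added (worker_replicas=3 and worker_gpus=8 for this 3-machine, 8-GPU-each layout; the same two flags would presumably be passed to the controller and trainer_client jobs as well; I have not verified this against trainer.py, so treat it only as a starting point):

    # server2 (worker task 0); server3 and server4 are identical except --task=1 and --task=2
    bazel-bin/lingvo/trainer \
      --cluster_spec=controller=server0:2222@trainer_client=server1:2222@worker=server2:2222,server3:2222,server4:2222 \
      --job=worker --task=0 \
      --worker_replicas=3 --worker_gpus=8 \
      --mode=sync --logtostderr \
      --model=asr.librispeech.Librispeech960Base --logdir=/tmp/sharedfs/log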