fanlu opened this issue 6 years ago
On which OS are you training your model? I have almost the same output in my logs. My concern is about the
TF GPU device with id 0 was not registered
message. I'm not sure, but this message can probably be ignored; I have seen a lot of logs containing it on GitHub and on Slack.
About your first question: it is normal that 3 sessions are started when using 2 GPUs. There is one for each GPU worker and one for the parameter server that holds the parameters.
About the second question: I will check it out when I have the time
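For readers less familiar with distributed TF 1.x, a minimal sketch of such a setup, with hypothetical hosts and ports rather than Nabu's actual cluster file, could look like this:

import tensorflow as tf  # TF 1.x

# Sketch only: training on 2 GPUs corresponds to a cluster of 2 workers
# plus 1 parameter server, matching the three sessions described above.
cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})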
@AzizCode92 I use Docker on CentOS 7.4, and I also see
TF GPU device with id 0 was not registered
@vrenkens Counting the ps as well, the number is 4:
2018-08-21 08:29:04.230358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-08-21 08:29:04.312654: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 0314965d18d48a0e with config:
2018-08-21 08:29:04.314325: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.314370: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.317363: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.317405: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.703447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 21557 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-08-21 08:29:04.703531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 21557 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:03:00.0, compute capability: 6.1)
2018-08-21 08:29:05.458095: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.458151: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.462393: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.462629: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.470015: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.470082: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-08-21 08:29:10.788696: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 843a2ba3e9062528 with config: gpu_options { allow_growth: true } allow_soft_placement: true
2018-08-21 08:29:12.074345: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session b7a26beb96e7ca5a with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 0/15600 loss: 3.883478, learning rate: 0.001000
time elapsed: 7.911287 sec
peak memory usage: 22/22604 MB
WORKER 0: step 2/15600 loss: 3.766951, learning rate: 0.001000
time elapsed: 1.964389 sec
peak memory usage: 731/22604 MB
WORKER 0: step 4/15600 loss: 3.677392, learning rate: 0.000999
time elapsed: 1.568714 sec
peak memory usage: 884/22604 MB
WORKER 0: step 6/15600 loss: 3.674108, learning rate: 0.000999
time elapsed: 2.856795 sec
peak memory usage: 884/22604 MB
WORKER 0: step 8/15600 loss: 3.673018, learning rate: 0.000999
time elapsed: 1.789929 sec
peak memory usage: 884/22604 MB
WORKER 0: step 10/15600 loss: 3.652077, learning rate: 0.000999
time elapsed: 1.303749 sec
peak memory usage: 884/22604 MB
WORKER 0: step 12/15600 loss: 3.645280, learning rate: 0.000998
time elapsed: 1.514070 sec
peak memory usage: 884/22604 MB
WORKER 0: step 14/15600 loss: 3.627352, learning rate: 0.000998
time elapsed: 1.308973 sec
peak memory usage: 884/22604 MB
WORKER 0: step 16/15600 loss: 3.628558, learning rate: 0.000998
time elapsed: 1.871332 sec
peak memory usage: 884/22604 MB
2018-08-21 08:29:41.694464: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session c98110034232ee37 with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 18/15600 loss: 3.619331, learning rate: 0.000997
time elapsed: 1.154132 sec
peak memory usage: 884/22604 MB
WORKER 0: step 20/15600 loss: 3.636102, learning rate: 0.000997
time elapsed: 1.139632 sec
peak memory usage: 884/22604 MB
@vrenkens When I use 3 GPUs, I find the number is 5, so I guess that, apart from the ps and the chief worker, a master session is started twice for each GPU on the other workers.
Can you please share the output of
ps aux | grep python
@AzizCode92
root 3130930 3130927 0 16:28 pts/20 00:00:04 python nabu/scripts/prepare_train.py --recipe=config/recipes/LAS/TIMIT --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2 --mode=single_machine
root 3130987 3130930 22 16:28 pts/20 00:02:18 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=ps --task_index=0 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
root 3130988 3130930 99 16:28 pts/20 00:34:05 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=worker --task_index=0 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
root 3130989 3130930 99 16:28 pts/20 00:32:22 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=worker --task_index=1 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
I doubt that this is the case. I have only tested it with 2 workers and 1 ps, because I don't have bigger machines to play with. The overview of your jobs looks normal. From your first post it seems that both workers are working, no?
See here: task_index=0 is shared between the worker and the ps. I thought that the ps would be working on a different task.
@AzizCode92 The job name is different. Task indices only need to be unique within a job, so different job names can use the same index.
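As an illustration of that point, the sketch below (hypothetical hosts and ports, not Nabu's code) starts a ps task and a worker task that both use task_index=0 under different job names:

import tensorflow as tf  # TF 1.x

cluster = tf.train.ClusterSpec({
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})
# The same index is valid in both jobs because indices are scoped per job name.
ps_server = tf.train.Server(cluster, job_name="ps", task_index=0)
worker_server = tf.train.Server(cluster, job_name="worker", task_index=0)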
@vrenkens Yes. When the first two master sessions start, only one worker is working; when the last master session starts, the second GPU starts working.
That is pretty normal behavior; it can take a while before all workers are online.
Yes. So please help me find the cause of the error where the sync replicas optimizer can't apply gradients. Also, why do you use a separate increment_step op? I think it's odd. How about tf.train.get_or_create_global_step()?
I will take a look when I have the time
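For reference, the alternative fanlu suggests would look roughly like this in TF 1.x; this is a sketch, not Nabu's actual trainer code, and loss is assumed to be defined elsewhere:

import tensorflow as tf  # TF 1.x

# Create (or fetch) the global step once and let apply_gradients
# increment it, instead of running a separate increment op.
global_step = tf.train.get_or_create_global_step()
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)  # placeholder optimizer
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)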
Hi, everyone:
On the TF website, SyncReplicasOptimizer has a sync_replicas_hook that should be managed by a MonitoredTrainingSession.
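The pattern from the TF documentation that AzizCode92 refers to looks roughly like the following sketch; loss, server, is_chief, the base optimizer and the replica counts are placeholders rather than Nabu's code:

import tensorflow as tf  # TF 1.x

opt = tf.train.SyncReplicasOptimizer(
    tf.train.AdamOptimizer(1e-3),
    replicas_to_aggregate=2,
    total_num_replicas=2)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# The hook initializes the token queue; it must be passed to the session,
# and the chief is the one that runs the initialization.
sync_replicas_hook = opt.make_session_run_hook(is_chief)

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=is_chief,
                                       hooks=[sync_replicas_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op)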
Hi Vincent, I have another problem. I use GPUs 0 and 1 for training, with numbatches_to_aggregate=0 in the default config standardtrainer.cfg, but I see 3 "Start master session" lines in the log. Is this behavior right?
In contrast, when I set numbatches_to_aggregate=2 to use the sync replicas optimizer, there is an error message like
So I added the global_step parameter to the apply_gradients op in the _update function and started training, but no training log is printed anymore. How should I set global_step on the apply_gradients op? @vrenkens
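For what it's worth, passing the global step to a SyncReplicasOptimizer in plain TF 1.x looks like the sketch below; sync_optimizer and grads_and_vars are placeholders, not Nabu's _update internals. If training then hangs with no log output, one common cause is that the sync replicas hook mentioned above is never run, so the workers block waiting for tokens.

# Sketch only: SyncReplicasOptimizer.apply_gradients needs the global
# step tensor so it can track gradient staleness.
global_step = tf.train.get_or_create_global_step()
apply_gradients_op = sync_optimizer.apply_gradients(grads_and_vars,
                                                    global_step=global_step)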