vrenkens / nabu

Code for end-to-end ASR with neural networks, built with TensorFlow
MIT License

sync optimizer problem #43

Open fanlu opened 6 years ago

fanlu commented 6 years ago

Hi Vincent, I have another problem. I use GPUs 0 and 1 for training and numbatches_to_aggregate=0 in the default standardtrainer.cfg config, but I see 3 "Start master session" lines in the log. Is this behavior correct?
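For reference, the relevant part of my config looks roughly like this (I am assuming the section is called [trainer]; the key itself is the one named above, and as I understand it 0 means no gradient aggregation, i.e. asynchronous updates):

[trainer]
#0 disables gradient aggregation, so each worker updates asynchronously
numbatches_to_aggregate = 0

The log from this run is below.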

2018-08-21 07:20:16.556159: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-08-21 07:20:22.265699: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 6066331584d0d493 with config: gpu_options { allow_growth: true } allow_soft_placement: true
2018-08-21 07:20:22.711547: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 51400be6a5c9e9bc with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 0/15600 loss: 3.890086, learning rate: 0.001000 
         time elapsed: 8.318348 sec
         peak memory usage: 21/22604 MB
WORKER 0: step 1/15600 loss: 3.776481, learning rate: 0.001000 
         time elapsed: 1.858370 sec
         peak memory usage: 688/22604 MB
WORKER 0: step 2/15600 loss: 3.670535, learning rate: 0.001000 
         time elapsed: 1.519805 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 3/15600 loss: 3.695373, learning rate: 0.001000 
         time elapsed: 1.608191 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 4/15600 loss: 3.627747, learning rate: 0.000999 
         time elapsed: 2.627351 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 5/15600 loss: 3.646843, learning rate: 0.000999 
         time elapsed: 1.744121 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 6/15600 loss: 3.629820, learning rate: 0.000999 
         time elapsed: 1.468782 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 7/15600 loss: 3.638589, learning rate: 0.000999 
         time elapsed: 1.179851 sec
         peak memory usage: 893/22604 MB
2018-08-21 07:20:52.899330: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session c5b480c57e96eea5 with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 8/15600 loss: 3.618140, learning rate: 0.000999 
         time elapsed: 1.060105 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 9/15600 loss: 3.596526, learning rate: 0.000999 
         time elapsed: 2.330892 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 10/15600 loss: 3.601762, learning rate: 0.000999 
         time elapsed: 1.179645 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 11/15600 loss: 3.615230, learning rate: 0.000998 
         time elapsed: 0.954928 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 12/15600 loss: 3.602469, learning rate: 0.000998 
         time elapsed: 1.251472 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 13/15600 loss: 3.601032, learning rate: 0.000998 
         time elapsed: 0.886364 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 14/15600 loss: 3.603386, learning rate: 0.000998 
         time elapsed: 1.565970 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 15/15600 loss: 3.622668, learning rate: 0.000998 
         time elapsed: 0.885060 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 16/15600 loss: 3.613049, learning rate: 0.000998 
         time elapsed: 1.336841 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 17/15600 loss: 3.610620, learning rate: 0.000997 
         time elapsed: 0.818247 sec
         peak memory usage: 893/22604 MB
WORKER 0: step 18/15600 loss: 3.618244, learning rate: 0.000997 
         time elapsed: 0.795765 sec
         peak memory usage: 893/22604 MB
WORKER 1: step 17/15600 loss: 3.590606, learning rate: 0.000997 
         time elapsed: 8.500966 sec
         peak memory usage: 893/22604 MB

In contrast, when I set numbatches_to_aggregate=2 to use the sync replicas optimizer, I get an error like:

Traceback (most recent call last):
  File "nabu/scripts/train.py", line 112, in <module>
    testing=False)
  File "nabu/scripts/train.py", line 90, in train
    tr.train(testing)
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 607, in train
    outputs = self._create_graph()
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 187, in _create_graph
    cluster=cluster)
  File "/opt/cephfs1/asr/users/fanlu/mfs/nabu/nabu/neuralnetworks/trainers/trainer.py", line 569, in _update
    name='apply_gradients')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 238, in apply_gradients
    raise ValueError("Global step is required to check staleness")
ValueError: Global step is required to check staleness

So I added the global_step parameter to the apply_gradients call in the _update function:

# operation to apply the gradients
apply_gradients_op = optimizer.apply_gradients(
    grads_and_vars=grads_and_vars,
    name='apply_gradients',
    global_step=global_step)

and started training again, but now no training log is printed anymore. How should global_step be passed to apply_gradients_op? @vrenkens
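For completeness, this is the pattern I understand from the TF docs for passing a global step to SyncReplicasOptimizer; the toy loss and the variable names are mine, not the ones used in trainer.py:

import tensorflow as tf

# toy variable and loss, only to make the sketch self-contained
var = tf.Variable(1.0)
loss = tf.square(var)

# the step counter SyncReplicasOptimizer uses to detect stale gradients
global_step = tf.Variable(0, trainable=False, name='global_step')

# wrap the base optimizer so gradients from several replicas are aggregated
optimizer = tf.train.SyncReplicasOptimizer(
    tf.train.AdamOptimizer(learning_rate=0.001),
    replicas_to_aggregate=2,
    total_num_replicas=2)

grads_and_vars = optimizer.compute_gradients(loss)

# passing global_step here is what the ValueError asks for; apply_gradients
# also increments it once per aggregated update
apply_gradients_op = optimizer.apply_gradients(
    grads_and_vars=grads_and_vars,
    global_step=global_step,
    name='apply_gradients')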

AzizCode92 commented 6 years ago

On which OS are you training your model? I have almost the same output in my logs. My concern is about the

TF GPU device with id 0 was not registered

I'm not sure, but maybe this message can be ignored; I have seen many logs containing it on GitHub and on Slack.

vrenkens commented 6 years ago

About your first question: it is normal that 3 sessions are started when using 2 GPUs. There is one for each GPU worker and one for the parameter server that holds the parameters.
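Schematically (addresses made up; the real ones come from the generated cluster file), the cluster for your run has three tasks, and each train.py process starts a server for one of them:

import tensorflow as tf

# one parameter server plus one worker per GPU: three tasks in total
cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223', 'localhost:2224']})

# every train.py process creates the server for its own task, e.g. worker 0;
# a session connecting to such a server produces a "Start master session" line
server = tf.train.Server(cluster, job_name='worker', task_index=0)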

About the second question: I will check it out when I have the time

fanlu commented 6 years ago

@AzizCode92 I am using Docker on CentOS 7.4, and I have also run into

TF GPU device with id 0 was not registered

in my logs.

fanlu commented 6 years ago

@vrenkens Counting the ps as well, the count is 4:

2018-08-21 08:29:04.230358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-08-21 08:29:04.312654: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 0314965d18d48a0e with config: 
2018-08-21 08:29:04.314325: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.314370: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.317363: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.317405: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:04.703447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 21557 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-08-21 08:29:04.703531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 21557 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:03:00.0, compute capability: 6.1)
2018-08-21 08:29:05.458095: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.458151: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.462393: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.462629: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.470015: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
2018-08-21 08:29:05.470082: E tensorflow/core/grappler/clusters/utils.cc:127] Not found: TF GPU device with id 0 was not registered
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-08-21 08:29:10.788696: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session 843a2ba3e9062528 with config: gpu_options { allow_growth: true } allow_soft_placement: true
2018-08-21 08:29:12.074345: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session b7a26beb96e7ca5a with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 0/15600 loss: 3.883478, learning rate: 0.001000 
         time elapsed: 7.911287 sec
         peak memory usage: 22/22604 MB
WORKER 0: step 2/15600 loss: 3.766951, learning rate: 0.001000 
         time elapsed: 1.964389 sec
         peak memory usage: 731/22604 MB
WORKER 0: step 4/15600 loss: 3.677392, learning rate: 0.000999 
         time elapsed: 1.568714 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 6/15600 loss: 3.674108, learning rate: 0.000999 
         time elapsed: 2.856795 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 8/15600 loss: 3.673018, learning rate: 0.000999 
         time elapsed: 1.789929 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 10/15600 loss: 3.652077, learning rate: 0.000999 
         time elapsed: 1.303749 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 12/15600 loss: 3.645280, learning rate: 0.000998 
         time elapsed: 1.514070 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 14/15600 loss: 3.627352, learning rate: 0.000998 
         time elapsed: 1.308973 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 16/15600 loss: 3.628558, learning rate: 0.000998 
         time elapsed: 1.871332 sec
         peak memory usage: 884/22604 MB
2018-08-21 08:29:41.694464: I tensorflow/core/distributed_runtime/master_session.cc:1136] Start master session c98110034232ee37 with config: gpu_options { allow_growth: true } allow_soft_placement: true
WORKER 0: step 18/15600 loss: 3.619331, learning rate: 0.000997 
         time elapsed: 1.154132 sec
         peak memory usage: 884/22604 MB
WORKER 0: step 20/15600 loss: 3.636102, learning rate: 0.000997 
         time elapsed: 1.139632 sec
         peak memory usage: 884/22604 MB
fanlu commented 6 years ago

@vrenkens When I use 3 GPUs, the count is 5, so I guess that, apart from the ps and the chief worker, the master session starts twice on each of the other workers' GPUs.

AzizCode92 commented 6 years ago

Can you please share the output of

ps aux | grep python

fanlu commented 6 years ago

@AzizCode92

root     3130930 3130927  0 16:28 pts/20   00:00:04 python nabu/scripts/prepare_train.py --recipe=config/recipes/LAS/TIMIT --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2 --mode=single_machine
root     3130987 3130930 22 16:28 pts/20   00:02:18 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=ps --task_index=0 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
root     3130988 3130930 99 16:28 pts/20   00:34:05 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=worker --task_index=0 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
root     3130989 3130930 99 16:28 pts/20   00:32:22 python -u nabu/scripts/train.py --clusterfile=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2/cluster/cluster --job_name=worker --task_index=1 --ssh_command=None --expdir=/opt/cephfs1/asr/users/fanlu/mfs/timit_train2
vrenkens commented 6 years ago

I doubt that this is the case. I have only tested it with 2 workers and 1 ps, because I don't have bigger machines to play with. The overview of your jobs looks normal. From your first post it seems that both workers are working, no?

AzizCode92 commented 6 years ago

See here: task_index=0 is shared between a worker and the ps. I thought that the ps would be working on a different task index.

vrenkens commented 6 years ago

@AzizCode92 The job name is different. Task indices can be the same across different job names.

fanlu commented 6 years ago

@vrenkens Yes, when the first two master sessions start, only one worker is working; when the last master session starts, the second GPU starts working.

vrenkens commented 6 years ago

That is pretty normal behavior; it can take a while before all workers are online.

fanlu commented 6 years ago

Yes. So please help me figure out why the sync replicas optimizer can't apply gradients. Also, why do you use an explicit increment_step op? That seems odd to me. How about tf.train.get_or_create_global_step()?
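What I mean is something like this: when the global step is passed to the optimizer, it gets incremented automatically, so a separate increment op does not seem necessary (toy loss, just to illustrate):

import tensorflow as tf

# returns the graph's existing global step variable, or creates one
global_step = tf.train.get_or_create_global_step()

var = tf.Variable(1.0)
loss = tf.square(var)

# minimize() increments global_step on every update, so no explicit
# increment_step op should be needed
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)
    print(sess.run(global_step))  # 1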

vrenkens commented 6 years ago

I will take a look when I have the time

HawkAaron commented 6 years ago

Hi everyone,

On the TF website, SyncReplicasOptimizer provides a sync_replicas_hook that should be managed by a MonitoredTrainingSession.
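Roughly like this, adapted from the SyncReplicasOptimizer docs; whether it maps cleanly onto nabu's trainer I don't know (toy model, one worker shown):

import tensorflow as tf

# toy model, just to keep the sketch self-contained
var = tf.Variable(1.0)
loss = tf.square(var)
global_step = tf.train.get_or_create_global_step()

opt = tf.train.SyncReplicasOptimizer(
    tf.train.AdamOptimizer(1e-3),
    replicas_to_aggregate=2,
    total_num_replicas=2)
train_op = opt.minimize(loss, global_step=global_step)

# the hook initializes the token queue and local step; without it, workers
# can block forever waiting for sync tokens
is_chief = True  # task_index == 0 on the chief worker in a real cluster
sync_hook = opt.make_session_run_hook(is_chief)

# in a real cluster each worker passes its own server target as `master`;
# a run call only returns once enough replicas have contributed a gradient
with tf.train.MonitoredTrainingSession(is_chief=is_chief,
                                       hooks=[sync_hook]) as sess:
    sess.run(train_op)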