tensorflow / lingvo

Lingvo
Apache License 2.0

Using multiple GPUs for training on a single machine? #202

Open xxww opened 4 years ago

xxww commented 4 years ago

I am trying to train a simple 34-layer ResNet model on the ImageNet dataset, on a machine with multiple GPU cards (all V100s). I am using Lingvo 0.6.2 synced from GitHub, with TF 2.1.0 on Ubuntu 18.04.

I initially tried this setting: "--mode=sync --worker_gpus=2 --worker_split_size=2". From nvidia-smi I can see that both GPUs are used by Lingvo, but I am getting exactly the same speed as I got from a single GPU. So basically it worked, but with no performance gain. I also checked the other thread and saw that the CPUs are mostly idle, so it is definitely not bottlenecked on CPU tasks such as reading examples.

Then I tried this setting: "--mode=async --worker_gpus=2 --worker_split_size=2", i.e. I switched to async training, and got the following error. Checking the source code, it seems that for async I would have to set up each job myself?

This is my first time using Lingvo and I hope you could offer some help here.

```
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1855, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1847, in main
    RunnerManager(FLAGS.model).Start()
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1843, in Start
    self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1590, in CreateRunners
    trial)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1545, in _CreateRunner
    cfg = self.GetParamsForDataset('trainer_client', 'Train')
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1399, in GetParamsForDataset
    with cluster_factory.Cluster(cluster.params):
  File "/usr/local/lib/python3.6/dist-packages/lingvo/core/cluster.py", line 204, in __init__
    assert False, (p.mode, p.job)
AssertionError: ('async', 'trainer_client')
```

jonathanasdf commented 4 years ago

tl;dr you want worker_split_size=1

A split is one minibatch of inputs. With worker_split_size=2 you're saying that each minibatch should be given two GPUs, but in that case the model you're using needs to explicitly place ops on devices to take advantage of the two GPUs.

For example, in the bidirectional RNN the forward and backward RNNs are placed on different devices: https://github.com/tensorflow/lingvo/blob/ac6adce5ba868e45f115781bb74001db26ce0195/lingvo/core/rnn_layers.py#L476

The number of minibatches being processed concurrently will be worker_replicas * worker_gpus / worker_split_size.
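As a sanity check, the relationship above can be computed directly. This is a hypothetical helper for illustration, not part of Lingvo:

```python
def concurrent_minibatches(worker_replicas, worker_gpus, worker_split_size):
    """Number of minibatches in flight, per the formula above."""
    assert worker_gpus % worker_split_size == 0, "GPUs must divide evenly into splits"
    return worker_replicas * worker_gpus // worker_split_size

# The two settings discussed in this thread, on a single machine (1 replica):
print(concurrent_minibatches(1, 2, 2))  # 1: one split spans both GPUs, no data parallelism
print(concurrent_minibatches(1, 2, 1))  # 2: one minibatch per GPU
```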

xxww commented 4 years ago

Thanks for the response.

I tried the worker_split_size=1 trick.

To be specific, my whole command line is:

```shell
python -m lingvo.trainer \
  --logdir=$LOG_DIR \
  --model=imagenet.Imagenet \
  --resnet_depth=34 \
  --run_locally=gpu \
  --tfrecord_pattern=$TFRECORD \
  --mode=async \
  --worker_gpus=2 \
  --worker_split_size=1
```

And I also tried both mode=sync and mode=async with worker_split_size=1.

When "mode=sync, worker_gpus=2, worker_split_size=1", I ran into this issue:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1619, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 0 of dimension 0 out of bounds. for 'strided_slice' (op: 'StridedSlice') with input shapes: [0], [1], [1], [1] and with computed input tensors: input[1] = <0>, input[2] = <1>, input[3] = <1>.
```

When "mode=async, worker_gpus=2, worker_split_size=1", I ran into the same issue as before:

```
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1855, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1847, in main
    RunnerManager(FLAGS.model).Start()
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1843, in Start
    self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1590, in CreateRunners
    trial)
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1545, in _CreateRunner
    cfg = self.GetParamsForDataset('trainer_client', 'Train')
  File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1399, in GetParamsForDataset
    with cluster_factory.Cluster(cluster.params):
  File "/usr/local/lib/python3.6/dist-packages/lingvo/core/cluster.py", line 204, in __init__
    assert False, (p.mode, p.job)
AssertionError: ('async', 'trainer_client')
ERROR:tensorflow:==================================
```

jonathanasdf commented 4 years ago

For the sync case, can you paste more of the error (e.g. where the slice operation is defined)?

For the async case, I think there is a bug here

https://github.com/tensorflow/lingvo/blob/1aba0c93ae9592af88e43460aad9d19fa5b87e5a/lingvo/trainer.py#L1677

Should be

```python
elif FLAGS.mode == 'async':
  FLAGS.job = 'controller,trainer'
else:
  FLAGS.job = 'controller,trainer_client'
```

You can try to fix this locally.

xxww commented 4 years ago

I will patch in your change for async ASAP. In the meantime, here is the full error log for "mode=sync":

error.log

xxww commented 4 years ago

I can confirm that after patching the mode=async code path, it ends up with the same error as mode=sync above.

jonathanasdf commented 4 years ago

It looks like it's trying to split the input batch onto the two devices and failing.

Can you make sure that everything in the input batch being returned from your input generator has a leading batch dimension?

One way to check is to add here https://github.com/tensorflow/lingvo/blob/1aba0c93ae9592af88e43460aad9d19fa5b87e5a/lingvo/core/base_input_generator.py#L328

```python
for k, v in batch.FlattenItems():
  print(k, v.shape)
```
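To see why a batchless entry breaks things, here is a toy sketch in plain NumPy (not Lingvo's actual splitting code; `split_batch` and the shapes are made up for illustration) of what slicing every batch entry along its leading dimension looks like:

```python
import numpy as np

def split_batch(batch, num_splits):
    """Toy model of input splitting: slice every entry in the batch
    along its leading (batch) dimension, one piece per split."""
    splits = [{} for _ in range(num_splits)]
    for k, v in batch.items():
        v = np.asarray(v)
        if v.ndim == 0 or v.shape[0] < num_splits:
            # The analogue of the StridedSlice out-of-bounds error above.
            raise ValueError(f"{k!r} has no splittable batch dim: shape {v.shape}")
        for i, piece in enumerate(np.array_split(v, num_splits)):
            splits[i][k] = piece
    return splits

batch = {"rgb": np.zeros((8, 224, 224, 3)), "label": np.zeros(8)}
print(len(split_batch(batch, 2)))  # 2: every entry has a batch dim of 8

batch["bucket_keys"] = 1  # a scalar, with no leading batch dimension
# split_batch(batch, 2) now raises ValueError for 'bucket_keys'
```

The point of the sketch: every tensor the input generator returns must share a leading batch dimension, or the per-GPU slicing has nothing to slice.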
xxww commented 4 years ago

Thank you so much. After printing these tensors, I figured out that I have the following offender:

```python
def InputBatch(self):
  batch = py_utils.NestedMap()
  batch.bucket_keys = 1  # commenting this out fixes it
  batch.rgb = self._rgb
  batch.label = self._label
  return batch
```