Open xxww opened 4 years ago
tl;dr you want worker_split_size=1
A split is one minibatch of inputs. With worker_split_size=2, you're saying that each minibatch should be given two GPUs; in that case, the model you're using needs to explicitly place ops on the devices to take advantage of both GPUs.
example bidi-rnn where the forward and backward rnns are on different devices https://github.com/tensorflow/lingvo/blob/ac6adce5ba868e45f115781bb74001db26ce0195/lingvo/core/rnn_layers.py#L476
The number of minibatches being processed concurrently will be worker_replicas * worker_gpus / worker_split_size.
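To make that arithmetic concrete, here is a small sketch (plain Python; the function name is mine, not part of the Lingvo API):

```python
def concurrent_minibatches(worker_replicas, worker_gpus, worker_split_size):
    # Number of minibatches in flight at once, per the formula above.
    # Illustrative helper only, not a Lingvo function.
    assert worker_gpus % worker_split_size == 0
    return worker_replicas * worker_gpus // worker_split_size

# One worker with two GPUs:
print(concurrent_minibatches(1, 2, 1))  # 2: one minibatch per GPU
print(concurrent_minibatches(1, 2, 2))  # 1: one minibatch spread over both GPUs
```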
Thanks for the response.
I tried the worker_split_size=1 trick.
To be specific, my whole command line is as follows:
python -m lingvo.trainer \
--logdir=$LOG_DIR \
--model=imagenet.Imagenet \
--resnet_depth=34 \
--run_locally=gpu \
--tfrecord_pattern=$TFRECORD \
--mode=async \
--worker_gpus=2 \
--worker_split_size=1
And I also tried both mode=sync and mode=async with worker_split_size=1.
When "mode=sync, worker_gpus=2, worker_split_size=1", I ran into this issue:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1619, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: slice index 0 of dimension 0 out of bounds. for 'strided_slice' (op: 'StridedSlice') with input shapes: [0], [1], [1], [1] and with computed input tensors: input[1] = <0>, input[2] = <1>, input[3] = <1>.
When "mode=async, worker_gpus=2, worker_split_size=1", I ran into the same issue as before:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1855, in <module>
For the sync case, can you paste more of the error (e.g., where the slice operation is defined)?
For the async case, I think there is a bug here
Should be
elif FLAGS.mode == 'async':
FLAGS.job = 'controller,trainer'
else:
FLAGS.job = 'controller,trainer_client'
You can try to fix this locally.
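A standalone sketch of what the corrected dispatch does (a toy reproduction of the suggested fix; the real logic lives in lingvo/trainer.py):

```python
def job_for_mode(mode):
    # Async training runs a 'trainer' job; sync runs a 'trainer_client'.
    # (Toy sketch of the fix above, not the actual trainer code.)
    if mode == 'async':
        return 'controller,trainer'
    else:
        return 'controller,trainer_client'

print(job_for_mode('async'))  # controller,trainer
print(job_for_mode('sync'))   # controller,trainer_client
```

With the unpatched code, mode=async fell through to 'controller,trainer_client', which is what trips the `assert False, (p.mode, p.job)` in cluster.py.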
I will patch your change into async ASAP. Meanwhile, here comes the full error log for "mode=sync":
I can confirm that after patching the mode=async code path, it ends up with the same error as mode=sync above.
It looks like it's trying to split the input batch onto the two devices and failing.
Can you make sure that everything in the input batch being returned from your input generator has a leading batch dimension?
One way to check is to add here https://github.com/tensorflow/lingvo/blob/1aba0c93ae9592af88e43460aad9d19fa5b87e5a/lingvo/core/base_input_generator.py#L328
for k, v in batch.FlattenItems():
print(k, v.shape)
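For context on why a scalar entry breaks the split: each entry in the input batch is sliced along its leading dimension, one slice per split, so an entry with no leading dimension has nothing to slice. A rough standalone illustration, with plain Python lists standing in for tensors (split_batch is a toy sketch, not Lingvo's actual splitter):

```python
def split_batch(batch, num_splits):
    # Split every entry along its leading (batch) dimension; a scalar
    # entry has no leading dimension and cannot be split.
    splits = [{} for _ in range(num_splits)]
    for key, value in batch.items():
        if not hasattr(value, '__len__'):
            raise ValueError('%s is a scalar; expected a leading batch dim' % key)
        per_split = len(value) // num_splits
        for i in range(num_splits):
            splits[i][key] = value[i * per_split:(i + 1) * per_split]
    return splits

batch = {'rgb': [[0.1], [0.2], [0.3], [0.4]], 'label': [0, 1, 0, 1]}
print(split_batch(batch, 2))  # works: every entry has a leading dim of 4

batch['bucket_keys'] = 1      # scalar, like the offending InputBatch below
# split_batch(batch, 2) now raises ValueError
```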
Thank you so much. Upon printing these tensors, I figured out that I have the following offender:
def InputBatch(self):
  batch = py_utils.NestedMap()
  batch.bucket_keys = 1  # commenting this out will fix it
  batch.rgb = self._rgb
  batch.label = self._label
  return batch
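If bucket_keys is actually needed, an alternative to deleting it is to give it a leading batch dimension. A hedged sketch with plain dicts and lists standing in for py_utils.NestedMap and tensors (the function name and data are placeholders, not the real generator):

```python
def input_batch(rgb, label):
    # Every entry gets a leading batch dimension of the same size.
    batch_size = len(rgb)
    return {
        # Repeat the key once per example instead of keeping a scalar.
        'bucket_keys': [1] * batch_size,
        'rgb': rgb,
        'label': label,
    }

b = input_batch([[0.1], [0.2]], [0, 1])
assert all(len(v) == 2 for v in b.values())  # every entry splits cleanly
```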
I am trying to train a simple 34-layer ResNet model with the ImageNet dataset on a machine with multiple GPU cards (all V100s). I am using Lingvo-0.6.2 synced from GitHub, with TF 2.1.0 on Ubuntu 18.04.
I initially tried this setting: "--mode=sync --worker_gpus=2 --worker_split_size=2". From nvidia-smi, I can see that both GPUs are used by Lingvo, but I am getting exactly the same speed as I got from a single GPU. So basically it worked, but with no performance gain. I also checked the other thread and saw that the CPUs are pretty idle, so it is definitely not bottlenecked on CPU tasks such as reading examples.
Then I tried this setting: "--mode=async --worker_gpus=2 --worker_split_size=2". Basically I switched to async training, and I got the following error. I checked the source code, and it seems like for async I will have to set up each job myself?
This is my first time using Lingvo and I hope you could offer some help here.
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1855, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1847, in main
RunnerManager(FLAGS.model).Start()
File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1843, in Start
self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1590, in CreateRunners
trial)
File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1545, in _CreateRunner
cfg = self.GetParamsForDataset('trainer_client', 'Train')
File "/usr/local/lib/python3.6/dist-packages/lingvo/trainer.py", line 1399, in GetParamsForDataset
with cluster_factory.Cluster(cluster.params):
File "/usr/local/lib/python3.6/dist-packages/lingvo/core/cluster.py", line 204, in __init__
assert False, (p.mode, p.job)
AssertionError: ('async', 'trainer_client')