Thanks for reporting this. This is my bad. I will fix it soon.
This should be fixed on master now. Please let me know if the problem persists.
The problem appears to be fixed, thank you. Performance doesn't seem to scale well with the number of GPUs, though: 4 GPUs are only about 1.5x as fast as 1 GPU, even though all of them are being utilized. It could just be a CPU bottleneck on my side, since even with a high number of workers only a few cores appear to be doing the majority of the work.
That's somewhat expected; only single-process parallelism (using nn.DataParallel) is supported, and that's a bit slow (but easy to implement). We could use multi-process distributed training to speed up multi-GPU training further. Even distributed training doesn't mean we'd see a 4x speedup with 4 GPUs, but it's faster at least.
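Roughly, the difference looks like this in plain PyTorch (just a sketch, not the actual NNSVS training code; the model, batch shapes, and master address below are placeholders):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn


def build_model() -> nn.Module:
    # Placeholder model standing in for the actual acoustic/duration model.
    return nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))


# Option 1: single-process parallelism (what is supported now).
def train_data_parallel() -> None:
    model = nn.DataParallel(build_model().cuda())  # one Python process
    x = torch.randn(32, 64).cuda()
    # The batch is scattered across visible GPUs, outputs gathered on GPU 0;
    # the gather/scatter and the single process add overhead each step.
    model(x).sum().backward()


# Option 2: multi-process DistributedDataParallel, one process per GPU.
def ddp_worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.parallel.DistributedDataParallel(
        build_model().cuda(rank), device_ids=[rank]
    )
    x = torch.randn(32, 64, device=rank)
    model(x).sum().backward()  # gradients are all-reduced across processes

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(ddp_worker, args=(world_size,), nprocs=world_size)
```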
I'll work on distributed training in the future, but it's not a priority in my opinion. Using a single GPU is enough for most cases.
When I set data_parallel to "true" in the config, I end up with this as a result. I'm running this in a Jupyter notebook instance based on the ENUNU-Training-Kit with the dev2 branch of NNSVS.
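For reference, the change I made is roughly the following; this is only a sketch from memory, and the flag's exact location in the ENUNU-Training-Kit / NNSVS dev2 config may differ:

```yaml
# Hypothetical excerpt of the training config; only the data_parallel
# flag comes from this thread.
data_parallel: true
```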