threestudio-project / threestudio

A unified framework for 3D content generation.

Multi-GPU training gets stuck #81

Closed. ShoufaChen closed this issue 1 year ago.

ShoufaChen commented 1 year ago

Hi,

Thanks for your awesome code.

When I use four V100 GPUs, the program gets stuck as follows:

(base)$ python launch.py --config configs/dreamfusion-if.yaml --train --gpu 0,1,2,3 system.prompt_processor.prompt="a zoomed out DSLR photo of a baby bunny sitting on top of a stack of pancakes" data.batch_size=2 data.n_val_views=4

[WARNING] Timestamp is disabled when using multiple GPUs, please make sure you have a unique tag.
Global seed set to 0

[INFO] ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
[INFO] Using 16bit Automatic Mixed Precision (AMP)
[INFO] GPU available: True (cuda), used: True
[INFO] TPU available: False, using: 0 TPU cores
[INFO] IPU available: False, using: 0 IPUs
[INFO] HPU available: False, using: 0 HPUs
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
[rank: 1] Global seed set to 0
[rank: 3] Global seed set to 0
[rank: 2] Global seed set to 0
[rank: 3] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
[rank: 2] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
[rank: 1] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4

Could you please give some hints about this problem? Thanks in advance.

thuliu-yt16 commented 1 year ago

Could you check whether there are two processes on one of the GPUs? The prompt processor spawns a separate process to embed the text for all GPUs, and that process may have gotten stuck.
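
If it helps, one quick way to check (just a sketch, and the output layout varies across driver versions) is to look at the per-GPU process table while the run is stuck:

$ nvidia-smi

If the prompt processor's extra process is the cause, you would expect one GPU to show two python processes while the others show one.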

ShoufaChen commented 1 year ago

Hi @thuliu-yt16 ,

Thanks for your reply. Could you give more details on how to 'see two processes on one of the GPUs'?

There is no error when I use one GPU for training:

python launch.py --config configs/dreamfusion-if.yaml --train --gpu 0 system.prompt_processor.prompt="a zoomed out DSLR photo of a baby bunny sitting on top of a stack of pancakes"

bennyguo commented 1 year ago

Could you try running with --verbose and see where it gets stuck? This will print more information.

ShoufaChen commented 1 year ago

Hi,

The log with --verbose is:

$ python launch.py --config configs/dreamfusion-if.yaml --train --gpu 0,1 --verbose system.prompt_processor.prompt="a zoomed out DSLR photo of a baby bunny sitting on top of a stack of pancakes" data.batch_size=2 data.n_val_views=2
[WARNING] Timestamp is disabled when using multiple GPUs, please make sure you have a unique tag.
Global seed set to 0
[INFO] ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
[DEBUG] Trainer: Initializing trainer with parameters: {'self': <pytorch_lightning.trainer.trainer.Trainer object at 0x7f867c38dd80>, 'accelerator': 'auto', 'strategy': 'auto', 'devices': 'auto', 'num_nodes': 1, 'precision': '16-mixed', 'logger': [<pytorch_lightning.loggers.tensorboard.TensorBoardLogger object at 0x7f867c38feb0>, <pytorch_lightning.loggers.csv_logs.CSVLogger object at 0x7f867c38dfc0>], 'callbacks': [<pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint object at 0x7f867c38d000>, <pytorch_lightning.callbacks.lr_monitor.LearningRateMonitor object at 0x7f867c38e020>, <threestudio.utils.callbacks.CustomProgressBar object at 0x7f867c38e0b0>, <threestudio.utils.callbacks.CodeSnapshotCallback object at 0x7f867c38fc10>, <threestudio.utils.callbacks.ConfigSnapshotCallback object at 0x7f867c38df30>], 'fast_dev_run': False, 'max_epochs': None, 'min_epochs': None, 'max_steps': 10000, 'min_steps': None, 'max_time': None, 'limit_train_batches': None, 'limit_val_batches': None, 'limit_test_batches': None, 'limit_predict_batches': None, 'overfit_batches': 0.0, 'val_check_interval': 200, 'check_val_every_n_epoch': 1, 'num_sanity_val_steps': 0, 'log_every_n_steps': 1, 'enable_checkpointing': None, 'enable_progress_bar': True, 'enable_model_summary': None, 'accumulate_grad_batches': 1, 'gradient_clip_val': None, 'gradient_clip_algorithm': None, 'deterministic': None, 'benchmark': None, 'inference_mode': False, 'use_distributed_sampler': True, 'profiler': None, 'detect_anomaly': False, 'barebones': False, 'plugins': None, 'sync_batchnorm': False, 'reload_dataloaders_every_n_epochs': 0, 'default_root_dir': None, '__class__': <class 'pytorch_lightning.trainer.trainer.Trainer'>}
[DEBUG] DDPStrategy: initializing DDP plugin
[INFO] Using 16bit Automatic Mixed Precision (AMP)
[INFO] GPU available: True (cuda), used: True
[INFO] TPU available: False, using: 0 TPU cores
[INFO] IPU available: False, using: 0 IPUs
[INFO] HPU available: False, using: 0 HPUs
[DEBUG] Trainer: trainer fit stage
[DEBUG] Trainer: preparing data
[DEBUG] Trainer: setting up strategy environment
[DEBUG] DDPStrategy: setting up distributed...
[rank: 0] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Global seed set to 0
[DEBUG] Trainer: Initializing trainer with parameters: {'self': <pytorch_lightning.trainer.trainer.Trainer object at 0x7f8e2cd92ef0>, 'accelerator': 'auto', 'strategy': 'auto', 'devices': 'auto', 'num_nodes': 1, 'precision': '16-mixed', 'logger': [<pytorch_lightning.loggers.tensorboard.TensorBoardLogger object at 0x7f8e2cd93eb0>, <pytorch_lightning.loggers.csv_logs.CSVLogger object at 0x7f8e2cd92350>], 'callbacks': [<pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint object at 0x7f8e2cd90580>, <pytorch_lightning.callbacks.lr_monitor.LearningRateMonitor object at 0x7f8e2cd922f0>, <threestudio.utils.callbacks.CustomProgressBar object at 0x7f8e2cd92380>, <threestudio.utils.callbacks.CodeSnapshotCallback object at 0x7f8e2cd93c10>, <threestudio.utils.callbacks.ConfigSnapshotCallback object at 0x7f8e2cd92200>], 'fast_dev_run': False, 'max_epochs': None, 'min_epochs': None, 'max_steps': 10000, 'min_steps': None, 'max_time': None, 'limit_train_batches': None, 'limit_val_batches': None, 'limit_test_batches': None, 'limit_predict_batches': None, 'overfit_batches': 0.0, 'val_check_interval': 200, 'check_val_every_n_epoch': 1, 'num_sanity_val_steps': 0, 'log_every_n_steps': 1, 'enable_checkpointing': None, 'enable_progress_bar': True, 'enable_model_summary': None, 'accumulate_grad_batches': 1, 'gradient_clip_val': None, 'gradient_clip_algorithm': None, 'deterministic': None, 'benchmark': None, 'inference_mode': False, 'use_distributed_sampler': True, 'profiler': None, 'detect_anomaly': False, 'barebones': False, 'plugins': None, 'sync_batchnorm': False, 'reload_dataloaders_every_n_epochs': 0, 'default_root_dir': None, '__class__': <class 'pytorch_lightning.trainer.trainer.Trainer'>}
[DEBUG] DDPStrategy: initializing DDP plugin
[DEBUG] Trainer: trainer fit stage
[DEBUG] Trainer: preparing data
[DEBUG] Trainer: setting up strategy environment
[DEBUG] DDPStrategy: setting up distributed...
[rank: 1] Global seed set to 0
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

ShoufaChen commented 1 year ago

I solved it by modifying os.environ["MASTER_ADDR"] from the default '127.0.0.1' to 'localhost' in .../site-packages/lightning_fabric/utilities/distributed.py (def _init_dist_connection).
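
In case it helps others: Lightning appears to read MASTER_ADDR from the environment when it is set, so the same effect can probably be achieved without editing site-packages by exporting the variable before launching (a sketch, not something I verified on this setup):

$ MASTER_ADDR=localhost python launch.py --config configs/dreamfusion-if.yaml --train --gpu 0,1,2,3 system.prompt_processor.prompt="a zoomed out DSLR photo of a baby bunny sitting on top of a stack of pancakes" data.batch_size=2 data.n_val_views=4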

bennyguo commented 1 year ago

Glad it's solved. But it's a very weird problem (and a very weird solution too). Are you using a network proxy?
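
For reference, whether a proxy is active in the launching shell can usually be checked with something like (just a sketch):

$ env | grep -i proxy

Whether a proxy would actually interfere with the local DDP rendezvous here is only a guess, though.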