threestudio-project / threestudio

A unified framework for 3D content generation.
Apache License 2.0

Multi-GPU training gets stuck #324

Open Piggy-ch opened 1 year ago

Piggy-ch commented 1 year ago

I noticed that #81 encountered a similar issue, but that solution didn't work for me. Here is my debug output. Note: training works fine on a single GPU, but not on multiple GPUs. The freeze typically occurs after I have killed a previous training session; when I try to train again, the new run gets stuck and never proceeds.
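Since the hang only appears after a killed run, one thing worth trying before relaunching is cleaning up leftover worker processes and rendezvous artifacts that can keep the next DDP init from completing. This is a hedged sketch for a typical Linux setup; the `launch.py` pattern and the temp-dir path are assumptions about how the job was started, so adjust them to your launcher:

```shell
# Kill any leftover training workers from the previous (killed) run,
# so they stop holding GPU memory and the NCCL rendezvous port.
# "launch.py" is an assumed process pattern -- match it to your own command line.
pkill -9 -f "launch.py" 2>/dev/null || true

# Stale torch-elastic/rendezvous temp dirs from the killed run can also make
# the next distributed init hang; removing them is harmless if none exist.
rm -rf /tmp/torchelastic_* 2>/dev/null || true

echo "cleanup done"
```

After this, relaunching the multi-GPU run starts from a clean state instead of contending with half-dead processes from the killed session.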

[rank: 1] Global seed set to 1
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
[DEBUG] DDPStrategy: initializing DDP plugin
[DEBUG] Trainer: trainer fit stage
[DEBUG] Trainer: preparing data
[DEBUG] Trainer: setting up strategy environment
[DEBUG] DDPStrategy: setting up distributed...
[rank: 2] Global seed set to 2
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
[DEBUG] DDPStrategy: initializing DDP plugin
[DEBUG] Trainer: trainer fit stage
[DEBUG] Trainer: preparing data
[DEBUG] Trainer: setting up strategy environment
[DEBUG] DDPStrategy: setting up distributed...
[rank: 3] Global seed set to 3
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
[INFO] ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

[DEBUG] Trainer: calling callback hook: setup
[DEBUG] Trainer: calling callback hook: setup
[DEBUG] Trainer: calling callback hook: setup
[DEBUG] Trainer: calling callback hook: setup
[DEBUG] Trainer: restoring module and callbacks from checkpoint path: None
[DEBUG] Trainer: restoring module and callbacks from checkpoint path: None
[DEBUG] Trainer: restoring module and callbacks from checkpoint path: None
[DEBUG] `checkpoint_path` not specified. Skipping checkpoint loading.
[DEBUG] `checkpoint_path` not specified. Skipping checkpoint loading.
[DEBUG] Trainer: restoring module and callbacks from checkpoint path: None
[DEBUG] `checkpoint_path` not specified. Skipping checkpoint loading.
[DEBUG] Trainer: configuring sharded model
[DEBUG] Trainer: configuring sharded model
[DEBUG] `checkpoint_path` not specified. Skipping checkpoint loading.
[DEBUG] Trainer: configuring sharded model
[DEBUG] Trainer: configuring sharded model
[INFO] LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[INFO] LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[INFO] LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[INFO] LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[DEBUG] DDPStrategy: moving model to device [cuda:0]...
[DEBUG] DDPStrategy: moving model to device [cuda:1]...
[DEBUG] DDPStrategy: moving model to device [cuda:2]...
[DEBUG] DDPStrategy: moving model to device [cuda:3]...
[DEBUG] DDPStrategy: configuring DistributedDataParallel
[DEBUG] setting up DDP model with device ids: [0], kwargs: {}
[DEBUG] DDPStrategy: configuring DistributedDataParallel
[DEBUG] setting up DDP model with device ids: [1], kwargs: {}
[DEBUG] DDPStrategy: configuring DistributedDataParallel
[DEBUG] setting up DDP model with device ids: [2], kwargs: {}
[DEBUG] DDPStrategy: configuring DistributedDataParallel
[DEBUG] setting up DDP model with device ids: [3], kwargs: {}
[DEBUG] DDPStrategy: registering ddp hooks
[DEBUG] DDPStrategy: registering ddp hooks
[DEBUG] DDPStrategy: registering ddp hooks
[DEBUG] DDPStrategy: registering ddp hooks
[DEBUG] Trainer: calling callback hook: on_fit_start
[DEBUG] Trainer: calling callback hook: on_fit_start
[DEBUG] Trainer: calling callback hook: on_fit_start
[DEBUG] Trainer: calling callback hook: on_fit_start
[INFO]
  | Name       | Type                 | Params
----------------------------------------------------
0 | geometry   | ImplicitSDF          | 12.6 M
1 | material   | NoMaterial           | 0
2 | background | SolidColorBackground | 0
3 | renderer   | NVDiffRasterizer     | 0
----------------------------------------------------
12.6 M    Trainable params
0         Non-trainable params
12.6 M    Total params
50.417    Total estimated model params size (MB)
[INFO] Validation results will be saved to outputs/fantasia3d/Stephen_Hawking/save
[INFO] Using prompt [Stephen Hawking] and negative prompt []
[INFO] Using view-dependent prompts [side]:[Stephen Hawking, side view] [front]:[Stephen Hawking, front view] [back]:[Stephen Hawking, back view] [overhead]:[Stephen Hawking, overhead view]
[DEBUG] Text embeddings for model stabilityai/stable-diffusion-2-1-base and prompt [] are already in cache, skip processing.
[DEBUG] Text embeddings for model stabilityai/stable-diffusion-2-1-base and prompt [] are already in cache, skip processing.
[DEBUG] Text embeddings for model stabilityai/stable-diffusion-2-1-base and prompt [] are already in cache, skip processing.
[DEBUG] Text embeddings for model stabilityai/stable-diffusion-2-1-base and prompt [] are already in cache, skip processing.
[DEBUG] Text embeddings for model stabilityai/stable-diffusion-2-1-base and prompt [] are already in cache, skip processing.
Piggy-ch commented 1 year ago

Please note that I previously encountered a situation where it got stuck at this point, and I resolved it by setting environment variables. I'm not sure if this issue is related to the environment variables I set.

[INFO] ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

My solution: export NCCL_IB_DISABLE=1; export NCCL_P2P_DISABLE=1
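Expanded as a launch-time sketch: the two NCCL variables above disable InfiniBand and peer-to-peer transports, which works around hangs on machines where those paths are broken. The `NCCL_DEBUG=INFO` line is my own addition (not from the original fix) to make NCCL print where the rendezvous stalls if the hang comes back:

```shell
# Disable InfiniBand and P2P transports for NCCL (the fix from this thread).
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1

# Assumption/addition: surface NCCL's own logs so a future hang shows
# which communicator setup step it is stuck on.
export NCCL_DEBUG=INFO

echo "NCCL_IB_DISABLE=$NCCL_IB_DISABLE NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
```

With these exported in the shell that launches training, all four ranks fall back to shared-memory/socket transports instead of the disabled paths.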

Piggy-ch commented 1 year ago

Just to add, I am running the Fantasia3D model.

Piggy-ch commented 1 year ago

Note that this only occurs when generating from a specified OBJ file. It gets stuck at the step just before SDF initialization.