siyi-wind / TIP

[ECCV 2024] TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data (an official implementation)
Apache License 2.0

Issue about training with multiple GPUs #3

Open fyb99 opened 1 month ago

fyb99 commented 1 month ago

Hi, dear siyi, thank you very much for your solid work. I have a question about transferring your code to other datasets: does your code support training with multiple GPUs? When I try to use 2 GPUs, the program seems to get stuck.

siyi-wind commented 1 month ago

Hi, thanks for your interest in our paper. Yes, our code supports multiple GPUs. Did you use the same environment as mentioned in the README? Can you share the error message?

fyb99 commented 1 month ago

Yes, I am using the same environment. I run the command 'CUDA_VISIBLE_DEVICES=6,7 python -u run.py --config-name config_mydataset_TIP exp_name=pretrain'

The process gets stuck after printing the model architecture; the full console output is listed below:

Using TIP3Loss
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
Global seed set to 2022
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Global seed set to 2022
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

/mnt/data2/xx/anaconda3/envs/tip/lib/python3.9/site-packages/pytorch_lightning/trainer/configuration_validator.py:364: LightningDeprecationWarning: SSLOnlineEvaluator.on_load_checkpoint will change its signature and behavior in v1.8. If you wish to load the state of the callback, use load_state_dict instead. In v1.8 on_load_checkpoint(..., checkpoint) will receive the entire loaded checkpoint dictionary instead of callback state.
  rank_zero_deprecation(

siyi-wind commented 1 month ago

Hi, what kind of GPUs are you using? Do your GPUs support DDP training?

I've rerun my code and confirmed that it supports multiple GPUs. Could you use a debugger to find where it gets stuck?

fyb99 commented 1 month ago

Actually, I am using A6000 GPUs for this process. They support DDP.

After debugging, I find the code gets stuck at the trainer definition: trainer = Trainer.from_argparse_args(hparams, gpus=cuda.device_count(), callbacks=callbacks, logger=wandb_logger, max_epochs=hparams.max_epochs, check_val_every_n_epoch=hparams.check_val_every_n_epoch, limit_train_batches=hparams.limit_train_batches, limit_val_batches=hparams.limit_val_batches, enable_progress_bar=hparams.enable_progress_bar,)

If I add the parameter strategy='dp' to that call (i.e. the same Trainer.from_argparse_args call as above with strategy='dp' appended), it successfully reaches the training stage, but the progress bar then gets stuck, as shown in the attached screenshot. A sketch of the modified call is included below.
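For reference, a minimal self-contained sketch of a Trainer constructed with an explicit strategy, assuming the pytorch-lightning 1.x API visible in the logs above; the numeric values stand in for the hparams fields and are not the repo's actual configuration:

# Minimal sketch, not the repo's exact run.py: a Lightning Trainer with an
# explicit distributed strategy (pytorch-lightning 1.x API assumed).
import pytorch_lightning as pl
from torch import cuda

trainer = pl.Trainer(
    gpus=cuda.device_count(),    # 2 when launched with CUDA_VISIBLE_DEVICES=6,7
    strategy="ddp",              # or "dp"; leaving it unset lets Lightning choose a default
    max_epochs=100,              # placeholder values standing in for hparams.*
    check_val_every_n_epoch=1,
    limit_train_batches=1.0,
    limit_val_batches=1.0,
    enable_progress_bar=True,
)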

If I instead add strategy='ddp', it raises a different error.

I believe my environment matches yours, since I installed it from the provided YAML file.

[Screenshot WX20241023-141258: stuck progress bar]

siyi-wind commented 1 month ago

Hi, I suspect this might be caused by an unknown issue in pytorch-lightning. You could try upgrading its version. Also, have you reproduced MMCL? We used the same environment, but I used Albumentations to accelerate the training process.

fyb99 commented 1 month ago

Yes, I have reproduced MMCL and can successfully run it with one GPU; I have not yet tried to train MMCL with multiple GPUs. I have upgraded pytorch-lightning, but that raises other errors before training, which seem to be caused by conflicts between package versions.

fyb99 commented 1 month ago

When I do not set an additional strategy on the trainer, add export NCCL_P2P_DISABLE=1 to the environment, and use the command 'CUDA_VISIBLE_DEVICES=6,7 python -u run.py --config-name config_mydataset_TIP exp_name=pretrain' (I guess this is your default usage), the program seems to run successfully. However, the speed seems too slow. Since I have extended the framework to 3D volume data, may I ask how long each iteration takes when you train on your dataset?
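For clarity, a minimal sketch of that workaround (where to place it is an assumption, not the repo's documented fix): the same effect as the shell export can be obtained from Python by setting the variable before the Trainer is created.

import os

# Disable NCCL peer-to-peer transfers before Lightning spawns its DDP processes;
# equivalent to running export NCCL_P2P_DISABLE=1 in the shell.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")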

siyi-wind commented 1 month ago

Hi, I did not set NCCL_P2P_DISABLE=1 before training, and I have run our code on different servers and GPUs. It might be due to an underlying incompatibility between pytorch-lightning and your GPUs. You could try upgrading pytorch-lightning and adjusting the code to resolve the conflicts.

Besides, we used two A5000 GPUs and set the number of workers to 10-12. The average time per epoch is around 5-7 minutes. Please note that we used Albumentations to accelerate the data augmentation process. It will be much slower if you load data from formats other than NumPy.
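For illustration, a minimal sketch of an Albumentations pipeline applied to a pre-saved NumPy image; the specific transforms and file name are hypothetical and not the paper's exact augmentation recipe:

import numpy as np
import albumentations as A

# Albumentations operates directly on NumPy arrays (HWC), so storing images as
# .npy files avoids per-sample decoding overhead during training.
transform = A.Compose([
    A.RandomResizedCrop(height=128, width=128, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

image = np.load("example_image.npy")         # hypothetical pre-saved image array
augmented = transform(image=image)["image"]  # returns an augmented NumPy array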