Open fyb99 opened 1 month ago
Hi, thanks for your interest in our paper. Yes, our code supports multiple GPUs. Did you use the same environment as mentioned in the README? Can you share the error message?
Yes, I use the same environment, and I run the command 'CUDA_VISIBLE_DEVICES=6,7 python -u run.py --config-name config_mydataset_TIP exp_name=pretrain'.
The process gets stuck after printing the model architecture; the full log is listed below:
Trainer(limit_train_batches=1.0)
was configured so 100% of the batches per epoch will be used..
Trainer(limit_val_batches=1.0)
was configured so 100% of the batches will be used..
Global seed set to 2022
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Global seed set to 2022
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
/mnt/data2/xx/anaconda3/envs/tip/lib/python3.9/site-packages/pytorch_lightning/trainer/configuration_validator.py:364: LightningDeprecationWarning: SSLOnlineEvaluator.on_load_checkpoint will change its signature and behavior in v1.8. If you wish to load the state of the callback, use load_state_dict instead. In v1.8 on_load_checkpoint(..., checkpoint) will receive the entire loaded checkpoint dictionary instead of callback state.
rank_zero_deprecation(
Hi, what kind of GPUs are you using? Do they support DDP training?
I've rerun my code and confirmed that it supports multiple GPUs. Could you use a debugging tool to see where it gets stuck?
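One lightweight way to find where a distributed run hangs, without attaching a full debugger, is the standard-library faulthandler module: it can dump every thread's stack trace after a timeout. This is a generic sketch, not part of the project's code; the 60-second timeout is an arbitrary choice, and the snippet would go near the top of run.py (the file name comes from the launch command in this thread).

```python
import faulthandler

# If the process is still alive after 60 seconds, print tracebacks for all
# threads to stderr, and repeat every 60 seconds so later hangs are caught too.
# Each rank prints its own dump, which shows where each process is blocked.
faulthandler.dump_traceback_later(60, repeat=True)
```

Running the training command as usual then prints stack traces for every rank that is stuck, which usually points directly at the blocking call (e.g. an NCCL collective).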
Actually, I am using A6000 GPUs for this run; they do support DDP.
After debugging, I find the code gets stuck at the trainer definition:
trainer = Trainer.from_argparse_args(hparams, gpus=cuda.device_count(), callbacks=callbacks, logger=wandb_logger, max_epochs=hparams.max_epochs, check_val_every_n_epoch=hparams.check_val_every_n_epoch, limit_train_batches=hparams.limit_train_batches, limit_val_batches=hparams.limit_val_batches, enable_progress_bar=hparams.enable_progress_bar)
If I add the parameter strategy='dp' to the call, it successfully reaches the training stage, but the progress bar then gets stuck, as shown in the attached plot.
If I add strategy='ddp' instead, a different error is raised.
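For reference, the strategy experiment above can be isolated from the rest of the config like this. The keyword names (gpus, strategy) follow the pytorch-lightning 1.x Trainer API that the thread's deprecation warnings indicate; the concrete values are placeholders for the hparams fields in the original call, not the project's actual settings.

```python
# Placeholder values standing in for hparams fields from the original call.
trainer_kwargs = {
    "gpus": 2,          # e.g. torch.cuda.device_count()
    "strategy": "dp",   # "dp" reached training for the reporter; "ddp" raised a different error
    "max_epochs": 100,  # stands in for hparams.max_epochs
}

if __name__ == "__main__":
    # Constructing the Trainer requires pytorch-lightning to be installed.
    from pytorch_lightning import Trainer

    trainer = Trainer(**trainer_kwargs)
```

Swapping "dp" for "ddp" (or "ddp_spawn") in one place makes it easy to compare the two hangs without touching the rest of the launch script.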
I believe my environment is the same as yours, since I installed it from the provided YAML file.
Hi, I guess this might be caused by an unknown issue in pytorch-lightning. You could try upgrading its version. Besides, have you reproduced MMCL? We used the same environment, but I used Albumentations to accelerate the training process.
Yes, I have reproduced MMCL and can run it successfully with one GPU; I have not yet tried training MMCL with multiple GPUs. I have upgraded pytorch-lightning, but it raises other errors before training, which seem to be caused by conflicts between package versions.
When I do not set an additional strategy on the trainer, add 'export NCCL_P2P_DISABLE=1' to the environment, and use the command 'CUDA_VISIBLE_DEVICES=6,7 python -u run.py --config-name config_mydataset_TIP exp_name=pretrain' (I guess this is your default usage), the program seems to run successfully. But the speed seems too slow. Since I extend the framework to 3D volume data, may I ask how long each iteration took when you trained it on your dataset?
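Written out as a launch snippet, the workaround above looks like this. The run.py command is copied from the thread; NCCL_P2P_DISABLE=1 forces NCCL to avoid GPU peer-to-peer transport, which can hang DDP on some GPU/driver/topology combinations, at some cost in communication bandwidth.

```shell
# Disable NCCL peer-to-peer transport, a known workaround for DDP hangs
# on certain GPU topologies (trades some inter-GPU bandwidth for stability).
export NCCL_P2P_DISABLE=1

# Pin the visible GPUs and launch pretraining (command from this thread):
# CUDA_VISIBLE_DEVICES=6,7 python -u run.py --config-name config_mydataset_TIP exp_name=pretrain
```

Setting NCCL_DEBUG=INFO alongside it can help confirm which transport NCCL actually selects.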
Hi, I did not set 'NCCL_P2P_DISABLE=1' before training, and I have run our code on different servers and GPUs. It might be an implicit incompatibility between pytorch-lightning and your GPUs. You could try upgrading pytorch-lightning and adjusting the code to resolve any conflicts.
Besides, we used two A5000 GPUs and set the number of workers to 10-12. The average time per epoch is around 5-7 minutes. Please note that we used Albumentations to accelerate the data augmentation process. It will be much slower if you load data from formats other than NumPy.
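To compare the per-iteration question above against the 5-7 minutes-per-epoch figure, the conversion only needs the number of batches per epoch. The iteration count below is an invented example for illustration, not a number from this thread or the paper.

```python
def seconds_per_iteration(epoch_minutes: float, iterations_per_epoch: int) -> float:
    """Convert an epoch-level timing into an approximate per-iteration cost."""
    return epoch_minutes * 60.0 / iterations_per_epoch

# A 6-minute epoch over a hypothetical 400 iterations per epoch
# works out to 0.9 seconds per iteration.
print(seconds_per_iteration(6, 400))
```

For 3D volume data the per-iteration cost will naturally be much higher, so comparing at the iteration level (rather than per epoch) is the fairer check.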
Hi, dear Siyi, thanks a lot for your solid work. I have a question about transferring your code to other datasets: does it support training with multiple GPUs? When I try to use 2 GPUs, the program seems to get stuck.