rlqja1107 / torch-LLM4SGG

Official PyTorch implementation and source code for LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation, accepted at CVPR 2024

Multi-gpu training #4

Open slcheng97 opened 7 months ago

slcheng97 commented 7 months ago

I find that your code only supports single-GPU training. Can we run it with multiple GPUs?

rlqja1107 commented 7 months ago

Thank you for your interest in my work!

Initially, I developed my code on top of VS3, which was designed to run across multiple GPUs. However, when I tried to launch it on multiple GPUs with the python command provided below, I ran into difficulties. I suspect the issue stems from either the versions of the installed packages or the compatibility of my GPU environment (NVIDIA RTX A6000).

As a suggestion, I recommend trying the python training command provided by VS3. Alternatively, you may be able to run my code on multiple GPUs by prepending python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS to the command in our shell script; see the sketch below.
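As a rough sketch only (the entry point tools/train_net.py and the config path are assumptions based on the VS3/GLIP code base, so substitute whatever train_vg.sh actually invokes), the wrapped command might look like:

```bash
# Sketch only: wrapping the single-GPU command from train_vg.sh with
# torch.distributed.launch for multi-GPU training. The entry point and
# config path are assumptions; adjust them to match train_vg.sh.
NUM_GPUS=4

python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS \
    tools/train_net.py \
    --config-file configs/vg150/finetune_VG.yaml
```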

Training code of VS3

[screenshot of the VS3 training code]

slcheng97 commented 7 months ago

How much time is required to train your code on a single GPU? I'm currently running your training code on an RTX 3090, and the estimated time is 11.5 days...

rlqja1107 commented 7 months ago

Even with just one GPU, training completes within a day (at most two days), so it's strange that it takes 11.5 days for you. In the meantime, could you change the SOLVER.VAL_MIN_ITERATION variable to 30,000? This variable sets the iteration at which the model is first evaluated on the validation set; from that point on, validation is run every 5,000 iterations. Lowering it lets you evaluate the model at an earlier stage; see the sketch below.
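For example, something like the following should work (a sketch only, assuming the GLIP/VS3-style config system where trailing KEY VALUE pairs on the command line override the yaml; the entry point and config path are placeholders):

```bash
# Sketch only: evaluating earlier by lowering SOLVER.VAL_MIN_ITERATION.
# Assumes GLIP/VS3-style trailing KEY VALUE overrides; the entry point and
# config path are placeholders, not necessarily the command in train_vg.sh.
python tools/train_net.py \
    --config-file configs/vg150/finetune_VG.yaml \
    SOLVER.VAL_MIN_ITERATION 30000

# Equivalently, edit the yaml directly:
#   SOLVER:
#     VAL_MIN_ITERATION: 30000
```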

As shown in the picture below, it took less than a day to train for 100,000 iterations on my GPU device.

[screenshot: training log showing 100,000 iterations completed in under a day]

slcheng97 commented 7 months ago

Thank you for your response, @rlqja1107. I've attempted several times to execute the command bash train_vg.sh, and here is the training status:

[screenshot: training log of bash train_vg.sh with its ETA]

However, the training time is still too long... Would you mind sharing your training log for the command bash train_vg.sh?

rlqja1107 commented 7 months ago

Oh, is the prolonged duration you mentioned due to the ETA displayed in the log? The ETA is determined by SOLVER.MAX_EPOCH. By default it is set to 15 epochs, which is why the ETA indicates about 11 days, as shown in the figure above. However, training actually finishes well before that. When I set SOLVER.MAX_EPOCH to 3 epochs, which corresponds to 180,000 iterations (roughly 60,000 iterations per epoch), the ETA shifts accordingly, as shown in the figure below.

However, note that I terminated the training before 150,000 iterations. So you may stop training at an early iteration even if you set SOLVER.MAX_EPOCH to 15 epochs.

[screenshot: training log with SOLVER.MAX_EPOCH set to 3 and the shorter ETA]

I apologize for the confusion about the training time.

slcheng97 commented 7 months ago

I found two yaml files where I can adjust the number of training epochs: configs/vg150/finetune_VG.yaml

[screenshot: epoch setting in configs/vg150/finetune_VG.yaml]

configs/pretrain/glip_Swin_T_O365_GoldG.yaml

[screenshot: epoch setting in configs/pretrain/glip_Swin_T_O365_GoldG.yaml]

By the way, if I adjust the number of epochs from 30 to 15, do I need to adjust the learning rate accordingly? It would be greatly appreciated if you could share the actual training configuration file that was used during the training process. Thank you!

rlqja1107 commented 7 months ago

The former config (configs/vg150/finetune_VG.yaml) is the file in which to change the training epochs. When you change the epochs from 30 to 15, you don't have to adjust the learning rate. This may affect performance, but it doesn't cause significant fluctuations; see the fragment below.
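For illustration, a minimal sketch of the relevant part of configs/vg150/finetune_VG.yaml (the surrounding structure is an assumption; MAX_EPOCH is the only setting in question here):

```yaml
# Sketch only: the epoch setting discussed above, inside configs/vg150/finetune_VG.yaml.
# Other keys in the file stay as they are; the learning rate does not need to change.
SOLVER:
  MAX_EPOCH: 15   # e.g. 15 instead of 30
```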

Regarding the actual training configuration, I have already uploaded the config file to the README.md, as shown in the figure below.

Thank you. I hope this helps your understanding.

[screenshot: location of the training config file in the README.md]

slcheng97 commented 7 months ago

Thank you for your assistance; it's truly helpful. Based on your previous suggestions, if I adjust SOLVER.MAX_EPOCH to 3 in configs/vg150/finetune_VG.yaml, would I be able to reproduce the results reported in the paper?

rlqja1107 commented 7 months ago

Sorry for the late response.

In my experiments, varying SOLVER.MAX_EPOCH between 10 and 20 did not change the results significantly. With the max epoch set to 3, however, I cannot guarantee that the performance reported in the paper will be reproduced; it may turn out lower or higher.

Therefore, I recommend setting it to 15 epochs. However, since this could be considered a hyperparameter, it might be worth running experiments with 3 epochs for analytical purposes.