Open slcheng97 opened 7 months ago

I find your code only supports single-GPU training. Can we run it with multiple GPUs?
Thank you for your interest in my work!
My code was initially built on VS3, which was designed to run across multiple GPUs. However, when I tried to launch it on multiple GPUs with the Python command below, I ran into difficulties. I suspect the issue stems from either the versions of the installed packages or the GPU environment (NVIDIA RTX A6000).
As a suggestion, I recommend trying the launch command provided by VS3. Alternatively, you may run my code on multiple GPUs by prepending `python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS` to the training command in our shell script.
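For example, a minimal sketch of such a launch (the entry script `tools/train_net.py` and its arguments are assumptions; substitute the actual command found inside `train_vg.sh`):

```bash
# Hypothetical multi-GPU launch: the script name and config path below are
# placeholders; use the actual training command from train_vg.sh.
NUM_GPUS=2
python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS \
    tools/train_net.py --config-file configs/vg150/finetune_VG.yaml
```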
How much time is required to train your code on a single GPU? I'm currently running your training code on an RTX 3090 and have observed that it will take 11.5 days...
Even with just one GPU, training is completed within a day (or at most two days). Oh... it's strange that it takes 11.5 days to train.
Alternatively, can you change the `SOLVER.VAL_MIN_ITERATION` variable to 30,000? This variable sets the iteration at which the model is first evaluated on the validation set; from that point on, validation runs every 5,000 iterations.
By lowering this variable, you can evaluate the model at an earlier stage.
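A sketch of how that might look on the command line, assuming the repository follows the GLIP/maskrcnn-benchmark convention of trailing yacs-style `KEY VALUE` overrides (this convention and the script name are assumptions; editing the yaml directly works as well):

```bash
# Hypothetical: run the first validation pass at iteration 30,000, assuming
# trailing yacs-style overrides are supported by the training script.
python tools/train_net.py --config-file configs/vg150/finetune_VG.yaml \
    SOLVER.VAL_MIN_ITERATION 30000
```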
As shown in the picture below, it took less than a day to train for 100,000 iterations on my GPU.
Thank you for your response, @rlqja1107. I've attempted to execute `bash train_vg.sh` several times, and here is the training status:
However, the training time is still too long...
Would you mind sharing your training log for the command `bash train_vg.sh`?
Oh, is the prolonged duration you mentioned the ETA displayed in the log?
The ETA is determined by `SOLVER.MAX_EPOCH`. By default, it is configured for 15 epochs, so the ETA indicates about 11 days, as shown in the figure above. However, training is actually completed before that.
When I set `SOLVER.MAX_EPOCH` to 3 epochs, for a total of 180,000 iterations, the ETA shifts accordingly, as shown in the figure below.
However, note that I terminated the training before 150,000 iterations. So you may stop training at an early iteration even if you set `SOLVER.MAX_EPOCH` to 15 epochs.
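(A rough back-of-the-envelope check using the numbers in this thread: 3 epochs correspond to 180,000 iterations, i.e. about 60,000 iterations per epoch, so 15 epochs amount to roughly 900,000 iterations. At the pace of about 100,000 iterations per day mentioned above, that is consistent with the ~11-day ETA.)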
I apologize for the confusion about the training time.
I found two yaml files where I can adjust the training epochs:
- `configs/vg150/finetune_VG.yaml`
- `configs/pretrain/glip_Swin_T_O365_GoldG.yaml`
By the way, if I change the number of epochs from 30 to 15, do I need to adjust the learning rate accordingly? It would be greatly appreciated if you could share the actual training configuration file used during training. Thank you!
The former config is the file in which to change the training epochs. When you change the epochs from 30 to 15, you don't have to change the learning rate. This may affect performance, of course, but it doesn't cause significant fluctuations.
Regarding the actual training configuration file, I have already uploaded it to the README.md, as shown in the figure below.
Thank you. I hope this helps your understanding.
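For instance, a minimal sketch of that epoch change (assuming the config contains a literal `MAX_EPOCH: 30` entry; check the file before running):

```bash
# Hypothetical one-liner: change the training epochs from 30 to 15 in the
# finetune config. Assumes a line exactly matching "MAX_EPOCH: 30" exists.
sed -i 's/MAX_EPOCH: 30/MAX_EPOCH: 15/' configs/vg150/finetune_VG.yaml
```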
Thank you for your assistance; it's truly helpful. Based on your previous suggestions, if I set `SOLVER.MAX_EPOCH` to 3 in `configs/vg150/finetune_VG.yaml`, would I be able to achieve the results reported in the paper?
Sorry for the late response.
When experimenting with `SOLVER.MAX_EPOCH` in a range from 10 to 20, the experimental results did not change significantly. With the max epoch set to 3, I cannot guarantee that the performance reported in the paper can be reproduced; it may turn out lower or higher.
Therefore, I recommend setting it to 15 epochs. However, since this can be considered a hyperparameter, it might be worth running experiments with 3 epochs for analytical purposes.