It seems that one GPU cannot load the huge model, but even with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 it does not work.
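(A quick sanity check of my own, not from the repo, to confirm the variable actually took effect inside the process:)

```python
# Sketch: verify which GPUs the Python process can see after setting CUDA_VISIBLE_DEVICES.
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```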
I trained the huge model on 4 A6000 GPUs (48G each) with batch size 32, and the training occupied all of the memory. If your GPU memory is only 12G, I am not sure it is enough to train the huge model even with the batch size set to 1. There is also no point in using multiple GPUs with a batch size of 1. I will get back to you once I have time to test the training memory. BTW, can you train the base or large model on your GPU?
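To illustrate why adding GPUs does not help here (a minimal sketch with a stand-in model, not the repo's trainer): data-parallel training keeps a full replica of the model on every GPU and only splits the batch, so a model that does not fit on one 12G card still will not fit, no matter how many cards you add.

```python
# Sketch: data parallelism replicates the whole model on each visible GPU.
# The Sequential below is a hypothetical stand-in for the ViT-Huge backbone.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
).cuda()

model = nn.DataParallel(model)   # a full copy is created on every GPU at forward time
x = torch.randn(8, 1024).cuda()  # only the batch dimension is split across GPUs
out = model(x)                   # each GPU still needs memory for all parameters
```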
Yes, I can train the base model. Maybe you are right; my GPU memory is only 12G.
I have another question. I used the base model pretrained on COCO+LVIS to fine-tune on my own dataset (2000 images, one class), and it performs badly, even reaching 20 NoC@0.85 (I only changed the dataset path and model path).
There must be a bug, because 20 NoC@85% means no image can be segmented to 85% IoU within 20 clicks. You can visualize the segmented images by adding "--vis-preds" to the evaluation script. I would suggest you double-check whether the dataset is preparing your images correctly.
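(For reference, this is roughly how the metric is read; an illustrative sketch of my own, not the repository's evaluation code: an image's NoC is the first click at which IoU reaches the threshold, and failures are counted at the click limit, which is why an average of 20 means every image failed.)

```python
# Illustrative sketch of the NoC metric (not the repository's evaluation code).
def noc(ious_per_click, iou_thresh=0.85, max_clicks=20):
    """Number of clicks needed to reach `iou_thresh`; `max_clicks` on failure."""
    for click, iou in enumerate(ious_per_click[:max_clicks], start=1):
        if iou >= iou_thresh:
            return click
    return max_clicks

print(noc([0.3, 0.6, 0.9]))  # -> 3
print(noc([0.1] * 25))       # -> 20 (never reaches 85% IoU)
```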
I just trained ViT-Huge on a single A6000 GPU (48G) and monitored the GPU memory consumption with "nvidia-smi": batch size = 1, 16073 MiB; batch size = 2, 19917 MiB; batch size = 4, 27675 MiB; batch size = 8, 43087 MiB. In our experiments we used 4 A6000 GPUs with a total batch size of 32, so each GPU processed 8 samples.
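If you want to reproduce such measurements from inside the training process rather than watching nvidia-smi, a rough sketch is below (`build_model` is a hypothetical placeholder for constructing the ViT-Huge model, and the input size is just an example). Note that `torch.cuda.max_memory_allocated` reports only PyTorch's own allocations, so it will read somewhat lower than nvidia-smi, which also counts the caching allocator's reserved memory and the CUDA context.

```python
# Rough sketch: measure peak training memory for a given batch size.
# `build_model` is a hypothetical stand-in for constructing the ViT-Huge model.
import torch

def peak_training_memory(build_model, batch_size, image_size=448):
    torch.cuda.reset_peak_memory_stats()
    model = build_model().cuda()
    opt = torch.optim.AdamW(model.parameters())
    x = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
    loss = model(x).mean()   # dummy loss, only to trigger a full backward pass
    loss.backward()
    opt.step()
    return torch.cuda.max_memory_allocated() / 1024**2  # MiB

# for bs in (1, 2, 4, 8):
#     print(bs, f"{peak_training_memory(build_model, bs):.0f} MiB")
```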
Hi, how long did it take to train the model?
Hi, it depends on the dataset and the model. Training ViT-B on SBD for 55 epochs with the settings reported in the paper took less than 4 hours.
great, thanks
Hi, how can I train the MAE huge model?
I used 8 GPUs with batch size 1 and got: RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 9.59 GiB already allocated; 5.44 MiB free; 9.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
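As the error message suggests, you can try the allocator's max_split_size_mb option to reduce fragmentation; a sketch is below (128 is only an example value, and it must be set before the first CUDA allocation). That said, given the ~16 GiB measured above for batch size 1, a 10.76 GiB card likely cannot train ViT-Huge with plain data-parallel training regardless of the allocator setting.

```python
# Sketch: enable max_split_size_mb to mitigate CUDA allocator fragmentation.
# Must be set before the first CUDA allocation; 128 is an example value, not a tuned one.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import (and any .cuda() calls) only after setting the variable
```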