It seems that one GPU cannot load the huge model, but even with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 it does not work.
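(A quick sanity check of my own, not from the repo, to confirm the variable actually took effect inside the process:)

```python
# Sketch: verify which GPUs the Python process can see after setting CUDA_VISIBLE_DEVICES.
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```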
I trained the huge model on 4 A6000 GPUs (48G each) with batch size 32, and the training occupied all of the memory. If your GPU memory is only 12G, I am not sure it is enough to train the huge model even with the batch size set to 1. There is also no point in using multiple GPUs with a batch size of 1. I will get back to you once I have time to test the training memory. BTW, can you train the base or large model on your GPU?
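To illustrate why adding GPUs does not help here (a minimal sketch with a stand-in model, not the repo's trainer): data-parallel training keeps a full replica of the model on every GPU and only splits the batch, so a model that does not fit on one 12G card still will not fit, no matter how many cards you add.

```python
# Sketch: data parallelism replicates the whole model on each visible GPU.
# The Sequential below is a hypothetical stand-in for the ViT-Huge backbone.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
).cuda()

model = nn.DataParallel(model)   # a full copy is created on every GPU at forward time
x = torch.randn(8, 1024).cuda()  # only the batch dimension is split across GPUs
out = model(x)                   # each GPU still needs memory for all parameters
```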
Yes, I can train the base model. Maybe you are right; my GPU memory is only 12G.
I have another question. I used the base model pretrained on COCO+LVIS to fine-tune on my own dataset (2000 images, one class), and it performs badly, even reaching 20 NoC@0.85 (I only changed the dataset path and model path).
There must be a bug, because 20 NoC@85% means no image can be segmented to 85% IoU within 20 clicks. You can visualize the segmented images by adding "--vis-preds" to the evaluation script. I would suggest you double-check whether the dataset is preparing your images correctly.
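(For reference, this is roughly how the metric is read; an illustrative sketch of my own, not the repository's evaluation code: an image's NoC is the first click at which IoU reaches the threshold, and failures are counted at the click limit, which is why an average of 20 means every image failed.)

```python
# Illustrative sketch of the NoC metric (not the repository's evaluation code).
def noc(ious_per_click, iou_thresh=0.85, max_clicks=20):
    """Number of clicks needed to reach `iou_thresh`; `max_clicks` on failure."""
    for click, iou in enumerate(ious_per_click[:max_clicks], start=1):
        if iou >= iou_thresh:
            return click
    return max_clicks

print(noc([0.3, 0.6, 0.9]))  # -> 3
print(noc([0.1] * 25))       # -> 20 (never reaches 85% IoU)
```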
I just trained ViT-Huge on a single A6000 GPU (48G) and monitored the GPU memory consumption with "nvidia-smi": batch size = 1, 16073 MiB; batch size = 2, 19917 MiB; batch size = 4, 27675 MiB; batch size = 8, 43087 MiB. In our experiments we used 4 A6000 GPUs with a total batch size of 32, so each GPU processed 8 samples.
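If you want to reproduce such measurements from inside the training process rather than watching nvidia-smi, a rough sketch is below (`build_model` is a hypothetical placeholder for constructing the ViT-Huge model, and the input size is just an example). Note that `torch.cuda.max_memory_allocated` reports only PyTorch's own allocations, so it will read somewhat lower than nvidia-smi, which also counts the caching allocator's reserved memory and the CUDA context.

```python
# Rough sketch: measure peak training memory for a given batch size.
# `build_model` is a hypothetical stand-in for constructing the ViT-Huge model.
import torch

def peak_training_memory(build_model, batch_size, image_size=448):
    torch.cuda.reset_peak_memory_stats()
    model = build_model().cuda()
    opt = torch.optim.AdamW(model.parameters())
    x = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
    loss = model(x).mean()   # dummy loss, only to trigger a full backward pass
    loss.backward()
    opt.step()
    return torch.cuda.max_memory_allocated() / 1024**2  # MiB

# for bs in (1, 2, 4, 8):
#     print(bs, f"{peak_training_memory(build_model, bs):.0f} MiB")
```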
Hi, how long did it take to train the model?
Hi, it depends on the dataset and the model. Training ViT-B on SBD for 55 epochs with the settings reported in the paper took less than 4 hours.
great, thanks
Hi, how can I train the MAE huge model?
I used 8 GPUs with batch size 1 and got: RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 9.59 GiB already allocated; 5.44 MiB free; 9.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
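As the error message suggests, you can try the allocator's max_split_size_mb option to reduce fragmentation; a sketch is below (128 is only an example value, and it must be set before the first CUDA allocation). That said, given the ~16 GiB measured above for batch size 1, a 10.76 GiB card likely cannot train ViT-Huge with plain data-parallel training regardless of the allocator setting.

```python
# Sketch: enable max_split_size_mb to mitigate CUDA allocator fragmentation.
# Must be set before the first CUDA allocation; 128 is an example value, not a tuned one.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import (and any .cuda() calls) only after setting the variable
```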