uncbiag / SimpleClick

SimpleClick: Interactive Image Segmentation with Simple Vision Transformers (ICCV 2023)
MIT License

how can i train the mae huge model? #3

Closed senlin-ali closed 1 year ago

senlin-ali commented 1 year ago

Hi, how can I train the MAE huge model?

I used 8 GPUs with batch size 1 and got: `RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 9.59 GiB already allocated; 5.44 MiB free; 9.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF`
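The error message itself points at one mitigation: capping the caching allocator's split block size to reduce fragmentation. A minimal sketch of how to set it (128 MiB is just a common starting value, not a recommendation from this repo; launch training afterwards as usual):

```shell
# Cap the split block size of PyTorch's caching allocator, as suggested
# by the OOM message, to reduce memory fragmentation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
echo "$PYTORCH_CUDA_ALLOC_CONF"   # confirm the variable is exported
```

Note this only mitigates fragmentation; it cannot create capacity the card does not have, so it will not help if the model's fixed footprint already exceeds the GPU.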

senlin-ali commented 1 year ago

It seems that one GPU cannot load the huge model, but even with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 it does not work.

qinliuliuqin commented 1 year ago

I trained the huge model on 4 A6000 GPUs (48 GB each) with batch size 32; the training occupied all the memory. If your GPU has only 12 GB, I am not sure it's enough to train the huge model, even with the batch size set to 1. There is also no point in using multiple GPUs when the batch size is 1. I will get back to you once I have time to measure the training memory. BTW, can you train the base or large model on your GPU?

senlin-ali commented 1 year ago

Yeah, I can train the base model. Maybe you are right; my GPU memory is only 12 GB.

senlin-ali commented 1 year ago

I have another question. I used the base model pretrained on COCO+LVIS to fine-tune on my own dataset (2000 images, one class), but it performs badly, even 20 NoC@85% (I only changed the dataset path and the model path).

qinliuliuqin commented 1 year ago

There must be a bug, because 20 NoC@85% means no image can be segmented to 85% IoU within 20 clicks. You can visualize the segmented images by adding `--vis-preds` to the evaluation script. I would suggest you double-check that the dataset class loads your images correctly.
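For reference, an evaluation invocation with visualization enabled might look like the following. The script path, mode argument, and checkpoint/dataset values are assumptions based on the RITM-style layout this repo follows; only `--vis-preds` comes from the comment above, so check the README for the exact command:

```shell
# Hypothetical evaluation command (paths are placeholders);
# --vis-preds dumps predicted masks so failure cases can be inspected.
python scripts/evaluate_model.py NoBRS \
  --checkpoint=./weights/cocolvis_vit_base.pth \
  --datasets=GrabCut \
  --vis-preds
```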

qinliuliuqin commented 1 year ago

I just trained ViT-Huge on a single A6000 GPU (48 GB) and monitored GPU memory consumption with `nvidia-smi`:

- batch size 1: 16073 MiB
- batch size 2: 19917 MiB
- batch size 4: 27675 MiB
- batch size 8: 43087 MiB

In our experiments, we used 4 A6000 GPUs with batch size 32, so each GPU took 8 samples per step.
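A rough linear fit of these measurements separates the fixed overhead (weights, optimizer state) from the per-sample cost. This is only a back-of-the-envelope model, not an official number from the authors:

```python
# Least-squares fit of the reported ViT-Huge memory figures (MiB) vs. batch
# size, splitting memory into a fixed overhead plus a per-sample cost.
batch_sizes = [1, 2, 4, 8]
mem_mib = [16073, 19917, 27675, 43087]

n = len(batch_sizes)
mean_x = sum(batch_sizes) / n
mean_y = sum(mem_mib) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(batch_sizes, mem_mib)) \
        / sum((x - mean_x) ** 2 for x in batch_sizes)
intercept = mean_y - slope * mean_x

print(f"~{intercept:.0f} MiB fixed overhead + ~{slope:.0f} MiB per sample")
```

The fit gives roughly 12 GiB of fixed overhead plus about 3.9 GiB per sample, which is consistent with the earlier report: a 12 GB card runs out of memory on ViT-Huge even at batch size 1.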

xbkaishui commented 1 year ago

Hi, how long did it take to train the model?

qinliuliuqin commented 1 year ago

Hi, it depends on the dataset and the model. Training ViT-B on SBD for 55 epochs with the settings described in the paper took less than 4 hours.

xbkaishui commented 1 year ago

great, thanks