tianrun-chen / SAM-Adapter-PyTorch

Adapting Meta AI's Segment Anything to Downstream Tasks with Adapters and Prompts
MIT License

GPU memory is not released after training. #28

Closed · uobinxiao closed this issue 1 year ago

uobinxiao commented 1 year ago

I set the batch size to 1, and GPU usage during training is around 28192 MiB. However, when training finished and evaluation started, the memory usage doubled. Is there any way to fix this issue?

uobinxiao commented 1 year ago

I rewrote the code with the Hugging Face Accelerate library, so I am closing this issue.
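Roughly, the wiring looks like the following minimal sketch (with toy stand-ins for the real model and data, not this repo's actual train.py):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy stand-ins for the real SAM-Adapter model and dataset.
model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(8, 16), torch.randn(8, 1))
train_loader = DataLoader(dataset, batch_size=2)

accelerator = Accelerator()
# prepare() moves model, optimizer, and loader to the right device(s)
# and wraps them for multi-GPU / mixed precision, with no manual .cuda() calls.
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

loss_fn = nn.MSELoss()
for inp, gt in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inp), gt)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```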

peacemo commented 1 year ago

Hi, I use two A30 GPUs with 24 GB of memory each (48 GB in total), and I set the train batch size to 2 and the val & test batch size to 1. The first epoch works fine, but an 'out of memory' error is raised in the 2nd epoch. I'm curious about your solution with the Hugging Face Accelerate library. Could you please share more information about how you solved it?

uobinxiao commented 1 year ago

Hi, I used this library: https://huggingface.co/docs/accelerate/index. But I think a simpler solution is just to save the weights every epoch and then evaluate the checkpoints one by one.
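In sketch form (the directory name and both helper functions here are hypothetical, not code from this repo):

```python
import os
import torch

CKPT_DIR = './checkpoints'
os.makedirs(CKPT_DIR, exist_ok=True)

def save_epoch(model, epoch):
    # Called once per epoch during training; no in-training evaluation,
    # so eval never competes with training for GPU memory.
    torch.save(model.state_dict(), os.path.join(CKPT_DIR, f'epoch_{epoch}.pth'))

def evaluate_all(model, eval_fn):
    # Run in a separate process after training has exited, so all
    # training allocations have already been released.
    for name in sorted(os.listdir(CKPT_DIR)):
        state = torch.load(os.path.join(CKPT_DIR, name), map_location='cpu')
        model.load_state_dict(state)
        with torch.no_grad():
            print(name, eval_fn(model))
```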

Bingyang0410 commented 1 year ago

> Hi, I used this library: https://huggingface.co/docs/accelerate/index. But I think a simpler solution is just to save the weights every epoch and then evaluate the checkpoints one by one.

Hello, I have also run into this problem. Could you explain the solution in more detail? When I used accelerate I got this error: `'SequentialSampler' object has no attribute 'set_epoch'`.

I am confused about why there is not enough memory after training.

Bingyang0410 commented 1 year ago

It seems that I need to rewrite the code for the model training and dataset sections.

Bill-Ren commented 1 year ago

I think the cause of this problem lies in the eval_psnr() function in train.py: during evaluation it accumulates all of the val_loader results in pred_list and gt_list, which leads to the memory overflow. My solution is to replace eval_psnr in train.py with the corresponding code from test.py.
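The gist of that change, as a sketch (`loader` and `metric_fn` stand in for the real val_loader and metric; this is not the repo's exact code):

```python
import torch

def eval_metric_streaming(model, loader, metric_fn):
    """Average the metric batch by batch instead of collecting every
    prediction and ground truth in pred_list / gt_list, whose size
    grows with the whole validation set."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for batch in loader:
            pred = model(batch['inp'])  # placeholder forward call
            total += metric_fn(pred, batch['gt']).item() * pred.shape[0]
            count += pred.shape[0]
    return total / max(count, 1)
```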

Bill-Ren commented 1 year ago

> > Hi, I used this library: https://huggingface.co/docs/accelerate/index. But I think a simpler solution is just to save the weights every epoch and then evaluate the checkpoints one by one.
>
> Hello, I have also run into this problem. Could you explain the solution in more detail? When I used accelerate I got this error: `'SequentialSampler' object has no attribute 'set_epoch'`. I am confused about why there is not enough memory after training.
>
> It seems that I need to rewrite the code for the model training and dataset sections.

I am not very familiar with Accelerate; could you tell me how to change the code?

l1uj1awe103 commented 1 year ago

Try wrapping the evaluation loop in `with torch.no_grad():`.
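For example, a sketch (`model` and `val_loader` are placeholders for the script's own objects):

```python
import torch

def run_eval(model, val_loader):
    model.eval()
    with torch.no_grad():            # no autograd graph is built, so activations
        for batch in val_loader:     # are freed right after each forward pass
            _ = model(batch['inp'])  # placeholder forward pass
    torch.cuda.empty_cache()         # optionally hand cached blocks back to CUDA
```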