vkola-lab / tmi2022

A graph-transformer for whole slide image classification
MIT License

OOM error after the model has been running 34 epochs #8

Closed ChenYuhang243 closed 3 months ago

ChenYuhang243 commented 1 year ago

I am currently running this model on Ubuntu 18.04 on a machine with four V100 GPUs, each with 16 GB of memory. The batch size was set to 8, then lowered to 4, but I always hit a CUDA OOM error after 30-40 epochs. Methods tried:
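One generic mitigation when OOM appears at a fixed batch size is gradient accumulation: run smaller micro-batches and step the optimizer less often, keeping the effective batch size the same. This is a minimal PyTorch sketch, not this repository's code; the model, shapes, and hyperparameters are placeholders.

```python
# Hypothetical sketch: gradient accumulation to cut per-step GPU memory.
# The tiny linear model stands in for the graph-transformer.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)           # placeholder for the real model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 4                    # effective batch = micro_batch * accum_steps
micro_batch = 2                    # small enough to fit in GPU memory

opt.zero_grad()
for step in range(accum_steps):
    x = torch.randn(micro_batch, 16)            # placeholder mini-batch
    y = torch.randint(0, 2, (micro_batch,))
    loss = loss_fn(model(x), y) / accum_steps   # scale so grads match one big batch
    loss.backward()                             # gradients accumulate in .grad
opt.step()                                      # one update per accumulated batch
opt.zero_grad()
```

Scaling the loss by `accum_steps` makes the accumulated gradient equal the gradient of one large batch, so training dynamics stay comparable while peak memory follows the micro-batch size.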

nabilapuspit commented 1 year ago

Hi, can you tell me how you organized the data for your training? I'm a bit confused about where it goes in

--data_path "path_to_graph_data" \ #10

GSWS commented 3 months ago
  1. I'm not certain which dataset you were using, but WSIs of different sizes generate varying numbers of patches, and this may cause the memory issue.
  2. We found no memory issue with TCGA-NSCLC and CPTAC, and some work [1] used a batch size of 2 for CAMELYON16; you could try a batch size of 2 as well.
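Since the varying patch count per WSI is the suspected cause, another option is to cap the number of patch nodes per graph by random subsampling before batching. A hedged NumPy sketch; the function name, `max_nodes`, and feature shapes are illustrative and not from this repository.

```python
# Hypothetical sketch: cap patch-graph size so the largest slides
# cannot blow up GPU memory. Shapes are illustrative only.
import numpy as np

def subsample_graph(features, adjacency, max_nodes=2048, seed=0):
    """Keep at most `max_nodes` patches; slice features and adjacency to match."""
    n = features.shape[0]
    if n <= max_nodes:
        return features, adjacency
    rng = np.random.default_rng(seed)
    keep = np.sort(rng.choice(n, size=max_nodes, replace=False))
    # np.ix_ selects the same node subset on both axes of the adjacency matrix
    return features[keep], adjacency[np.ix_(keep, keep)]

# Toy example: a slide with 3000 patches and 512-d patch features
feats = np.zeros((3000, 512), dtype=np.float32)
adj = np.eye(3000, dtype=np.float32)
f2, a2 = subsample_graph(feats, adj, max_nodes=2048)
```

Sorting the kept indices preserves the original patch ordering, which matters if downstream positional information depends on node order.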

[1] Fourkioti et al., "CAMIL: Context-Aware Multiple Instance Learning for Cancer Detection and Subtyping in Whole Slide Images," ICLR 2024.