OOM on eval - Githubissues

evolu8 commented 2 years ago

I get the following error when running on a 16 GB card. I tried reducing the values in the eval config, but it made no difference.

 "The default behavior for interpolate/upsample with float scale_factor changed "
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1088/1088 [04:06<00:00,  4.41it/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "eval_segmentation.py", line 144, in my_app
    outputs = {k: torch.cat(v, dim=0) for k, v in outputs.items()}
  File "eval_segmentation.py", line 144, in <dictcomp>
    outputs = {k: torch.cat(v, dim=0) for k, v in outputs.items()}
RuntimeError: [enforce fail at CPUAllocator.cpp:68] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 24053760000 bytes. Error code 12 (Cannot allocate memory)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

mhamilton723 commented 2 years ago

@evolu8 this should be fixed in master!

chetanyaabhinav commented 2 years ago

I am getting the same error when executing eval_segmentation.py.

axkoenig commented 1 year ago

Me too. With all the default settings of eval_segmentation.py, I get an OOM after about 12 minutes (at approx. 70% of the COCO val set). Any hints?

Process memory quickly rises to 40 GB. [Screenshot: process memory usage]

GPU memory looks good. [Screenshot: GPU memory usage]

axkoenig commented 1 year ago

Ah, now I see. The code first computes the STEGO outputs for all images and keeps them in memory, and only then runs the CRF, which overflows my 40 GB of RAM.
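
To illustrate the difference, here is a minimal sketch of the two patterns: accumulating every prediction before post-processing, versus post-processing each batch as soon as it is produced. This is not the code from eval_segmentation.py or the fix on master. The names `eval_accumulate`, `eval_streaming`, `predict` and `postprocess` are hypothetical, and plain Python lists stand in for the tensors and the CRF step.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

Batch = TypeVar("Batch")
Pred = TypeVar("Pred")
Result = TypeVar("Result")


def eval_accumulate(
    batches: Iterable[Batch],
    predict: Callable[[Batch], Pred],
    postprocess: Callable[[Pred], Result],
) -> List[Result]:
    """Keep every raw prediction, then post-process at the end.

    Peak memory grows with the size of the whole dataset, because all
    raw predictions are alive at the same time.
    """
    all_preds = [predict(b) for b in batches]
    return [postprocess(p) for p in all_preds]


def eval_streaming(
    batches: Iterable[Batch],
    predict: Callable[[Batch], Pred],
    postprocess: Callable[[Pred], Result],
) -> Iterator[Result]:
    """Post-process each batch right away and yield only the result.

    Only one batch of raw predictions is referenced at a time, so peak
    memory is bounded by the batch size rather than the dataset size.
    """
    for b in batches:
        pred = predict(b)
        yield postprocess(pred)


# Toy stand-ins (assumptions): "predict" inflates the data, "postprocess" reduces it.
batches = [[1, 2], [3, 4]]
predict = lambda b: [x * 10 for x in b]
postprocess = lambda p: sum(p)

# Both patterns give the same results; only the peak memory differs.
assert list(eval_streaming(batches, predict, postprocess)) == eval_accumulate(
    batches, predict, postprocess
)
```

In a real evaluation loop the same idea would mean running the post-processing and updating the metrics inside the batch loop, so that only small per-batch results are kept.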

axkoenig commented 1 year ago

Another thing I learned: there are two branches, one called "main" and the other "master". The master branch has a version of eval_segmentation.py that requires less memory.

mhamilton723 / STEGO

OOM on eval #1