mhamilton723 / STEGO

Unsupervised Semantic Segmentation by Distilling Feature Correspondences

Training fails #2

Closed · evolu8 closed 1 year ago

evolu8 commented 2 years ago

Running training results in the following:

Error executing job with overrides: []
Traceback (most recent call last):
  File "train_segmentation.py", line 532, in my_app
    pos_labels=True
  File "/home/ec2-user/SageMaker/STEGO/src/core.py", line 740, in __init__
    raise ValueError("could not find nn file {} please run precompute_knns".format(feature_cache_file))
ValueError: could not find nn file /home/ec2-user/SageMaker/data/nns/nns_vit_small_cocostuff27_train_None_224.npz please run precompute_knns

Running precompute_knns then fails because the Potsdam dataset had not been unzipped. I unzipped it, ran again, and it failed with OOM:

Error executing job with overrides: []
Traceback (most recent call last):
  File "precompute_knns.py", line 84, in my_app
    normed_feats = get_feats(par_model, loader)
  File "precompute_knns.py", line 24, in get_feats
    feats = F.normalize(model.forward(img.cuda()).mean([2, 3]), dim=1)
  File "/home/ec2-user/anaconda3/envs/stego/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/ec2-user/anaconda3/envs/stego/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/stego/lib/python3.6/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/ec2-user/anaconda3/envs/stego/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/SageMaker/STEGO/src/modules.py", line 93, in forward
    feat, attn, qkv = self.model.get_intermediate_feat(img, n=n)
  File "/home/ec2-user/SageMaker/STEGO/src/dino/vision_transformer.py", line 232, in get_intermediate_feat
    x,attn,qkv = blk(x, return_qkv=True)
  File "/home/ec2-user/anaconda3/envs/stego/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/SageMaker/STEGO/src/dino/vision_transformer.py", line 107, in forward
    y, attn, qkv = self.attn(self.norm1(x))
  File "/home/ec2-user/anaconda3/envs/stego/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/SageMaker/STEGO/src/dino/vision_transformer.py", line 83, in forward
    attn = (q @ k.transpose(-2, -1)) * self.scale
RuntimeError: CUDA out of memory. Tried to allocate 3.53 GiB (GPU 0; 15.78 GiB total capacity; 9.58 GiB already allocated; 2.15 GiB free; 12.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
frank-xwang commented 2 years ago

I ran into the same issue. To resolve it, you can use a smaller batch size or more than one GPU.
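
For what it's worth, the multi-GPU route needs no code change: the traceback shows precompute_knns.py already wraps the model in torch.nn.DataParallel (the data_parallel.py frame), which splits each batch across all visible devices. A minimal sketch of checking and controlling that set; nothing here is STEGO-specific:

```python
import torch

# DataParallel (the data_parallel.py frame in the traceback above) splits
# each batch across every visible GPU, so per-device memory pressure drops
# as GPUs are added.
print(f"GPUs visible to PyTorch: {torch.cuda.device_count()}")

# To control which devices are visible, set CUDA_VISIBLE_DEVICES before
# launching, e.g.:
#   CUDA_VISIBLE_DEVICES=0,1 python precompute_knns.py
```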

evolu8 commented 2 years ago

Thanks @frank-xwang. I'm still digging into this a bit. Some of my images have a higher information density: despite having the same dimensions, channels, and bit depth, they take up about twice as much disk space when stored as JPEG.

I'm surprised this dataset needs a proportionally smaller batch size; I can't see any other reason for the OOM. But as I say, still digging...

But sure enough, with a smaller batch size and a smaller `dim`, things run without CUDA OOM.
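
A note for anyone hitting this later: the scripts are Hydra apps (hence the `Error executing job with overrides: []` header in the tracebacks above), so hyperparameters can usually be overridden from the command line rather than by editing the YAML. Assuming the train config exposes keys named `batch_size` and `dim` (check the config file for the exact names), something like `python train_segmentation.py batch_size=16 dim=64` would apply the same fix without touching the code.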

Shershebnev commented 2 years ago

I had the same issue with precompute_knns.py and solved it by reducing the hardcoded batch size here: https://github.com/mhamilton723/STEGO/blob/452ba7b65b441e1eee0a21a58b8c110b0bd72555/src/precompute_knns.py#L81 (256 → a smaller number)
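
Concretely, it is a one-line change at the linked line; a sketch, assuming the loader is constructed roughly as below (the other DataLoader arguments in the repo may differ):

```python
from torch.utils.data import DataLoader  # already imported in the script

# src/precompute_knns.py, at the linked line: replace the hardcoded 256
# with a value small enough to fit on your GPU, e.g. 64.
loader = DataLoader(dataset, batch_size=64, shuffle=False,
                    num_workers=cfg.num_workers)
```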

mhamilton723 commented 1 year ago

Thank you @Shershebnev, @frank-xwang, and @evolu8 for your suggestions. Closing this out for now.