Open honeytidy opened 2 years ago
Hi, honeytidy,
I ran the deblurring demo (720x1280) on Replicate and Colab without any problem. Could the 'out of GPU memory' error be caused by a large image size?
I tried it again and it ran fine, but it doesn't work well on my test image. Here is the result from the Replicate demo:
Hi, honeytidy,
Please try the test image from our Colab demo. It might give better results.
I'm working on the Replicate demo, but it may take some time.
@mayorx
Hi, I guess it's better to ask in this issue. I am having the same problem with CUDA out of memory. I am one hundred percent sure it is caused by the large image size: I have tried small-resolution images and they work fine.
But I have a server with 4 x T4 GPUs, and I have adjusted num_gpu in the yml file:
  num_gpu: 4 # set num_gpu: 0 for cpu mode
But it seems that still only one GPU is used. Is there anything else I need to adjust?
Thank you so much for your help!
Hi, @xiaohulihutu, For single image inference, using only one GPU is what we expected. May I ask about the image size in this "cuda out of memory" case?
A workaround is to crop the image into patches, restore each patch, and then stitch the patches into a whole image.
It can be accomplished in this framework by modifying the testing config:
1) switch grids from false to true
2) add two parameters, crop_size_w and crop_size_h, after the parameter grids
It may look like:
val:
  save_img: true
  grids: true
  crop_size_h: 512
  crop_size_w: 512
  .... (other parameters, e.g. metrics)
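The crop, restore, and stitch idea described above can be sketched in plain Python. This is only an illustration, not the repository's grids() / grids_inverse() code: tile_starts, restore_tiled and the restore callback are hypothetical names, the image is a 2D list of numbers rather than a tensor, and overlapping regions are simply averaged.

```python
def tile_starts(size, crop):
    """Start offsets so that tiles of length `crop` cover [0, size)."""
    if size <= crop:
        return [0]
    starts = list(range(0, size - crop, crop))
    starts.append(size - crop)  # shift the last tile back so it stays inside
    return starts


def restore_tiled(img, crop_h, crop_w, restore):
    """Apply `restore` to each tile and average the results where tiles overlap."""
    h, w = len(img), len(img[0])
    acc = [[0.0] * w for _ in range(h)]  # sum of restored values per pixel
    cnt = [[0] * w for _ in range(h)]    # number of tiles covering each pixel
    for top in tile_starts(h, crop_h):
        for left in tile_starts(w, crop_w):
            patch = [row[left:left + crop_w] for row in img[top:top + crop_h]]
            out = restore(patch)  # stand-in for the network forward pass
            for di, out_row in enumerate(out):
                for dj, value in enumerate(out_row):
                    acc[top + di][left + dj] += value
                    cnt[top + di][left + dj] += 1
    return [[a / c for a, c in zip(row_a, row_c)]
            for row_a, row_c in zip(acc, cnt)]


# Toy usage: a 5x7 "image" and a fake restoration that doubles every pixel.
image = [[r * 7 + c for c in range(7)] for r in range(5)]
doubled = restore_tiled(image, 4, 4, lambda p: [[2 * v for v in row] for row in p])
```

Only one tile is in flight at a time, so peak memory depends on the crop size rather than on the full image size.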
@mayorx Thank you for your fast response. The image resolution is 3024*4032; it was taken with an iPhone.
I will try your crop-into-patches method and see how it works. Thank you very much!
Got the exact same issue. Image size: 6680 x 4441
Disable distributed.
load net keys <built-in method keys of dict object at 0x7fa4298c4900>
2022-12-09 18:04:27,421 INFO: Model [ImageRestorationModel] is created.
Traceback (most recent call last):
  File "basicsr/demo.py", line 61, in <module>
    main()
  File "basicsr/demo.py", line 49, in main
    model.test()
  File "/home/xxx/NAFNet/basicsr/models/image_restoration_model.py", line 247, in test
    pred = self.net_g(self.lq[i:j])
  File "/home/xxx/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/NAFNet/basicsr/models/archs/NAFNet_arch.py", line 136, in forward
    x = self.intro(inp)
  File "/home/xxx/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/xxx/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.09 GiB (GPU 0; 2.00 GiB total capacity; 939.42 MiB already allocated; 0 bytes free; 958.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
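A back-of-the-envelope estimate suggests why this single allocation is so large. The sketch below rests on assumptions that are not stated in the log: that the first convolution outputs 64 channels in float32, and that the input is padded up to a multiple of 16 on each side. Under those assumptions a 6680 x 4441 image needs about 7.09 GiB for that one activation, which matches the failed allocation and is far beyond a 2 GiB card. pad_to_multiple and activation_gib are hypothetical helper names.

```python
def pad_to_multiple(x, multiple):
    """Round x up to the next multiple (ceiling division on integers)."""
    return -(-x // multiple) * multiple


def activation_gib(height, width, channels=64, bytes_per_value=4, multiple=16):
    """Size in GiB of one (channels, H, W) activation after padding.

    channels=64, float32 and padding to 16 are assumptions, not values
    taken from the log.
    """
    h = pad_to_multiple(height, multiple)
    w = pad_to_multiple(width, multiple)
    return h * w * channels * bytes_per_value / 2**30


full = activation_gib(4441, 6680)  # the whole image in one forward pass
tile = activation_gib(512, 512)    # a single 512x512 crop
print(f"full image: {full:.2f} GiB, 512x512 tile: {tile:.4f} GiB")
```

The same activation for a 512 x 512 crop is 0.0625 GiB, which is why tiling is the usual workaround for very large inputs.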
I tried the grids config change suggested above and got a new error:
Disable distributed.
load net keys <built-in method keys of dict object at 0x7fb2f818d6c0>
2022-12-09 18:05:53,378 INFO: Model [ImageRestorationModel] is created.
Traceback (most recent call last):
  File "basicsr/demo.py", line 61, in <module>
    main()
  File "basicsr/demo.py", line 47, in main
    model.grids()
  File "/home/xxx/NAFNet/basicsr/models/image_restoration_model.py", line 110, in grids
    b, c, h, w = self.gt.size()
AttributeError: 'ImageRestorationModel' object has no attribute 'gt'
I ran into the same problem.
After adding
  crop_size_h: 256
  crop_size_w: 256
I got this error:
Traceback (most recent call last):
  File "basicsr/test.py", line 70, in <module>
    main()
  File "basicsr/test.py", line 61, in main
    model.validation(
  File "basicsr/models/base_model.py", line 55, in validation
    return self.dist_validation(dataloader, current_iter, tb_logger, save_img, rgb2bgr, use_image)
  File "basicsr/models/image_restoration_model.py", line 285, in dist_validation
    self.grids_inverse()
  File "basicsr/models/image_restoration_model.py", line 184, in grids_inverse
    preds[0, :, i: i + crop_size_h, j: j + crop_size_w] += self.outs[cnt]
AttributeError: 'ImageRestorationModel' object has no attribute 'outs'
I couldn't find any outs except in model.test:
def test(self):
    self.net_g.eval()
    with torch.no_grad():
        n = len(self.lq)
        outs = []
        m = self.opt['val'].get('max_minibatch', n)
        i = 0
        while i < n:
            j = i + m
            if j >= n:
                j = n
            pred = self.net_g(self.lq[i:j])
            if isinstance(pred, list):
                pred = pred[-1]
            outs.append(pred.detach().cpu())
            i = j
        self.output = torch.cat(outs, dim=0)
    self.net_g.train()
After changing it to
def test(self):
    self.net_g.eval()
    with torch.no_grad():
        n = len(self.lq)
        self.outs = []
        m = self.opt['val'].get('max_minibatch', n)
        i = 0
        while i < n:
            j = i + m
            if j >= n:
                j = n
            pred = self.net_g(self.lq[i:j])
            if isinstance(pred, list):
                pred = pred[-1]
            self.outs.append(pred.detach().cpu())
            i = j
        self.output = torch.cat(self.outs, dim=0)
    self.net_g.train()
I got this error:
Traceback (most recent call last):
  File "basicsr/test.py", line 70, in <module>
    main()
  File "basicsr/test.py", line 61, in main
    model.validation(
  File "basicsr/models/base_model.py", line 55, in validation
    return self.dist_validation(dataloader, current_iter, tb_logger, save_img, rgb2bgr, use_image)
  File "basicsr/models/image_restoration_model.py", line 285, in dist_validation
    self.grids_inverse()
  File "basicsr/models/image_restoration_model.py", line 184, in grids_inverse
    preds[0, :, i: i + crop_size_h, j: j + crop_size_w] += self.outs[cnt]
RuntimeError: output with shape [3, 256, 256] doesn't match the broadcast shape [4, 3, 256, 256]
I have to admit I have no idea what I'm doing here - Any help would be greatly appreciated.
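The shape mismatch can be reasoned about without running the model. The sketch below tracks shapes only and assumes, based on the error message, that the network was called once on a mini-batch of 4 patches of 3 x 256 x 256. cat_dim0 is a hypothetical helper that mimics the result shape of torch.cat(tensors, dim=0). It suggests that self.outs[cnt] is a whole mini-batch, whereas indexing the concatenated self.output yields one patch, which is what the slice of preds expects.

```python
# Shapes only, no tensors. Each entry is the shape of one mini-batch
# that test() appends to the list (assumed: a single batch of 4 patches).
batch_shapes = [(4, 3, 256, 256)]


def cat_dim0(shapes):
    """Result shape of concatenating tensors of these shapes along dim 0."""
    return (sum(s[0] for s in shapes),) + shapes[0][1:]


output_shape = cat_dim0(batch_shapes)   # shape of self.output

target_slice = (3, 256, 256)            # preds[0, :, i:i+256, j:j+256]
from_list = batch_shapes[0]             # self.outs[cnt]: a whole mini-batch
from_output = output_shape[1:]          # self.output[cnt]: one patch

print("outs[cnt]   ->", from_list, "matches slice:", from_list == target_slice)
print("output[cnt] ->", from_output, "matches slice:", from_output == target_slice)
```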
@mexthecat It works with these changes:
1. Change NAFNet-width32.yml:
val:
  save_img: true
  grids: true
  crop_size_h: 512
  crop_size_w: 512
2. Add 'gt' in basicsr\demo.py at line 44:
model.feed_data(data={'lq': img.unsqueeze(dim=0), 'gt': img.unsqueeze(dim=0)})
3. Change 'outs' to 'output' in basicsr\models\image_restoration_model.py at line 183:
preds[0, :, i: i + crop_size_h, j: j + crop_size_w] += self.output[cnt]
Also replace m with a smaller value in test():
def test(self):
    self.net_g.eval()
    with torch.no_grad():
        n = len(self.lq)
        outs = []
        m = self.opt['val'].get('max_minibatch', n)
        m = 1  # set m here
        i = 0
        while i < n:
            j = i + m
            if j >= n:
                j = n
            pred = self.net_g(self.lq[i:j])
            if isinstance(pred, list):
                pred = pred[-1]
            outs.append(pred.detach().cpu())
            i = j
        self.output = torch.cat(outs, dim=0)
    self.net_g.train()
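Since test() reads m from self.opt['val'].get('max_minibatch', n), the same effect can probably be had by setting max_minibatch under val in the config instead of hard-coding m = 1. The sketch below reproduces only the chunking logic of the loop; minibatch_ranges is a hypothetical helper, and opt is a hand-written stand-in for the parsed YAML options.

```python
def minibatch_ranges(n, m):
    """Return the (i, j) index ranges that the while-loop in test() visits."""
    if m < 1:
        raise ValueError("mini-batch size must be at least 1")
    ranges = []
    i = 0
    while i < n:
        j = min(i + m, n)  # same as: j = i + m; if j >= n: j = n
        ranges.append((i, j))
        i = j
    return ranges


# Stand-in for the parsed config (assumed layout, mirroring the val section).
opt = {"val": {"grids": True, "crop_size_h": 512, "crop_size_w": 512,
               "max_minibatch": 1}}

n = 5  # number of patches produced by the grid split in this toy example
m = opt["val"].get("max_minibatch", n)
ranges = minibatch_ranges(n, m)
print(ranges)
```

With max_minibatch missing from the config, m falls back to n and all patches go through the network in a single batch, which is the memory-hungry case.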
If it still runs out of memory after these modifications, try setting a smaller max_split_size_mb in the terminal:
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
Now it should work without problems.
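The set command above is Windows cmd syntax; in Linux or macOS shells the equivalent is export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512. The variable can also be set from Python, provided this happens before PyTorch makes its first CUDA allocation. A minimal sketch, assuming it is placed at the very top of the entry script:

```python
import os

# Set this before PyTorch initializes CUDA; the safest place is before
# `import torch`, so the caching allocator sees it when it starts up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# import torch  # import only after the variable is set

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This setting only reduces fragmentation in PyTorch's caching allocator. It cannot make a single allocation fit that is larger than the GPU's total memory, so cropping into patches is still needed for very large images.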
I got the following error when running the Replicate demo (Deblurring):
BTW, the Colab demo gives the same error.