OOM when allocating tensor with shape[1,256,678,1020] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Conv2D]

tensorlayer / SRGAN

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

https://github.com/tensorlayer/tensorlayerx


OOM when allocating tensor with shape[1,256,678,1020] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Conv2D] #248

Closed: wm19999 closed this issue 1 year ago

wm19999 commented 1 year ago

Here is a screenshot of my GPU usage during the run. The screenshot below shows the details of the error. Thank you very much.

hanjr92 commented 1 year ago

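The network expects image input in NCHW layout, but images read from file default to HWC in the code, and the channel conversion is handled by the code itself. Have you performed any extra operations or changed the code in any way?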
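To make the layout difference concrete, here is a minimal sketch of an HWC to NCHW conversion using NumPy. It is not the repository's own conversion code; the array `image_hwc` and its small size are hypothetical stand-ins for an image read from disk.

```python
import numpy as np

# Hypothetical stand-in for a loaded image: height 4, width 6, 3 channels (HWC).
image_hwc = np.zeros((4, 6, 3), dtype=np.float32)

# Move the channel axis to the front (CHW), then add a leading batch axis (NCHW).
image_chw = np.transpose(image_hwc, (2, 0, 1))
image_nchw = image_chw[np.newaxis, ...]

print(image_hwc.shape)   # (4, 6, 3)
print(image_nchw.shape)  # (1, 3, 4, 6)
```

If a tensor shape in an error message has an unexpected axis order, checking the shape right after loading, as above, is a quick way to see whether an extra conversion has crept in.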

wm19999 commented 1 year ago

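So far I have only modified the values inside the red box; I have not changed anything else. Also, in the code shown in the screenshot, some parts are highlighted and some are greyed out. What is the reason for that? I would really appreciate an answer, thank you~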

hanjr92 commented 1 year ago

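I tested the tensorflow backend in my environment and did not run into this problem. Possible causes of the OOM: 1. Your GPU memory is not large enough; I can train and test normally with a batch size of 8. 2. You can try adding the following after the import os line in the code: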

import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU') 
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)

Something like the above. TensorFlow's own allocation mechanism may reserve GPU memory. However, looking at your shape[1,256,678,1020], is this an inference test?
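For a sense of scale, the size of the single tensor named in the error can be worked out from its shape. The sketch below assumes "type float" means 32-bit floats (4 bytes per element); it covers only this one tensor, not the weights or the other activations that must also fit in GPU memory.

```python
from math import prod

# Shape taken from the error message; 4 bytes per element assumes float32.
shape = (1, 256, 678, 1020)
bytes_per_float32 = 4

tensor_bytes = prod(shape) * bytes_per_float32
tensor_mib = tensor_bytes / 2**20

print(tensor_bytes)             # 708157440
print(f"{tensor_mib:.1f} MiB")  # 675.4 MiB
```

Because the size grows with height times width, a smaller input image reduces this figure proportionally.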

wm19999 commented 1 year ago

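OK, thank you~ I tried it and the same error still occurs. After reducing batch_size, training does run to completion, but the trained result is unusable. When I try to run the program with your pretrained model, entering the command directly in the terminal produces no output, but when I run the mode=eval line by itself, it outputs results similar to those in the readme file. What causes this?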

hanjr92 commented 1 year ago

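If you use a smaller batch size, the number of training epochs needs to be increased accordingly. The command line also works normally on my side. Are you using the command line on Windows?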

wm19999 commented 1 year ago

Yes, I am using the command line on Windows. Are you not using Windows? Which operating system are you on, then?

hanjr92 commented 1 year ago

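It works for me on Ubuntu; I have not tested it from the Windows command line. You can simply edit the parameters in the code and run the file directly, which is also quite convenient.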
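As an illustration of editing the parameter in the code instead of passing it on the command line, here is a minimal sketch. It assumes the script reads its mode through argparse; the `--mode` flag, its default, and the `build_parser` helper are hypothetical and may not match the repository's actual script.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical flag: the default is set to "eval" in the code,
    # so the file can be run without any command-line arguments.
    parser.add_argument("--mode", type=str, default="eval", help="train or eval")
    return parser


# Simulate running the file with no command-line arguments.
args = build_parser().parse_args([])
print(args.mode)  # eval
```

With the default changed this way, running the file from an IDE or with a plain python invocation picks up the edited value, and an explicit argument on the command line would still override it.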

wm19999 commented 1 year ago

OK, thank you very much~