tensorlayer / SRGAN

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
https://github.com/tensorlayer/tensorlayerx

Epoch stuck at 0 #175

Open Teragron opened 5 years ago

lunayang712 commented 5 years ago

@Teragron For training you need both the train_LR and train_HR folders.
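A quick way to verify the dataset layout before training. The folder names below are assumptions based on the standard DIV2K download, not confirmed by this repo's config; check `config.py` for the paths the script actually expects.

```python
from pathlib import Path

# Hypothetical folder names from the standard DIV2K download;
# adjust to whatever config.py in the repo points at.
EXPECTED = [
    "DIV2K_train_HR",
    "DIV2K_train_LR_bicubic",
    "DIV2K_valid_HR",
    "DIV2K_valid_LR_bicubic",
]

def missing_folders(root):
    """Return the expected DIV2K subfolders that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```

If `missing_folders("DIV2K")` returns an empty list, the layout is at least structurally complete.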

Teragron commented 5 years ago

@lunayang712 Thanks a lot, so I need 5 folders in DIV2K, right?

lunayang712 commented 5 years ago

yeah. good luck! @Teragron

lunayang712 commented 5 years ago

Check the LR image path? Or maybe the epoch and batch-size parameters are set wrong.
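One way the epoch can appear stuck at 0 is when the training-image glob matches nothing, so every epoch runs zero steps. A minimal sanity check; the pattern below is an assumption, so use whatever path and extension your config actually points at:

```python
import glob

def count_images(pattern):
    """Count files matching the given glob pattern."""
    return len(glob.glob(pattern))

# Hypothetical path - substitute the LR path from your config.
n_lr = count_images("DIV2K/DIV2K_train_LR_bicubic/*.png")
if n_lr == 0:
    print("No LR images found - fix the path before training")
```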

mcDandy commented 5 years ago

The epoch will advance eventually; it depends on the size of the dataset. This is with batch_size=16 and the output image array set to 4x4; images are 1024x768, count: 1658.

```
Epoch: [0/100] step: [100/6] time: 31.102s, mse: 0.066
Epoch: [0/100] step: [101/6] time: 31.113s, mse: 0.051
Epoch: [0/100] step: [102/6] time: 31.199s, mse: 0.095
Epoch: [1/100] step: [0/6] time: 33.887s, mse: 0.060
Epoch: [1/100] step: [1/6] time: 31.681s, mse: 0.065
Epoch: [1/100] step: [2/6] time: 34.661s, mse: 0.064
Epoch: [1/100] step: [3/6] time: 32.734s, mse: 0.062
Epoch: [1/100] step: [4/6] time: 33.832s, mse: 0.075
Epoch: [1/100] step: [5/6] time: 31.289s, mse: 0.046
Epoch: [1/100] step: [6/6] time: 31.343s, mse: 0.038
Epoch: [1/100] step: [7/6] time: 33.424s, mse: 0.045
Epoch: [1/100] step: [8/6] time: 31.506s, mse: 0.047
Epoch: [1/100] step: [9/6] time: 34.266s, mse: 0.059
Epoch: [1/100] step: [10/6] time: 35.366s, mse: 0.036
```
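The step counter running past its displayed total (e.g. `step: [102/6]`) suggests the printed denominator is computed from the wrong quantity, not that training is looping. With 1658 images and batch_size=16, the real steps per epoch work out as:

```python
import math

n_images = 1658
batch_size = 16

# Full batches only (what most training loops iterate over):
full_steps = n_images // batch_size           # 103
# Including a final partial batch:
all_steps = math.ceil(n_images / batch_size)  # 104
```

Neither matches the "/6" in the log, so the display is misleading rather than the run being stuck.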

mcDandy commented 5 years ago

This is strange too:

```
Epoch: [2/1] step: [70/125] time: 198.491s, g_loss(mse:0.032, vgg:0.051, adv:0.007) d_loss: 1.294
Epoch: [2/1] step: [71/125] time: 219.722s, g_loss(mse:0.033, vgg:0.054, adv:0.004) d_loss: 0.420
Epoch: [2/1] step: [72/125] time: 221.830s, g_loss(mse:0.024, vgg:0.042, adv:0.005) d_loss: 0.463
Epoch: [3/1] step: [0/125] time: 213.389s, g_loss(mse:0.032, vgg:0.036, adv:0.004) d_loss: 0.400
Epoch: [3/1] step: [1/125] time: 234.970s, g_loss(mse:0.033, vgg:0.031, adv:0.010) d_loss: 0.902
```

zsdonghao commented 5 years ago

Why is the training so slow? Did you use a GPU?

mcDandy commented 5 years ago

Dataset info: 1176 images at 1920x1080
CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
GPU: NVIDIA Quadro P1000

I have installed tensorflow-gpu, but it does not appear to be used, at least in the first phase:

```
Epoch: [0/1] step: [3/0] time: 42.531s, mse: 0.325
```

Sometimes there are sparks of activity on the CUDA cores lasting about 1 min; they do not appear to have an effect on step time. Memory currently 35 GB / 16 GB.

```
Epoch: [0/1] step: [71/0] time: 38.153s, mse: 0.037
Epoch: [0/1] step: [72/0] time: 38.961s, mse: 0.050
Epoch: [0/1] step: [0/125] time: 226.178s, g_loss(mse:0.042, vgg:0.044, adv:0.000) d_loss: 3.124
Epoch: [0/1] step: [1/125] time: 230.655s, g_loss(mse:0.054, vgg:0.061, adv:0.004) d_loss: 4.261
```

Memory usage is more than 200% of on-board memory. The sparks of GPU activity are continuing.


mcDandy commented 5 years ago

How do I train on the GPU? I restarted training because I changed the epoch count to 10 so the model saves. The GPU is doing basically nothing.

mcDandy commented 5 years ago

Batch size was 36 (the max I can go).

```
Epoch: [9/10] step: [31/0] time: 148.904s, mse: 0.033
2019-10-29 18:59:18.131505: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at conv_ops.cc:501 : Resource exhausted: OOM when allocating tensor with shape[36,48,48,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
Traceback (most recent call last):
  File "train.py", line 202, in <module>
    train()
  File "train.py", line 116, in train
    logits_real = D(hr_patchs)
  File "C:\Program Files\Python37\lib\site-packages\tensorlayer\models\core.py", line 295, in __call__
    return self.forward(inputs, **kwargs)
  File "C:\Program Files\Python37\lib\site-packages\tensorlayer\models\core.py", line 338, in forward
    memory[node.name] = node(node_input)
  File "C:\Program Files\Python37\lib\site-packages\tensorlayer\layers\core.py", line 433, in __call__
    outputs = self.layer.forward(inputs, **kwargs)
  File "C:\Program Files\Python37\lib\site-packages\tensorlayer\layers\convolution\simplified_conv.py", line 271, in forward
    name=self.name,
  File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 1913, in conv2d_v2
    name=name)
  File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 2010, in conv2d
    name=name)
  File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 1038, in conv2d
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[36,48,48,256] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:Conv2D] name: conv2d_40
```
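Note that the failed allocation is on `device:CPU:0`, which is direct evidence the model is running on the CPU, not the GPU. As for the OOM itself, a rough back-of-envelope estimate of one activation tensor's size (assuming float32, which matches "type float" in the message) shows why batch 36 is heavy:

```python
# Shape from the OOM message: [36, 48, 48, 256], float32.
batch, h, w, channels = 36, 48, 48, 256
bytes_per_float = 4

tensor_bytes = batch * h * w * channels * bytes_per_float
tensor_mib = tensor_bytes / 2**20  # about 81 MiB for a single activation
```

The network keeps many such activations (plus gradients) alive at once, so halving batch_size roughly halves peak memory.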

Sam-JungSoonWoo commented 4 years ago

@mcDandy You need to set up CUDA to use the GPU.

Search: "tensorflow check use gpu" (tf.test.is_gpu_available | TensorFlow Core v2.3.0).
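A minimal sketch of that check, guarded so it also runs where TensorFlow is absent; `tf.config.list_physical_devices` is the TF 2.x replacement for the deprecated `tf.test.is_gpu_available`:

```python
def visible_gpus():
    """Return names of GPUs TensorFlow can see, or None if TF is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [d.name for d in tf.config.list_physical_devices("GPU")]

gpus = visible_gpus()
if gpus == []:
    print("TensorFlow found no GPU - check that your CUDA/cuDNN versions match your TF build")
```

An empty list with tensorflow-gpu installed usually means a CUDA/cuDNN version mismatch for that TensorFlow release.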

If you use AWS, you should use a Deep Learning AMI.