tensorlayer / SRGAN

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
https://github.com/tensorlayer/tensorlayerx

Problems running on Windows #180

mcDandy closed this issue 4 years ago

mcDandy commented 4 years ago

python train.py

2019-10-08 20:55:32.978162: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2019-10-08 20:55:35.246078: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-10-08 20:55:35.333633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro P1000 major: 6 minor: 1 memoryClockRate(GHz): 1.5185
pciBusID: 0000:01:00.0
2019-10-08 20:55:35.341328: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-10-08 20:55:35.348135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-10-08 20:55:35.351468: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-10-08 20:55:35.359874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro P1000 major: 6 minor: 1 memoryClockRate(GHz): 1.5185
pciBusID: 0000:01:00.0
2019-10-08 20:55:35.366853: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-10-08 20:55:35.372725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-10-08 20:55:36.070183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-08 20:55:36.075760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-10-08 20:55:36.079969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-10-08 20:55:36.084090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3005 MB memory) -> physical GPU (device: 0, name: Quadro P1000, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-10-08 20:55:36.279220: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2019-10-08 20:55:37.369250: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
Traceback (most recent call last):
  File "train.py", line 202, in <module>
    train()
  File "train.py", line 74, in train
    G = get_G((batch_size, 96, 96, 3))
  File "D:\Users\\Downloads\srgan-master\model.py", line 27, in get_G
    n = BatchNorm(gamma_init=g_init)(n)
NameError: name 'BatchNorm' is not defined

Packages I noticed being installed or had to install:

tensorboard 2.0.0, tensorflow-estimator 2.0.0, tensorflow-gpu 2.0.0, tensorlayer 2.1.0

Pillow 6.2.0, google-pasta 0.1.7, Lasagne 0.1, Markdown 3.1.1

pip 19.2.3, Python 3.7.4

OS: Windows 10; CUDA Toolkit: 10.1 and 10.0; GPU: NVIDIA Quadro P1000; CPU: Intel Core i7-8750H

mcDandy commented 4 years ago

Is it because I use a custom dataset of PNG files (1920x1080)?

mcDandy commented 4 years ago

I only changed config.TRAIN.hr_img_path to point to my dataset.

rophen2333 commented 4 years ago

I have the same problem.

mcDandy commented 4 years ago

I changed line 27 from n = BatchNorm(gamma_init=g_init)(n) to n = BatchNorm2d(gamma_init=g_init)(n) and updated TensorFlow, but a different problem arose:

python train.py
2019-10-11 13:54:32.192366: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
100% (70151 of 70151) |##########################################################| Elapsed Time: 0:04:15 ETA: 00:00:00
Traceback (most recent call last):
  File "train.py", line 202, in <module>
    train()
  File "train.py", line 76, in train
    VGG = tl.models.vgg19(pretrained=True, end_with='pool4', mode='static')
  File "C:\Program Files\Python37\lib\site-packages\tensorlayer\models\vgg.py", line 317, in vgg19
    restore_model(model, layer_type='vgg19')
  File "C:\Program Files\Python37\lib\site-packages\tensorlayer\models\vgg.py", line 169, in restore_model
    npz = np.load(os.path.join('models', model_saved_name[layer_type]), encoding='latin1').item()
  File "C:\Program Files\Python37\lib\site-packages\numpy\lib\npyio.py", line 447, in load
    pickle_kwargs=pickle_kwargs)
  File "C:\Program Files\Python37\lib\site-packages\numpy\lib\format.py", line 696, in read_array
    raise ValueError("Object arrays cannot be loaded when "
ValueError: Object arrays cannot be loaded when allow_pickle=False

sintaxed commented 4 years ago

Hey, I fixed the BatchNorm problem by adding BatchNorm to the imports at the top of model.py. Change

from tensorlayer.layers import (Input, Conv2d, BatchNorm2d, Elementwise, SubpixelConv2d, Flatten, Dense)

to

from tensorlayer.layers import (Input, Conv2d, BatchNorm, BatchNorm2d, Elementwise, SubpixelConv2d, Flatten, Dense)

You should be able to fix the pickle error by downgrading to an older NumPy version (numpy==1.16.1).
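Downgrading works because NumPy 1.16.3 changed the default of np.load to allow_pickle=False, and the pretrained VGG19 weights are stored as a pickled object array (which is why vgg.py calls .item() on the result). If you are willing to patch the np.load call in the installed tensorlayer/models/vgg.py, an alternative is to pass allow_pickle=True explicitly. The sketch below shows this on a throwaway file. load_pickled_npy is a hypothetical helper and is not part of TensorLayer or NumPy.

```python
import os
import tempfile

import numpy as np


def load_pickled_npy(path):
    """Hypothetical helper: load a .npy file that holds a pickled Python object (e.g. a dict)."""
    # allow_pickle=True is needed since NumPy 1.16.3 made False the default.
    # Unpickling can execute arbitrary code, so only do this for files you trust.
    return np.load(path, encoding="latin1", allow_pickle=True).item()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.npy")
        # Saving a dict stores it as a 0-d object array, like the VGG19 weight file.
        np.save(path, {"conv1": np.zeros(3)})
        try:
            np.load(path)  # default allow_pickle=False on NumPy >= 1.16.3
        except ValueError as err:
            print("default np.load failed:", err)
        weights = load_pickled_npy(path)
        print(sorted(weights))
```

In vgg.py, the equivalent one-line change would be to add allow_pickle=True to the existing np.load(..., encoding='latin1') call on line 169.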

mcDandy commented 4 years ago

I am using TensorFlow 2.0.0 and NumPy 1.16.5. I will try it.

mcDandy commented 4 years ago

Fixed the import and downgraded NumPy.

The next error is strange, because the variable is assigned in the same scope a few lines above.

After that:

python train.py
2019-10-18 08:50:28.953029: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Traceback (most recent call last):
  File "train.py", line 202, in <module>
    train()
  File "train.py", line 104, in train
    tl.vis.save_images(fake_hr_patchs.numpy(), [2, 4], os.path.join(save_dir, 'train_ginit{}.png'.format(epoch)))
UnboundLocalError: local variable 'fake_hr_patchs' referenced before assignment

mcDandy commented 4 years ago

A dataset of 2 images is too small. A small fraction of my dataset (75 of about 25k images) is enough to run. Ignore the error above; it was caused only by the dataset being too small.
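The following minimal sketch shows why a too-small dataset surfaces as UnboundLocalError instead of a clearer message. It assumes, without having checked train.py, two things: fake_hr_patchs is assigned only inside the per-batch loop, and a batch smaller than batch_size ends the epoch early. Under those assumptions, two images never fill one batch, the loop body never assigns the variable, and the save_images call after the loop fails. run_epoch_buggy, run_epoch_guarded and the batch size of 8 are hypothetical.

```python
def run_epoch_buggy(images, batch_size):
    """Hypothetical stand-in for the per-epoch loop in train.py (not the real code)."""
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        if len(batch) != batch_size:  # assumption: an incomplete batch ends the epoch
            break
        fake_hr_patchs = [2 * x for x in batch]  # placeholder for G(lr_patchs)
    # Raises UnboundLocalError when no full batch was ever processed.
    return fake_hr_patchs


def run_epoch_guarded(images, batch_size):
    """Same loop, but fails early with a clear message when the dataset is too small."""
    if len(images) < batch_size:
        raise ValueError(
            f"dataset has {len(images)} images but batch_size is {batch_size}; "
            "add more images or lower the batch size"
        )
    fake_hr_patchs = None
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        if len(batch) != batch_size:
            break
        fake_hr_patchs = [2 * x for x in batch]
    return fake_hr_patchs
```

With a hypothetical batch size of 8, run_epoch_buggy([0, 1], 8) raises UnboundLocalError, while run_epoch_guarded reports the real cause. With 75 images, both versions process 9 full batches and drop the last 3 images.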

mcDandy commented 4 years ago

A different error might still happen when training finishes. I stopped it with Ctrl+C because I was running on battery.

... Epoch: [1/100] step: [2/6] time: 62.833s, mse: 0.499
Traceback (most recent call last):
  File "train.py", line 202, in <module>
    train()
  File "train.py", line 99, in train
    grad = tape.gradient(mse_loss, G.trainable_weights)
  File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\eager\backprop.py", line 1014, in gradient
    unconnected_gradients=unconnected_gradients)
  File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\eager\imperative_grad.py", line 76, in imperative_grad
    compat.as_str(unconnected_gradients.value))
  File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\eager\backprop.py", line 601, in _aggregate_grads
    return gen_math_ops.add_n(gradients)
  File "C:\Program Files\Python37\lib\site-packages\tensorflow_core\python\ops\gen_math_ops.py", line 455, in add_n
    name, _ctx._post_execution_callbacks, inputs)
KeyboardInterrupt

mcDandy commented 4 years ago

I should have closed this issue and created a new one. This is no longer a problem with starting, but with training and post-training.