Getting Error when training model

ShiinaMitsuki commented 6 years ago

Hi there, I followed the instructions in the README but got the error below:

(dcgan) [sobey123@localhost DCGAN-tensorflow]$ ./train.sh
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
{'batch_size': 64,
 'beta1': 0.5,
 'checkpoint_dir': 'checkpoint',
 'crop': False,
 'dataset': 'market',
 'epoch': 100,
 'input_fname_pattern': '.jpg',
 'input_height': 128,
 'input_width': None,
 'learning_rate': 0.0002,
 'options': 1,
 'output_height': 256,
 'output_path': 'duke_result',
 'output_width': None,
 'sample_dir': 'samples',
 'sample_size': 1000,
 'train': True,
 'train_size': inf,
 'unrolled_lstm': False,
 'visualize': False}
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:84:00.0
Total memory: 7.93GiB
Free memory: 7.83GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:84:00.0)
WARNING:tensorflow:From /home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py:109 in build_model.: histogram_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.histogram. Note that tf.summary.histogram uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on their scope.
Traceback (most recent call last):
  File "main.py", line 103, in <module>
    tf.app.run()
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "main.py", line 81, in main
    sample_dir=FLAGS.sample_dir)
  File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 89, in __init__
    self.build_model()
  File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 114, in build_model
    self.D_, self.D_logits_ = self.discriminator(self.G, self.y, reuse=True)
  File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 324, in discriminator
    h4 = linear(tf.reshape(h3, [self.batch_size, -1]), 1, 'd_h4_lin')
  File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/ops.py", line 98, in linear
    tf.random_normal_initializer(stddev=stddev))
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
    custom_getter=custom_getter)
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
    custom_getter=custom_getter)
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
    validate_shape=validate_shape)
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 637, in _get_single_variable
    found_var.get_shape()))
ValueError: Trying to share variable discriminator/d_h4_lin/Matrix, but specified shape (131072, 1) and found shape (32768, 1).
(dcgan) [sobey123@localhost DCGAN-tensorflow]$ vim train.sh
(dcgan) [sobey123@localhost DCGAN-tensorflow]$ ./train.sh
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
{'batch_size': 64,
 'beta1': 0.5,
 'checkpoint_dir': 'checkpoint',
 'crop': False,
 'dataset': 'market',
 'epoch': 25,
 'input_fname_pattern': '.jpg',
 'input_height': 108,
 'input_width': None,
 'learning_rate': 0.0002,
 'options': 1,
 'output_height': 64,
 'output_path': 'duke_result',
 'output_width': None,
 'sample_dir': 'samples',
 'sample_size': 1000,
 'train': False,
 'train_size': inf,
 'unrolled_lstm': False,
 'visualize': False}
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:84:00.0
Total memory: 7.93GiB
Free memory: 7.83GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:84:00.0)
WARNING:tensorflow:From /home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py:109 in build_model.: histogram_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.
Instructions for updating:
Please switch to tf.summary.histogram. Note that tf.summary.histogram uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on their scope.
Traceback (most recent call last):
  File "main.py", line 103, in <module>
    tf.app.run()
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "main.py", line 81, in main
    sample_dir=FLAGS.sample_dir)
  File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 89, in __init__
    self.build_model()
  File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 114, in build_model
    self.D_, self.D_logits_ = self.discriminator(self.G, self.y, reuse=True)
  File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/model.py", line 324, in discriminator
    h4 = linear(tf.reshape(h3, [self.batch_size, -1]), 1, 'd_h4_lin')
  File "/home/sobey123/code/project/Person-reid-GAN-pytorch/DCGAN-tensorflow/ops.py", line 98, in linear
    tf.random_normal_initializer(stddev=stddev))
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1024, in get_variable
    custom_getter=custom_getter)
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 850, in get_variable
    custom_getter=custom_getter)
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 346, in get_variable
    validate_shape=validate_shape)
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 331, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/home/sobey123/miniconda2/envs/dcgan/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 637, in _get_single_variable
    found_var.get_shape()))
ValueError: Trying to share variable discriminator/d_h4_lin/Matrix, but specified shape (8192, 1) and found shape (25088, 1).

I just ran the conda env create -f dcgan.yml command, activated the virtualenv, and then ran python main.py --dataset market --options 1.

It seems this line of code causes the problem: model.py, line 114: self.D_, self.D_logits_ = self.discriminator(self.G, self.y, reuse=True)

Why are there two discriminator calls? Many thanks in advance!
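A note on the two shapes in the ValueError. The traceback shows the discriminator being called on the generator output self.G with reuse=True, so it has to share its weights with an earlier call, presumably the one on the real images. The numbers in both runs are consistent with the real images entering at input_height and the generated images at output_height. The sketch below reproduces them under assumptions that are inferred from the logs, not read from model.py: four stride-2 convolutions with SAME padding, square images, and 512 channels in the last conv layer. The function name flattened_features is made up for illustration.

```python
import math


def flattened_features(size, num_convs=4, stride=2, channels=512):
    """Length of the flattened feature map fed to d_h4_lin for a square image.

    Assumed architecture (inferred from the error message, not from model.py):
    num_convs stride-2 convolutions with SAME padding, `channels` output maps.
    """
    for _ in range(num_convs):
        size = math.ceil(size / stride)  # SAME padding rounds the size up
    return size * size * channels


# First run: input_height=128 (real images), output_height=256 (generated).
first_run = (flattened_features(128), flattened_features(256))
# Second run: input_height=108, output_height=64.
second_run = (flattened_features(108), flattened_features(64))

print(first_run)   # (32768, 131072)
print(second_run)  # (25088, 8192)
```

Both pairs match the "found" and "specified" shapes in the two tracebacks, which suggests the shared linear layer d_h4_lin is being asked to accept two different input sizes whenever input_height and output_height differ.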

qiaoguan commented 6 years ago

Hey, I altered the source code of main.py: just change the values of input_height and output_height to 128. Could you run the code again and see whether that solves the problem?
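Setting input_height and output_height to the same value (128 here) means the real and generated images reach the discriminator at the same size. A guard along these lines could catch a mismatch before the graph is built. It is only a sketch: check_flags and discriminator_input_sizes are hypothetical helpers that do not exist in the repository, and the rule that crop=True resizes real images to the output size is an assumption based on how the upstream DCGAN-tensorflow code appears to behave.

```python
def discriminator_input_sizes(input_height, output_height, crop):
    """Return (real_size, fake_size) as seen by the discriminator.

    Assumption: with crop=True real images are resized to output_height,
    otherwise they are fed at input_height. Generated images are always
    output_height.
    """
    real_size = output_height if crop else input_height
    fake_size = output_height
    return real_size, fake_size


def check_flags(input_height, output_height, crop=False):
    """Raise early if the two discriminator calls would see different sizes."""
    real_size, fake_size = discriminator_input_sizes(
        input_height, output_height, crop)
    if real_size != fake_size:
        raise ValueError(
            "real images are %d px but generated images are %d px; "
            "set input_height == output_height or enable crop"
            % (real_size, fake_size))
    return real_size


check_flags(128, 128)            # the configuration suggested above
try:
    check_flags(128, 256)        # the configuration from the first failing run
except ValueError as err:
    print("would fail:", err)
```

Raising before the model is constructed gives a message that names the two flags directly, which is easier to act on than the variable-sharing error from variable_scope.py.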

ShiinaMitsuki commented 6 years ago

Problem solved, thanks for helping!! One more question: how long did it take to train the dcgan on market1501? I'm now at epoch 300, but the sample images are still poor; my d_loss is small and g_loss tends to grow as the epochs go on.

I'm unfamiliar with GANs, but according to the loss function proposed by the paper:

it seems that g_loss should be small and d_loss should be big. I suspect that 300 epochs may be far from enough.
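On reading d_loss and g_loss: the sketch below assumes the training code uses the usual sigmoid cross-entropy GAN losses (discriminator: real labelled 1, fake labelled 0; generator: fake labelled 1). That is common in DCGAN implementations but is not confirmed anywhere in this thread, and the function names are made up for illustration. Under that assumption, a balanced game sits near d_loss = 2 ln 2 and g_loss = ln 2, while a small d_loss together with a growing g_loss is what a discriminator that easily separates real from fake looks like.

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Numerically stable cross-entropy between sigmoid(logit) and label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def gan_losses(real_logit, fake_logit):
    """Return (d_loss, g_loss) for one real and one generated sample."""
    d_loss = (sigmoid_cross_entropy(real_logit, 1.0)
              + sigmoid_cross_entropy(fake_logit, 0.0))
    g_loss = sigmoid_cross_entropy(fake_logit, 1.0)
    return d_loss, g_loss


# Discriminator cannot tell real from fake: it outputs probability 0.5 (logit 0).
balanced = gan_losses(real_logit=0.0, fake_logit=0.0)
# Discriminator is confident and correct on both samples.
d_dominates = gan_losses(real_logit=5.0, fake_logit=-5.0)

print([round(x, 2) for x in balanced])     # [1.39, 0.69]
print([round(x, 2) for x in d_dominates])  # [0.01, 5.01]
```

If the losses here behave like the second case, more epochs alone may not improve the samples, since the generator is receiving a signal from a discriminator it cannot currently fool.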

qiaoguan / Person-reid-GAN-pytorch

Getting Error when training model #6