titu1994 / Neural-Style-Transfer

Keras Implementation of Neural Style Transfer from the paper "A Neural Algorithm of Artistic Style" (http://arxiv.org/abs/1508.06576) in Keras 2.0+
Apache License 2.0

Cannot reproduce example outputs #7

Closed: kronion closed this issue 8 years ago

kronion commented 8 years ago

The program doesn't seem to work as intended, even when I try to reproduce the example outputs.

python Network.py images/inputs/content/sagano_bamboo_forest.jpg images/inputs/style/patterned_leaves.jpg images/output/test3 --num_iter 50

Epoch 10: test3_at_iteration_10

Epoch 20: test3_at_iteration_20

Epoch 30: test3_at_iteration_30

Epoch 40: test3_at_iteration_40

Epoch 50: test3_at_iteration_50

I've tried running on Ubuntu 16.04 with an Nvidia GeForce GTX 670, and also OSX 10.10.4 CPU-only. Tensorflow backend in both cases. I see similar results in both cases.

titu1994 commented 8 years ago

@kronion Is your keras image_dim_ordering set to "tf"? Check in the ~/.keras/keras.json file.
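A quick way to check both settings at once (a minimal sketch, assuming the Keras 1.x config keys and the usual ~/.keras/keras.json location; the helper name is hypothetical, not part of this repo):

```python
import json
import os

def check_keras_config(path=None):
    """Read the Keras config and flag a backend/dim-ordering mismatch.

    Returns (backend, image_dim_ordering, consistent). The pairings
    that match are ("tensorflow", "tf") and ("theano", "th"); anything
    else loads kernels in the wrong layout.
    """
    path = path or os.path.expanduser("~/.keras/keras.json")
    with open(path) as f:
        cfg = json.load(f)
    backend = cfg.get("backend")
    ordering = cfg.get("image_dim_ordering")
    consistent = (backend, ordering) in [("tensorflow", "tf"), ("theano", "th")]
    return backend, ordering, consistent
```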

Also, this was on CPU? How much time did it actually take? I assume it took an enormous amount of time.

Edit: This looks like a case of using the TensorFlow backend with image_dim_ordering set to "th". Because of this, the Theano weights are being loaded under the TensorFlow backend (Theano convolution kernels need to be flipped before being used in TensorFlow; if they aren't flipped, this is usually the result).
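For illustration, the flip described here amounts to reversing the spatial axes of each kernel, since Theano performs true convolution while TensorFlow performs cross-correlation. A minimal numpy sketch, assuming kernels stored as (rows, cols, in_channels, out_channels); the function name is hypothetical:

```python
import numpy as np

def flip_theano_kernel(w):
    """Flip a Theano-trained conv kernel for use with TensorFlow.

    Reverses the two spatial dimensions of each filter; channel
    dimensions are left untouched. Assumes shape
    (rows, cols, in_channels, out_channels).
    """
    return w[::-1, ::-1, :, :].copy()
```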

kronion commented 8 years ago
{
    "floatx": "float32",
    "backend": "tensorflow",
    "epsilon": 1e-07,
    "image_dim_ordering": "tf"
}

on both machines.

And yes, it did take an enormous amount of time. I was able to do 25 epochs in about 6 hours, and I cut it off there.

kronion commented 8 years ago

Here are the logs from running ten epochs on my GPU. I thought the memory warnings could have something to do with the problems I'm experiencing, which is why I waited to reproduce them on a CPU before posting.

$ python Network.py images/inputs/content/sagano_bamboo_forest.jpg images/inputs/style/patterned_leaves.jpg images/output/test4

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce GTX 670
major: 3 minor: 0 memoryClockRate (GHz) 0.98
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.78GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 670, pci bus id: 0000:01:00.0)
Model loaded.
Start of iteration 1
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 1.11GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 1.11GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.15GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 1.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.15GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 1.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.15GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 1.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.15GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 1.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
Network.py:352: RuntimeWarning: invalid value encountered in double_scalars
  improvement = (prev_min_val - min_val) / prev_min_val * 100
Current loss value: 1.00896e+08  Improvement : nan %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_1.png
Iteration 1 completed in 23s
Start of iteration 2
Current loss value: 3.24734e+07  Improvement : 67.815 %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_2.png
Iteration 2 completed in 21s
Start of iteration 3
Current loss value: 1.9547e+07  Improvement : 39.806 %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_3.png
Iteration 3 completed in 20s
Start of iteration 4
Current loss value: 1.44546e+07  Improvement : 26.052 %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_4.png
Iteration 4 completed in 21s
Start of iteration 5
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 21978 get requests, put_count=21977 evicted_count=1000 eviction_rate=0.0455021 and unsatisfied allocation rate=0.0500956
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
Current loss value: 1.18752e+07  Improvement : 17.845 %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_5.png
Iteration 5 completed in 20s
Start of iteration 6
Current loss value: 9.70512e+06  Improvement : 18.274 %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_6.png
Iteration 6 completed in 21s
Start of iteration 7
Current loss value: 8.33909e+06  Improvement : 14.075 %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_7.png
Iteration 7 completed in 20s
Start of iteration 8
Current loss value: 7.34698e+06  Improvement : 11.897 %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_8.png
Iteration 8 completed in 21s
Start of iteration 9
Current loss value: 6.64953e+06  Improvement : 9.493 %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_9.png
Iteration 9 completed in 21s
Start of iteration 10
Current loss value: 5.85107e+06  Improvement : 12.008 %
Rescaling Image to (1080, 1920)
Image saved as images/output/test4_at_iteration_10.png
Iteration 10 completed in 21s
titu1994 commented 8 years ago

@kronion Hmm, that is weird. The code does handle all the TensorFlow differences properly (Network.py is the same as the Keras example, just with the variables exposed as arguments). See https://github.com/fchollet/keras/blob/master/examples/neural_style_transfer.py

Since I am on Windows, I can't use the TensorFlow backend to check. Since the original script has been tested on both backends, I assume Network.py should produce the exact same results. It's loading the same weights and the same models as well, so I don't understand what's going wrong.

Can I bother you to run the original script and see if the results are still wrong?

kronion commented 8 years ago

Weird, using Theano worked. So maybe TF support is buggy in the original implementation? Perhaps this line is the hint; I don't see it when I run with Theano:

Network.py:352: RuntimeWarning: invalid value encountered in double_scalars

That or I'm not installing Tensorflow correctly...
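For reference, the nan traces back to the `improvement = (prev_min_val - min_val) / prev_min_val * 100` line shown in the log above: when the previous loss is zero or uninitialized, the division produces nan and numpy emits that RuntimeWarning. A minimal sketch of a guarded version (the function name is hypothetical, not from Network.py):

```python
import math

def improvement_pct(prev_min_val, min_val):
    """Percentage improvement in loss, guarded against a bad first iteration.

    If the previous loss is missing, zero, or nan, the division is
    undefined, so report 0.0 instead of letting nan propagate.
    """
    if prev_min_val is None or prev_min_val == 0 or math.isnan(prev_min_val):
        return 0.0
    return (prev_min_val - min_val) / prev_min_val * 100
```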

titu1994 commented 8 years ago

@kronion Have you tested it using the original script? That warning comes after the image has already been created and the loss value has been returned. It's raised during the calculation of "improvement", and it isn't raised when using Theano.