jcjohnson / neural-style

Torch implementation of neural style algorithm
MIT License

Loss values stay the same for every iteration, with extremely large image sizes #428

Open ProGamerGov opened 6 years ago

ProGamerGov commented 6 years ago

When using -image_size 2432, -image_size 2560, and -image_size 2816, with -backend cudnn, -optimizer adam, and -style_scale 0.5, the loss values seem to remain the same in every iteration. Lower image sizes don't seem to suffer from this issue.

I also used -gpu 0,1,2,3,4,5,6,7 -multigpu_strategy 2,3,4,6,8,11,12, which is the most efficient set of parameters for multiple GPUs that I have come across thus far.

ubuntu@ip-Address:~/neural-style$ ./multires_1.sh
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 538683157
Successfully loaded models/VGG16_SOD_finetune.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8-SOD100: 1 1 4096 100
Setting up style layer          2       :       relu1_1
Setting up style layer          7       :       relu2_1
Setting up style layer          12      :       relu3_1
Setting up style layer          19      :       relu4_1
Setting up content layer        21      :       relu4_2
Setting up style layer          26      :       relu5_1
Capturing content targets
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]
  (1): nn.GPU(1) @ nn.Sequential {
    [input -> (1) -> (2) -> output]
    (1): cudnn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
  }
  (2): nn.GPU(2) @ nn.Sequential {
    [input -> (1) -> output]
    (1): nn.StyleLoss
  }
  (3): nn.GPU(3) @ nn.Sequential {
    [input -> (1) -> output]
    (1): cudnn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
  }
  (4): nn.GPU(4) @ nn.Sequential {
    [input -> (1) -> (2) -> output]
    (1): cudnn.ReLU
    (2): cudnn.SpatialMaxPooling(2x2, 2,2)
  }
  (5): nn.GPU(5) @ nn.Sequential {
    [input -> (1) -> (2) -> output]
    (1): cudnn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
  }
  (6): nn.GPU(6) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): nn.StyleLoss
    (2): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
    (3): cudnn.ReLU
  }
  (7): nn.GPU(7) @ nn.Sequential {
    [input -> (1) -> output]
    (1): cudnn.SpatialMaxPooling(2x2, 2,2)
  }
  (8): nn.GPU(8) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
    (1): cudnn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): nn.StyleLoss
    (4): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (5): cudnn.ReLU
    (6): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (7): cudnn.ReLU
    (8): cudnn.SpatialMaxPooling(2x2, 2,2)
    (9): cudnn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
    (10): cudnn.ReLU
    (11): nn.StyleLoss
    (12): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (13): cudnn.ReLU
    (14): nn.ContentLoss
    (15): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (16): cudnn.ReLU
    (17): cudnn.SpatialMaxPooling(2x2, 2,2)
    (18): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (19): cudnn.ReLU
    (20): nn.StyleLoss
  }
}
Capturing style target 1
Capturing style target 2
Capturing style target 3
Capturing style target 4
Capturing style target 5
Capturing style target 6
Capturing style target 7
Capturing style target 8
Running optimization with ADAM
Iteration 50 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219
Iteration 100 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219
Iteration 150 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219
Iteration 200 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 538683157
Successfully loaded models/VGG16_SOD_finetune.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8-SOD100: 1 1 4096 100
Setting up style layer          2       :       relu1_1
Setting up style layer          7       :       relu2_1
Setting up style layer          12      :       relu3_1
Setting up style layer          19      :       relu4_1
Setting up content layer        21      :       relu4_2
Setting up style layer          26      :       relu5_1
Capturing content targets
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]
  (1): nn.GPU(1) @ nn.Sequential {
    [input -> (1) -> (2) -> output]
    (1): cudnn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
  }
  (2): nn.GPU(2) @ nn.Sequential {
    [input -> (1) -> output]
    (1): nn.StyleLoss
  }
  (3): nn.GPU(3) @ nn.Sequential {
    [input -> (1) -> output]
    (1): cudnn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
  }
  (4): nn.GPU(4) @ nn.Sequential {
    [input -> (1) -> (2) -> output]
    (1): cudnn.ReLU
    (2): cudnn.SpatialMaxPooling(2x2, 2,2)
  }
  (5): nn.GPU(5) @ nn.Sequential {
    [input -> (1) -> (2) -> output]
    (1): cudnn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
  }
  (6): nn.GPU(6) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): nn.StyleLoss
    (2): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
    (3): cudnn.ReLU
  }
  (7): nn.GPU(7) @ nn.Sequential {
    [input -> (1) -> output]
    (1): cudnn.SpatialMaxPooling(2x2, 2,2)
  }
  (8): nn.GPU(8) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
    (1): cudnn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): nn.StyleLoss
    (4): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (5): cudnn.ReLU
    (6): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (7): cudnn.ReLU
    (8): cudnn.SpatialMaxPooling(2x2, 2,2)
    (9): cudnn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
    (10): cudnn.ReLU
    (11): nn.StyleLoss
    (12): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (13): cudnn.ReLU
    (14): nn.ContentLoss
    (15): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (16): cudnn.ReLU
    (17): cudnn.SpatialMaxPooling(2x2, 2,2)
    (18): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (19): cudnn.ReLU
    (20): nn.StyleLoss
  }
}
Capturing style target 1
Capturing style target 2
Capturing style target 3
Capturing style target 4
Capturing style target 5
Capturing style target 6
Capturing style target 7
Capturing style target 8
Running optimization with ADAM
Iteration 50 / 200
  Content 1 loss: 1840350.585938
  Style 1 loss: 2566.359043
  Style 2 loss: 3547471.069336
  Style 3 loss: 5368391.235352
  Style 4 loss: 355980.445862
  Style 5 loss: 13842.927933
  Total loss: 11128602.623463
Iteration 100 / 200
  Content 1 loss: 1840350.585938
  Style 1 loss: 2566.359043
  Style 2 loss: 3547471.069336
  Style 3 loss: 5368391.235352
  Style 4 loss: 355980.445862
  Style 5 loss: 13842.927933
  Total loss: 11128602.623463
Iteration 150 / 200
  Content 1 loss: 1840350.585938
  Style 1 loss: 2566.359043
  Style 2 loss: 3547471.069336
  Style 3 loss: 5368391.235352
  Style 4 loss: 355980.445862
  Style 5 loss: 13842.927933
  Total loss: 11128602.623463
Iteration 200 / 200
  Content 1 loss: 1840350.585938
  Style 1 loss: 2566.359043
  Style 2 loss: 3547471.069336
  Style 3 loss: 5368391.235352
  Style 4 loss: 355980.445862
  Style 5 loss: 13842.927933
  Total loss: 11128602.623463
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 538683157
Successfully loaded models/VGG16_SOD_finetune.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8-SOD100: 1 1 4096 100
Setting up style layer          2       :       relu1_1
Setting up style layer          7       :       relu2_1
Setting up style layer          12      :       relu3_1
Setting up style layer          19      :       relu4_1
Setting up content layer        21      :       relu4_2
Setting up style layer          26      :       relu5_1
Capturing content targets
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]
  (1): nn.GPU(1) @ nn.Sequential {
    [input -> (1) -> (2) -> output]
    (1): cudnn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
  }
  (2): nn.GPU(2) @ nn.Sequential {
    [input -> (1) -> output]
    (1): nn.StyleLoss
  }
  (3): nn.GPU(3) @ nn.Sequential {
    [input -> (1) -> output]
    (1): cudnn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
  }
  (4): nn.GPU(4) @ nn.Sequential {
    [input -> (1) -> (2) -> output]
    (1): cudnn.ReLU
    (2): cudnn.SpatialMaxPooling(2x2, 2,2)
  }
  (5): nn.GPU(5) @ nn.Sequential {
    [input -> (1) -> (2) -> output]
    (1): cudnn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
  }
  (6): nn.GPU(6) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): nn.StyleLoss
    (2): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
    (3): cudnn.ReLU
  }
  (7): nn.GPU(7) @ nn.Sequential {
    [input -> (1) -> output]
    (1): cudnn.SpatialMaxPooling(2x2, 2,2)
  }
  (8): nn.GPU(8) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
    (1): cudnn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): nn.StyleLoss
    (4): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (5): cudnn.ReLU
    (6): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (7): cudnn.ReLU
    (8): cudnn.SpatialMaxPooling(2x2, 2,2)
    (9): cudnn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
    (10): cudnn.ReLU
    (11): nn.StyleLoss
    (12): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (13): cudnn.ReLU
    (14): nn.ContentLoss
    (15): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (16): cudnn.ReLU
    (17): cudnn.SpatialMaxPooling(2x2, 2,2)
    (18): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (19): cudnn.ReLU
    (20): nn.StyleLoss
  }
}
Capturing style target 1
Capturing style target 2
Capturing style target 3
Capturing style target 4
Capturing style target 5
Capturing style target 6
Capturing style target 7
Capturing style target 8
Running optimization with ADAM
Iteration 50 / 200
  Content 1 loss: 1613944.433594
  Style 1 loss: 3785.628319
  Style 2 loss: 5391063.720703
  Style 3 loss: 8514136.230469
  Style 4 loss: 540189.697266
  Style 5 loss: 18604.844570
  Total loss: 16081724.554920
Iteration 100 / 200
  Content 1 loss: 1613944.433594
  Style 1 loss: 3785.628319
  Style 2 loss: 5391063.720703
  Style 3 loss: 8514136.230469
  Style 4 loss: 540189.697266
  Style 5 loss: 18604.844570
  Total loss: 16081724.554920
Iteration 150 / 200
  Content 1 loss: 1613944.433594
  Style 1 loss: 3785.628319
  Style 2 loss: 5391063.720703
  Style 3 loss: 8514136.230469
  Style 4 loss: 540189.697266
  Style 5 loss: 18604.844570
  Total loss: 16081724.554920
Iteration 200 / 200
  Content 1 loss: 1613944.433594
  Style 1 loss: 3785.628319
  Style 2 loss: 5391063.720703
  Style 3 loss: 8514136.230469
  Style 4 loss: 540189.697266
  Style 5 loss: 18604.844570
  Total loss: 16081724.554920

What is happening here, and is it possible to fix this?

Here's the nvidia-smi output:

ubuntu@ip-Address:~$ nvidia-smi

Fri Oct 20 01:58:01 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:17.0     Off |                    0 |
| N/A   62C    P0    63W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:18.0     Off |                    0 |
| N/A   47C    P0    71W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:00:19.0     Off |                    0 |
| N/A   67C    P0    61W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:00:1A.0     Off |                    0 |
| N/A   52C    P0    72W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   65C    P0    66W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   48C    P0    71W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   65C    P0    66W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   48C    P0    74W / 149W |      0MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
ubuntu@ip-Address:~$

Edit:

This also happened with a second content/style image combo at the same image size values.

htoyryla commented 6 years ago

No idea about loss problem, but your allocation of layers to the GPUs looks strange to me. The lower conv layers have a GPU of their own, sometimes a GPU handles nothing but a single style loss module (GPU2) or a pooling layer (GPU7). One conv layer is split so that its ReLU is handled on a different GPU (GPU3 - GPU4). And finally the whole upper half of the model (seven conv layers with their ReLUs and style loss modules) is handled by a single GPU.

I don't know if this can cause problems, but my intuition would say put a conv layer together with its associated activation layer and loss module onto the same gpu always.

Other peculiarities in your output: Style is captured 8 times while you only have 5 style layers. Don't know though if this is normal for multigpu use (which I don't have). Nvidia-smi shows no GPU memory being used.

ProGamerGov commented 6 years ago

@htoyryla

Nvidia-smi shows no GPU memory being used.

The nvidia-smi output was captured after the issue, when I didn't have neural_style.lua running. I have a bunch of saved nvidia-smi outputs here: https://gist.github.com/ProGamerGov/8f1d07d866700b159c4b15c4e2fda868

Style is captured 8 times while you only have 5 style layers.

As for the style images, they are all the same image, rotated and reflected to produce different style images. I used these scripts that I made to create them.

Don't know though if this is normal for multigpu use (which I don't have).

I haven't really experimented with multigpu before, so I could easily have screwed something up with the resource allocation parameter. But I also can't find examples of anyone creating images at these sizes with Neural-Style, so it could be an issue that was hidden until now.

How would you suggest that I structure the -multigpu_strategy parameter for a VGG-16 model?

htoyryla commented 6 years ago

Anyway, my main point was the strange way you have allocated layers to gpus:

The lower conv layers have a GPU of their own, sometimes a GPU handles nothing but a single style loss module (GPU2) or a pooling layer (GPU7). One conv layer is split so that its ReLU is handled on a different GPU (GPU3 - GPU4). And finally the whole upper half of the model (seven conv layers with their ReLUs and style loss modules) is handled by a single GPU.

I don't know if this can cause problems, but my intuition would say put a conv layer together with its associated activation layer and loss module onto the same gpu always.

Keep in mind that when assigning layers to gpus, the ReLUs (which actually form the output of a conv layer) and the style loss layers (which calculate the style loss) are also counted as separate layers, as are the pooling layers. I don't know why the way you allocated the layers would not work, but on the other hand, putting the layers that belong together onto the same gpu should be safer. It also helps to spread the memory usage evenly among the gpus if you want to maximize the image size.

htoyryla commented 6 years ago

How would you suggest that I structure the -multigpu_strategy parameter for a VGG-16 model?

You should figure that out yourself. Make a numbered list of all layers. Mark which layers belong together so as not to put them on different gpus. Then you can experiment allocating them to different gpus. Look at nvidia-smi to see which gpus use most memory.

I would assume that the conv layers use up most of the memory. ReLUs and loss layers belong naturally on the same gpu as the preceding conv layer; at least it makes no sense to put them on another gpu. I would also put a pooling layer on the same gpu as the conv layer that precedes it.

ProGamerGov commented 6 years ago

@htoyryla Thanks for the replies.

I'm not sure how I missed that the terminal output lists, under each GPU, which layers have been assigned to it. I was also working with a 1-12 range of values for the -multigpu_strategy parameter, instead of the actual 20 possible values.

There are 4 loss module layers used:

    (3): nn.StyleLoss
    (11): nn.StyleLoss
    (14): nn.ContentLoss
    (20): nn.StyleLoss

7 Convolution and ReLU pairs:

    (1): cudnn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU

    (4): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (5): cudnn.ReLU

    (6): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (7): cudnn.ReLU

    (9): cudnn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
    (10): cudnn.ReLU

    (12): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (13): cudnn.ReLU

    (15): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (16): cudnn.ReLU

    (18): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (19): cudnn.ReLU

And 2 pooling layers:

    (8): cudnn.SpatialMaxPooling(2x2, 2,2)
    (17): cudnn.SpatialMaxPooling(2x2, 2,2)

With 8 GPUs and what you theorized about GPU usage per layer, I think it would ideally be best to give each Convolution/ReLU pair its own GPU (using 7 GPUs), and put the lower-memory pooling and loss module layers on the 8th GPU. But due to how the -multigpu_strategy parameter works, the layers have to be assigned sequentially.

If every Convolution/ReLU pair gets its own GPU, a few of them will also have to share that GPU with a loss module or a pooling layer. But this also seems to leave the final 8th GPU unused.

Neither of these ideas factors in differences in memory usage between the 7 Convolution/ReLU pairs, but I assume that, just as higher layers normally use more memory, the same holds true under the -multigpu_strategy parameter. I could counteract this by making some of the lower Convolution/ReLU pairs share the same GPU, but then I potentially still have to split up the higher Convolution/ReLU pairs so as not to waste some of the GPUs. Though if splitting the layers is what causes the issue, I can't do that.

I was also considering that the issue could be caused by a limitation in the libraries or code that Neural-Style uses, though it makes more sense that it would be a library rather than Neural-Style's code itself. If it's caused by the optim library, that means the ADAM optimizer is related to the issue. If it's caused by the 'image' library, then things fail at some point when working with extremely large images. If Lua's limitations are the cause, then maybe a custom install would fix things. If it's the result of CUDA/cuDNN, then there's no way I could ever hope to fix it.

I would need to test all of these possibilities and hope that I catch something which clearly shows where the fault starts.

htoyryla commented 6 years ago

Of course the layers have to be assigned sequentially, as the calculation is done sequentially. One GPU calculates the first layers and then its output is passed to the next GPU, until the whole network has been calculated.

Because of this, you need to plan this with a sequential list, looking at where to place the splits. Much of your speculation is beside the point because of the sequential nature of multigpu calculation.

Most naturally, the pooling and style loss layers belong together with the conv layer before them. Put the conv layer, ReLU, style layers and pooling layer all together. Splitting a ReLU onto a different GPU makes no sense, as ReLU is usually calculated in place (a single simple additional calculation within the conv layer). You could put a style layer or a pooling layer on the next GPU too, if that helps you divide the memory usage better.
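
For illustration, here is a minimal sketch of how nn.GPU chunks a sequential model across devices (this is not neural_style.lua's actual setup_multi_gpu code, and it assumes at least two GPUs are available); each chunk runs on its own device and hands its output to the next:

    require 'nn'
    require 'cutorch'
    require 'cunn'

    -- Chunk 1: a conv layer and its in-place ReLU kept together on GPU 1.
    local chunk1 = nn.Sequential()
      :add(nn.SpatialConvolution(3, 64, 3, 3, 1, 1, 1, 1))
      :add(nn.ReLU(true))

    -- Chunk 2: the following pooling layer on GPU 2.
    local chunk2 = nn.Sequential()
      :add(nn.SpatialMaxPooling(2, 2, 2, 2))

    -- nn.GPU runs each wrapped module on its assigned device; the forward
    -- pass is still strictly sequential, with activations copied between GPUs.
    local net = nn.Sequential()
      :add(nn.GPU(chunk1, 1))
      :add(nn.GPU(chunk2, 2))
    net:cuda()

    local out = net:forward(torch.randn(1, 3, 64, 64):cuda())
    print(out:size())  -- 1 x 64 x 32 x 32

In neural_style.lua the same kind of wrapping is produced automatically from the -multigpu_strategy indices, as the nn.GPU(i) entries in your printed model show.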

If there is nothing to put to the 8th GPU then don't. It will not help much to move a single simple layer there, and it will only make things slower.

There may be a problem with some of the software used, but until you configure it in a way that makes sense, you cannot know.

htoyryla commented 6 years ago

@ProGamerGov you say you have 7 conv layers, but in the output in your first post there are 11 conv layers, of which 7 have been put onto gpu8. You have now been looking only at these higher layers.

To get the full numbered list of ALL layers, run neural-style once without multigpu; you can interrupt it as soon as the optimizer is started. Also remember that the layer numbering changes when you change the style and/or content layers.

Using the defaults for content and style layers, VGG16 looks like this (note that this listing includes TVLoss, which is not counted when specifying the strategy parameter, something you need to take into account):

nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> output]
  (1): nn.TVLoss
  (2): nn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
  (3): nn.ReLU
  (4): nn.StyleLoss
  (5): nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
  (6): nn.ReLU
  (7): nn.SpatialMaxPooling(2x2, 2,2)
  (8): nn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
  (9): nn.ReLU
  (10): nn.StyleLoss
  (11): nn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
  (12): nn.ReLU
  (13): nn.SpatialMaxPooling(2x2, 2,2)
  (14): nn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
  (15): nn.ReLU
  (16): nn.StyleLoss
  (17): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (18): nn.ReLU
  (19): nn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (20): nn.ReLU
  (21): nn.SpatialMaxPooling(2x2, 2,2)
  (22): nn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
  (23): nn.ReLU
  (24): nn.StyleLoss
  (25): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (26): nn.ReLU
  (27): nn.ContentLoss
  (28): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (29): nn.ReLU
  (30): nn.SpatialMaxPooling(2x2, 2,2)
  (31): nn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (32): nn.ReLU
  (33): nn.StyleLoss
}

htoyryla commented 6 years ago

If there is a software problem, it is probably not in the image library, as it is only used for loading, pre/deprocessing and saving images. In your case, it looks like the losses have been calculated properly, which means the forward pass through the model works OK. The fact that the losses do not change could indicate that the gradients end up being zero and the input image stays the same. Why this happens can be debugged once you have the process properly configured otherwise, but you have to know what you are doing.

ProGamerGov commented 6 years ago

@htoyryla I came up with using:

    -gpu 0,1,2,3,4,5,6,7 -multigpu_strategy 3,6,12,15,20,26,31
    -gpu 0,1,2,3,4,5,6 -multigpu_strategy 3,6,12,15,20,26

and a few other combinations.

As the default set of layers was:

Successfully loaded models/VGG16_SOD_finetune.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8-SOD100: 1 1 4096 100
Setting up style layer          2       :       relu1_1
Setting up style layer          7       :       relu2_1
Setting up style layer          12      :       relu3_1
Setting up style layer          19      :       relu4_1
Setting up content layer        21      :       relu4_2
Setting up style layer          26      :       relu5_1
Capturing content targets
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> output]
  (1): cudnn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
  (2): cudnn.ReLU
  (3): nn.StyleLoss
  (4): cudnn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
  (5): cudnn.ReLU
  (6): cudnn.SpatialMaxPooling(2x2, 2,2)
  (7): cudnn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
  (8): cudnn.ReLU
  (9): nn.StyleLoss
  (10): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
  (11): cudnn.ReLU
  (12): cudnn.SpatialMaxPooling(2x2, 2,2)
  (13): cudnn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
  (14): cudnn.ReLU
  (15): nn.StyleLoss
  (16): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (17): cudnn.ReLU
  (18): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
  (19): cudnn.ReLU
  (20): cudnn.SpatialMaxPooling(2x2, 2,2)
  (21): cudnn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
  (22): cudnn.ReLU
  (23): nn.StyleLoss
  (24): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (25): cudnn.ReLU
  (26): nn.ContentLoss
  (27): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (28): cudnn.ReLU
  (29): cudnn.SpatialMaxPooling(2x2, 2,2)
  (30): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
  (31): cudnn.ReLU
  (32): nn.StyleLoss
}

I also tested different combinations of parameters in an effort to rule out any parameter as the cause. I could affect the loss values, but they always remained the same between iterations. The issue also occurs with lbfgs and -lbfgs_num_correction. Using the NIN model even up to 6-8k resolution (with a single GPU) still works fine, so the issue may be related to how VGG models, rather than NIN, interact with Neural-Style.

I printed out the tensor from the self.gradInput variable in the StyleLoss:updateGradInput(input, gradOutput) function, and the values seemed to be normal, though trying to print the tensor in other places resulted in error messages about Lua not having enough memory.

Could this issue simply be caused by something reaching its size limit without producing an error message of any kind? The only thing that seems to be constant with this issue is the extremely large -image_size value. Maybe once the limit is reached, it fails to finish the calculations and keeps redoing the same iteration 1 calculations again and again? That makes more sense, because if my -multigpu_strategy parameters were the only cause, why would they still produce working loss values at lower image sizes?

Do you know any way I could test for a variable hitting its limit and stopping, or failing the calculations?

Edit:

I can confirm that the issue does not occur when using the NIN model and the -multigpu_strategy parameter with any of the values that I have tested. The issue also occurs with VGG-19 models.

htoyryla commented 6 years ago

I didn't mean to say that your multi-gpu strategy was causing the error, just that your original values were dividing the layers to the gpus in a way that is hardly optimal; as well as in a way that has hardly been the intention of the developers, and therefore maybe never tested.

It could be that a limit is reached which does not result in an error but instead in some variables going to zero (similar to how you may sometimes get nans or infs). If the gradients become zero, the input will not be changed and each iteration will process the same image, resulting in the same losses.

This may be tricky to catch. Most of the calculations happen under the hood, inside torch and cuda. I would start by monitoring the gradients from the loss modules. You tried that already, but I am not sure if a plain print inside a module assigned to run on a gpu is a good idea. What I did recently was add local variables to the loss modules, store the gradients there and then print them out inside feval. But even that will probably not help much; we may learn something but still not be able to solve the issue (which almost certainly happens under the hood).

BTW, you say self.gradInput (for the modules you checked) looked normal. Now, as the losses get evaluated properly, that is to be expected. What counts, however, are the gradients at the input of the model, which determine how the input image is to be modified. If THAT gradient goes to zero, nothing will change from iteration to iteration.

It also comes to mind: have you tried different style and content weights? They affect the gradients, so significantly larger weights might help. Another factor between the input and the model is the TVLoss layer; experiment with different settings there too. Also, the TVLoss module can be a good place to get the value of the total gradient.
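
As a rough illustration of that debugging approach (a sketch only, assuming the stock StyleLoss module and the style_losses table visible inside feval in neural_style.lua; the dbg field is a debugging addition, not part of the original code):

    -- Inside StyleLoss:updateGradInput(input, gradOutput), after self.gradInput
    -- has been computed, stash a few summary statistics instead of printing
    -- directly from a module assigned to run on a GPU:
    self.dbg = {
      min  = self.gradInput:min(),
      max  = self.gradInput:max(),
      mean = self.gradInput:mean(),
    }

    -- Then, inside feval(x), after the backward pass, print the stored values:
    for i, mod in ipairs(style_losses) do
      if mod.dbg then
        print(string.format('Style %d gradInput: min %e max %e mean %e',
                            i, mod.dbg.min, mod.dbg.max, mod.dbg.mean))
      end
    end

The same kind of stash-and-print could be added to the TVLoss module to watch the total gradient near the model input.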

ProGamerGov commented 6 years ago

@htoyryla Using a style weight of 75000 did not fix the issue. Neither did disabling the content loss module via -content_weight 0. For my previous tests, -tv_weight was set to 0. I also tested -init random and -pooling avg, but neither of those parameters, nor their other options, fixed the issue either.

The -tv_weight value does appear to be capable of "fixing" the loss values, but at the cost of the negative effects of TV regularization. The -tv_weight value also seems to have a specific cutoff point in terms of its effectiveness at "fixing" the loss values.

Using -tv_weight 0.0000005:

Running optimization with ADAM
Iteration 1 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219
Iteration 2 / 200
  Content 1 loss: 1836153.125000
  Style 1 loss: 4058.315277
  Style 2 loss: 5598318.603516
  Style 3 loss: 7999315.429688
  Style 4 loss: 550134.429932
  Style 5 loss: 21249.456882
  Total loss: 16009229.360294
Iteration 3 / 200
  Content 1 loss: 1712206.250000
  Style 1 loss: 6884.845018
  Style 2 loss: 9333292.236328
  Style 3 loss: 13165092.773438
  Style 4 loss: 877727.874756
  Style 5 loss: 32724.154472
  Total loss: 25127928.134012
Iteration 4 / 200
  Content 1 loss: 1606906.347656
  Style 1 loss: 9842.310905
  Style 2 loss: 13070276.367188
  Style 3 loss: 17805824.707031
  Style 4 loss: 1193168.975830
  Style 5 loss: 43954.135895
  Total loss: 33729972.844505

Using -tv_weight 0:

Running optimization with ADAM
Iteration 1 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219
Iteration 2 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219
Iteration 3 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219
Iteration 4 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219

-tv_weight 0.00000000001:

Running optimization with ADAM
Iteration 1 / 200
  Content 1 loss: 1994813.281250
  Style 1 loss: 1589.992940
  Style 2 loss: 2065276.977539
  Style 3 loss: 2789657.592773
  Style 4 loss: 215494.812012
  Style 5 loss: 9914.423704
  Total loss: 7076747.080219
Iteration 2 / 200
  Content 1 loss: 1994724.609375
  Style 1 loss: 1591.018081
  Style 2 loss: 2066690.734863
  Style 3 loss: 2791444.885254
  Style 4 loss: 215617.721558
  Style 5 loss: 9919.698715
  Total loss: 7079988.667846
Iteration 3 / 200
  Content 1 loss: 1994599.609375
  Style 1 loss: 1592.466652
  Style 2 loss: 2068689.147949
  Style 3 loss: 2793970.825195
  Style 4 loss: 215791.442871
  Style 5 loss: 9927.155256
  Total loss: 7084570.647299
Iteration 4 / 200
  Content 1 loss: 1994446.484375
  Style 1 loss: 1594.235837
  Style 2 loss: 2071136.352539
  Style 3 loss: 2797064.941406
  Style 4 loss: 216004.257202
  Style 5 loss: 9936.286926
  Total loss: 7090182.558286
Iteration 5 / 200
  Content 1 loss: 1994269.531250
  Style 1 loss: 1596.276283
  Style 2 loss: 2073960.937500
  Style 3 loss: 2800638.610840
  Style 4 loss: 216249.984741
  Style 5 loss: 9946.829796
  Total loss: 7096662.170410

The actual cutoff for -tv_weight affecting things in the correct direction has 14 zeros past the decimal point before the 1. Using 15 zeros creates a weird endless loop pattern in the values that repeats over multiple iterations. Using 16 zeros results in no change in the loss values.

Using -tv_weight with 14, 15, and 16 zeros, and their associated loss values: https://gist.github.com/ProGamerGov/7338462bea65c066b57ef5668003a15d

Based on my experiments, I think the issue must be occurring in the Style Loss functions and/or the feval(x) function. The results of my testing also seem to support the theory of something in the code reaching its limit.


What I did recently was add local variables to the loss modules, store the gradients there and then inside feval print them out. But even that will not probably help much, we may learn something but still not be able to solve the issue (which almost certainly happens under the hood).

Would it be possible for you to share the modified code you are referring to here?

htoyryla commented 6 years ago

I think it is significant that TVLoss has an effect on the issue. According to your results, if the TVLoss layer is omitted, then the input does not change (= total gradient at input is zero). When you include the TVLoss layer, then the input changes, but the style losses increase instead of decreasing.

You suspect the problem lies within StyleLoss or feval. It may be, but remember that the loss modules and feval are only part of the process; adam in the optim package does much of the work in determining how the input should change before the next iteration. One could track the gradients from the loss modules to see how they behave, and it would also help to see the total gradient before and after TVLoss.

The code in which I stored internal variables within a loss module belongs to this discussion https://github.com/jcjohnson/neural-style/issues/425 . An example can be found here https://gist.github.com/htoyryla/233a9d0857440d2a8bafe732ddeba325 . In the StyleLoss module I have created a variable dbg, and during operation I store some variable there. At line 192 I then print out the values. Note that in this case I was only interested in what happens at target capture. You would also want to print the values from feval.

Printing out gradients (large tensors) as such is hardly very informative. In my case I was looking for nans or infs; you should probably be looking for where the gradients go to zero. Taking the min, max and mean of the gradient is probably more informative than printing out whole gradients.

"the theory of something in the code reaching it's limit" is IHMO very vague. My theory is that for some reason, the input does not change from iteration to iteration, which again suggests that the total gradient is zero. Therefore one should be looking for whether it is so and why it happens.

ProGamerGov commented 6 years ago

@htoyryla I modified Neural-Style to print a whole bunch of variables from the style loss function, and the feval(x) function here: https://gist.github.com/ProGamerGov/e09e450f5c7dbb72ccac22c6244fa2f3

The following graphs of the feval(x) function's grad and x variables are in the linked album: a functioning loss system; the -image_size issue; and the -image_size issue with the -tv_weight layer.

Link to the album: https://imgur.com/a/0JnVT


The raw log files for iterations 0-200 (saved every 1 iteration) for the above experiments can be found here:

-image_size 2432 & -tv_weight 0: https://gist.github.com/ProGamerGov/29e89551413bfe8ff8022df3edf822bd

-image_size 2432 & -tv_weight 0.0000005: https://gist.github.com/ProGamerGov/887307613788bde39567e287247bd209

-image_size 512 & -tv_weight 0: https://gist.github.com/ProGamerGov/c3d850318d6d80eee63890beb74ff928


It seems that the total gradient is 0, but a -tv_weight value above a certain threshold changes things. I'm not really sure how I would dig deeper into why the gradient is 0, but I think that might require printing variables from the optim package's code?

ProGamerGov commented 6 years ago

After testing the GramMatrix code from here: https://github.com/jcjohnson/neural-style/blob/master/neural_style.lua#L489-L512, it outputs the same values over and over again with an image size of 2432. There aren't any obvious zeros or out-of-place values.

Maybe the backward pass (self.crit:backward) is the cause, or related to the cause?

Edit, More Logs:

GramMatrix & -image_size 2432 & -tv_weight 0.0000005: https://gist.github.com/ProGamerGov/e703044aeee6f4f8404e1bea3ba22dee

GramMatrix & -image_size 2432 & -tv_weight 0: https://gist.github.com/ProGamerGov/9929e217b88dd76be8e93af2ef9e35ec

GramMatrix & -image_size 512 & -tv_weight 0: https://gist.github.com/ProGamerGov/c943491331eb422f525b1274f4e21c25

The modified neural_style.lua used to create the 3 above logs: https://gist.github.com/ProGamerGov/8eced5ca1e8cfc181a50bf5fa0be9669


This log uses -image_size 2432 & -tv_weight 0, and I tried to get more variables that I missed in the previous logs from the GramMatrix and Style Loss functions: https://gist.github.com/ProGamerGov/a6f7b47b84ad16b91d8a9977ab572212

The above log used this modified neural_style.lua: https://gist.github.com/ProGamerGov/d0ad22965d6d41001201b54310dfd3db

htoyryla commented 6 years ago

OK. So now we know that the total gradient really is zero. This was to be expected, too, because nothing was changing from iteration to iteration.

There is no point in examining what the forward pass gives (e.g. in Gram matrix). One needs to look at the gradients. In a styleloss module, the gradient is calculated as follows.

    local dG = self.crit:backward(self.G, self.target)
    dG:div(input:nElement())
    self.gradInput = self.gram:backward(input, dG)

dG is the difference between the current gram matrix output and the target. A backward pass through the gram matrix gives the gradient at (gram matrix) input.

In the latter part of the code, not shown above, the gradient is multiplied by the style weight and then added to the gradient from the output. This is how the gradients are accumulated in the backward pass, until we have the total gradient at the bottom, which is then applied to the image before the next iteration.

Note also the second line, the division by the number of elements in the feature maps. The idea is to make loss values from different layers more comparable in magnitude, but it also makes the gradients smaller as the image size increases. It would be interesting to know what happens to these variables as the image size is increased until neural-style stops working. Do they gradually vanish to zero (if so, we can modify the normalization, i.e. the 2nd line above), or do they vanish suddenly at some size?

However, it seems to me that StyleLoss alone cannot cause the problem. Even if StyleLoss produced zero gradients for some reason, ContentLoss would still produce a nonzero gradient, making the iterations work. Unless, of course, one is using -init image and the content loss is zero.

It is a pity I can't try this out myself... an interesting issue. Your printouts are almost useless to me: too much information instead of focusing on the meaningful points. I am doing some tests without a GPU to see what happens to the gradients when the image size increases. It looks like they do decrease with image size. Note also that even if the gradient values do not look too small at this point, they will be passed on to the optimizer, and we don't really know what happens there.

htoyryla commented 6 years ago

If the problem is caused by the gradients getting too small at larger image sizes, I have here a modified version in which the style loss module adjusts the gradient so that this does not happen: the divisor input:nElement() is corrected with a factor (512^2)/(image_size^2).

https://gist.github.com/htoyryla/0c0a1462db1bbabd793e8235557a1c4d
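
The core of that adjustment is presumably along these lines (a sketch of the idea only, not the gist's exact code; self.image_size is assumed here to be stored on the module when it is constructed):

    function StyleLoss:updateGradInput(input, gradOutput)
      local dG = self.crit:backward(self.G, self.target)
      -- Original normalization was dG:div(input:nElement()); the divisor is
      -- corrected by (512^2)/(image_size^2) so the gradient no longer shrinks
      -- as the image grows. self.image_size: assumed set in __init from params.
      dG:div(input:nElement() * (512 * 512) / (self.image_size * self.image_size))
      self.gradInput = self.gram:backward(input, dG)
      self.gradInput:mul(self.strength)
      self.gradInput:add(gradOutput)
      return self.gradInput
    end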

I am not sure if this helps, but you can try. What bothers me is that even if the style loss modules gave zero gradients, there would still be the content loss module to steer the image towards the content (assuming -init random is used).

Note that as the gradients are scaled differently, the process will also find a different optimum. Effectively, the style will be weighted more heavily than before when image_size is above 512, so you may have to experiment with style_weight to get what you expect.

PS. I also noted that even if the gradients inside a styleloss module decrease with image size, the total gradient in feval does not, at least not significantly, so the optimizer must be adjusting the gradient. What may happen with a very large image, however, is that the individual gradients fall to such a low level that the optimizer fails to work. But this is something that I cannot reproduce with a single 8GB GPU, nor with 24GB of CPU memory.

The total gradient is calculated at this line within feval:

    local grad = net:updateGradInput(x, dy)

Actually, all the gradients are calculated here. x is the input image, dy is a tensor of zeroes, as the output of the model is irrelevant, all the gradients come from the loss modules.

htoyryla commented 6 years ago

A different idea... adam has a config parameter epsilon, which is somehow related to "numerical stability". The default is 1e-8, but for instance Google suggests that for certain tasks values up to 1.0 or 0.1 may be needed. On the other hand, somewhere I found that a higher epsilon will result in slower learning (more iterations needed).

The place to add this is https://github.com/jcjohnson/neural-style/blob/master/neural_style.lua#L235 . Add a line as in:

    optim_state = {
      learningRate = params.learning_rate,
      epsilon = 1e-4 
    }

or whatever value you want to try.

ProGamerGov commented 6 years ago

To try to rule out the criterion as the main cause of the issue, I tried replacing the MSECriterion with both the AbsCriterion and the SmoothL1Criterion, which are also used in fast-neural-style. Neither of the two criterions fixed the issue.
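
For reference, the swap itself would be a one-line change in the loss module constructors, which in upstream neural_style.lua use nn.MSECriterion (an illustrative sketch, not the exact patch I used):

    -- In StyleLoss:__init (and likewise ContentLoss:__init):
    -- self.crit = nn.MSECriterion()     -- original
    self.crit = nn.SmoothL1Criterion()   -- or nn.AbsCriterion()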

Using your neural_grad.lua modifications also did not fix the issue.

For adam, an epsilon value of 1e-4 or 1 does not resolve the issue either.

I also tried using segmentation as a long shot via: https://gist.github.com/ProGamerGov/bcbd27a3d2e431adb73ef158d9990d93, as I have noticed some interesting results from how it affects parameters. But segmentation also didn't change anything.

ProGamerGov commented 6 years ago

@htoyryla I think I made a breakthrough!

New install with CUDA 9.0 and cuDNN v7:

Capturing style target 1    
Running optimization with ADAM  
Iteration 1 / 2500  
  Content 1 loss: 1992618.359375    
  Style 1 loss: 2672.421992 
  Style 2 loss: 2859453.552246  
  Style 3 loss: 10104969.726562 
  Style 4 loss: 663910.537720   
  Style 5 loss: 25689.814568    
  Total loss: 15649314.412463   
Iteration 2 / 2500  
  Content 1 loss: 2378580.078125    
  Style 1 loss: 1476.570815 
  Style 2 loss: 986022.583008   
  Style 3 loss: 3857117.431641  
  Style 4 loss: 556773.742676   
  Style 5 loss: 25292.687416    
  Total loss: 7805263.093680    
Iteration 3 / 2500  
  Content 1 loss: 2303561.914062    
  Style 1 loss: 944.920152  
  Style 2 loss: 718514.877319   
  Style 3 loss: 2983827.392578  
  Style 4 loss: 288251.037598   
  Style 5 loss: 6392.563105 
  Total loss: 6301492.704815    
Iteration 4 / 2500  
  Content 1 loss: 2202003.710938    
  Style 1 loss: 764.534175  
  Style 2 loss: 812083.007812   
  Style 3 loss: 2528113.037109  
  Style 4 loss: 184357.040405   
  Style 5 loss: 5791.568398 
  Total loss: 5733112.898839    
Iteration 5 / 2500  
  Content 1 loss: 2296690.429688    
  Style 1 loss: 655.681953  
  Style 2 loss: 805789.764404   
  Style 3 loss: 1844594.238281  
  Style 4 loss: 127826.431274   
  Style 5 loss: 3576.318741 
  Total loss: 5079132.864341    

After I tried to update Torch and its modules, with CUDA 8.0 and cuDNN v5:

Capturing style target 1    
Running optimization with ADAM  
Iteration 1 / 2500  
  Content 1 loss: 1994813.281250    
  Style 1 loss: 2699.189365 
  Style 2 loss: 2853272.277832  
  Style 3 loss: 10104018.310547 
  Style 4 loss: 662611.404419   
  Style 5 loss: 25619.004250    
  Total loss: 15643033.467662   
Iteration 2 / 2500  
  Content 1 loss: 1994813.281250    
  Style 1 loss: 2699.189365 
  Style 2 loss: 2853272.277832  
  Style 3 loss: 10104018.310547 
  Style 4 loss: 662611.404419   
  Style 5 loss: 25619.004250    
  Total loss: 15643033.467662   
Iteration 3 / 2500  
  Content 1 loss: 1994813.281250    
  Style 1 loss: 2699.189365 
  Style 2 loss: 2853272.277832  
  Style 3 loss: 10104018.310547 
  Style 4 loss: 662611.404419   
  Style 5 loss: 25619.004250    
  Total loss: 15643033.467662   
Iteration 4 / 2500  
  Content 1 loss: 1994813.281250    
  Style 1 loss: 2699.189365 
  Style 2 loss: 2853272.277832  
  Style 3 loss: 10104018.310547 
  Style 4 loss: 662611.404419   
  Style 5 loss: 25619.004250    
  Total loss: 15643033.467662   
Iteration 5 / 2500  
  Content 1 loss: 1994813.281250    
  Style 1 loss: 2699.189365 
  Style 2 loss: 2853272.277832  
  Style 3 loss: 10104018.310547 
  Style 4 loss: 662611.404419   
  Style 5 loss: 25619.004250    
  Total loss: 15643033.467662   

It seems that the issue could have been caused by the CUDA and/or cuDNN versions I was using.

Though it seems that CUDA 8.0 and cuDNN v5 are significantly more memory efficient than CUDA 9.0 and cuDNN v7: https://github.com/jcjohnson/neural-style/issues/429

ProGamerGov commented 6 years ago

So I made it to an image size of 2816, where this new error occurred:

conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8-SOD100: 1 1 4096 100
Setting up style layer          2       :       relu1_1
Setting up style layer          7       :       relu2_1
Setting up style layer          12      :       relu3_1
Setting up style layer          19      :       relu4_1
Setting up content layer        21      :       relu4_2
Setting up style layer          26      :       relu5_1
Capturing content targets
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> output]
  (1): nn.GPU(1) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): cudnn.SpatialConvolution(3 -> 64, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): nn.StyleLoss
  }
  (2): nn.GPU(2) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): cudnn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): cudnn.SpatialMaxPooling(2x2, 2,2)
  }
  (3): nn.GPU(3) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> output]
    (1): cudnn.SpatialConvolution(64 -> 128, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): nn.StyleLoss
    (4): cudnn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
    (5): cudnn.ReLU
    (6): cudnn.SpatialMaxPooling(2x2, 2,2)
  }
  (4): nn.GPU(4) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> output]
    (1): cudnn.SpatialConvolution(128 -> 256, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): nn.StyleLoss
  }
  (5): nn.GPU(5) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> output]
    (1): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): cudnn.SpatialConvolution(256 -> 256, 3x3, 1,1, 1,1)
    (4): cudnn.ReLU
    (5): cudnn.SpatialMaxPooling(2x2, 2,2)
  }
  (6): nn.GPU(6) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> output]
    (1): cudnn.SpatialConvolution(256 -> 512, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): nn.StyleLoss
    (4): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (5): cudnn.ReLU
    (6): nn.ContentLoss
  }
  (7): nn.GPU(7) @ nn.Sequential {
    [input -> (1) -> (2) -> (3) -> (4) -> (5) -> output]
    (1): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (2): cudnn.ReLU
    (3): cudnn.SpatialMaxPooling(2x2, 2,2)
    (4): cudnn.SpatialConvolution(512 -> 512, 3x3, 1,1, 1,1)
    (5): cudnn.ReLU
  }
  (8): nn.GPU(8) @ nn.Sequential {
    [input -> (1) -> output]
    (1): nn.StyleLoss
  }
}
Capturing style target 1
Capturing style target 2
Capturing style target 3
Capturing style target 4
Capturing style target 5
Capturing style target 6
Capturing style target 7
Capturing style target 8
Running optimization with ADAM

cudnnConvolutionBackwardData failed:    9        convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA1,3,2615,2816 -filtA64,3,3,3 1,64,2615,2816 -padA1,1 -convStrideA1,1 CUDNN_DATA_FLOAT
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 1 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/cudnn/find.lua:94: Error in CuDNN: CUDNN_STATUS_NOT_SUPPORTED (cudnnConvolutionBackwardData)
stack traceback:
        [C]: in function 'error'
        /home/ubuntu/torch/install/share/lua/5.1/cudnn/find.lua:94: in function 'checkedCall'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:212: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:201>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:58: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:50>
        [C]: in function 'pcall'
        /home/ubuntu/torch/install/share/lua/5.1/cutorch/init.lua:32: in function 'withDevice'
        /home/ubuntu/torch/install/share/lua/5.1/nn/GPU.lua:112: in function </home/ubuntu/torch/install/share/lua/5.1/nn/GPU.lua:108>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:58: in function 'updateGradInput'
        neural_style.lua:284: in function 'opfunc'
        /home/ubuntu/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
        neural_style.lua:307: in function 'main'
        neural_style.lua:601: in main chunk
        [C]: in function 'dofile'
        ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

htoyryla commented 6 years ago

I had a fleeting thought to ask about torch and cuda versions, but forgot it when there was so much else going on.

For me, these debugging sessions are a good opportunity to get more and more familiar with how the code and the process work. Now I have become interested in looking at the total gradient while neural-style is running: not the mean, but the min and max. Those appear to give quite a good quality indication; when the max and min values of the gradient get small enough, the image also looks good.

Also it may be that my modified code with adjusted gradients might produce interesting results... maybe.

About the criterion in style loss... thinking further it is logical that the problem was not there. The style loss criterion compares two gram matrices, which are never larger than 512x512 regardless of the image size.

ProGamerGov commented 6 years ago

For me, these debugging sessions are a good opportunity to get more and more familiar with how the code and the process works.

I've also learned quite a bit from these debugging sessions, about various things like how Neural-Style works, image color spaces, and machine learning in general.

Now I got interested in looking at the total gradient while neural-style is running. Not mean but the min and max. Those appear to give a quite good quality indication; when the max and min values of the gradient get small enough, the image also looks good.

From my experience with graphing the loss values, I discovered that the smoothness and the steepness of the graph depend on your chosen parameters. I think that the "x" value in the feval(x) function can be used in the same way, where a smooth graph like the one I shared above, indicates a good choice of parameters.

Also it may be that my modified code with adjusted gradients might produce interesting results... maybe.

I'll try and do some tests, and see if the results are noticeably different.

About the criterion in style loss... thinking further it is logical that the problem was not there. The style loss criterion compares two gram matrices, which are never larger than 512x512 regardless of the image size.

Are these 512x512 images, shrunken down versions of the style image(s)? Or tile like pieces from the style image(s)?

htoyryla commented 6 years ago

graphing the loss values

The problem with loss values is that they never get to zero. I have seen people assuming that the absolute value of the loss has a meaning, but it doesn't. Loss is only a relative indicator along the way while looking for a minimum. So I got interested in looking at the gradients, which represent the actual modifications we are performing on the image. When they get below a certain level, the image is no longer evolving significantly. But on the graph level, when you are looking at it after the process is finished, they may well look quite similar.
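
As a rough illustration (a hypothetical snippet, not the actual neural_style.lua code), monitoring the gradient could be as simple as logging its extremes from inside the feval closure, right after the backward pass has produced the total gradient:

    -- Hypothetical helper: log the extremes of the total gradient `grad`
    -- (the tensor that feval returns alongside the loss) at a given iteration.
    local function report_gradient(grad, iteration)
      print(string.format('Iteration %d: grad min %.6f, max %.6f, mean abs %.6f',
            iteration, grad:min(), grad:max(), grad:abs():mean()))
    end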

I think that the "x" value in the feval(x) function can be used in the same way, where a smooth graph like the one I shared above, indicates a good choice of parameters.

If I am not mistaken, x is actually the input, that means the current image.

Are these 512x512 images, shrunken down versions of the style image(s)? Or tile like pieces from the style image(s)?

Neither. No images at all. The gram matrix calculated from N spatial feature maps of size H x W loses the dimensions and outputs an N x N tensor, which functions as a kind of statistical representation of color and texture in the image, but not of how they are spatially distributed in the image. This is why people (including yourself) have been so keen on looking for additional methods to add spatial control to the style transfer.
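
To make that concrete, here is a minimal Torch sketch (not the exact StyleLoss code in neural_style.lua) of how such a Gram matrix is computed from an N x H x W feature tensor; the spatial dimensions collapse, so only channel-to-channel correlations remain:

    require 'torch'

    -- Minimal sketch: Gram matrix of a feature tensor of size N x H x W.
    local function gram(features)
      local N, H, W = features:size(1), features:size(2), features:size(3)
      local flat = features:view(N, H * W)   -- N x (H*W)
      return torch.mm(flat, flat:t())        -- N x N, no spatial layout left
    end

    local fake_relu4_1 = torch.randn(512, 64, 64)  -- stand-in for real VGG activations
    print(gram(fake_relu4_1):size())               -- 512x512 regardless of H and W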

The total gradient, on the other hand, corresponds to the current image, indicating, pixel by pixel, in which direction the image needs to be modified. Which led me to the idea of visualizing the gradients somehow, I mean spatially, as images rather than statistical graphs; not sure if it makes sense, but maybe I'll try something. Kind of showing the difference between the current image and an ideal one, as an image.

htoyryla commented 6 years ago

I'll try and do some tests, and see if the results are noticeably different.

The main point is how it works when the image_size is increased, compared to 512. At image size 512 it should work exactly like the original.

htoyryla commented 6 years ago

This gist contains a modified neural-style that displays the total gradient, converted to an image by normalizing it to the range 0 .. 255. It just gives a rough visual indication of how and where the image is changing during the iteration process. https://gist.github.com/htoyryla/445e2649293f702a940c58a8a3cef472
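
The core of the idea, sketched roughly (this is not the gist code itself, and `grad` is assumed to hold the total gradient tensor; image.save expects floats in 0..1, so the sketch normalizes to that range instead of 0..255):

    require 'torch'
    require 'image'

    -- Rescale the total gradient so it can be saved and inspected as an image
    -- showing where the picture is still changing.
    local function gradient_to_image(grad)             -- grad: 3 x H x W
      local g = grad:double():clone()
      g:add(-g:min())                                   -- shift minimum to 0
      g:div(math.max(g:max(), 1e-8))                    -- scale into 0..1
      return g
    end

    -- e.g. image.save('grad_0100.png', gradient_to_image(grad))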

Interesting to compare how l-bfgs and adam handle the same image with the same settings.

htoyryla commented 6 years ago

This has nothing to do with the current issue, but just testing an idea I got from the gram matrix explanation I gave above.

This version of neural-style evaluates style by two methods: gram matrix and a mean of the feature maps. Gram matrix is not spatial while the mean matrix is.

A new param --mean_weight controls the weight of the mean for style; the old style_weight param continues to control the weight of the gram matrix.

There is also a new param -loss_type to select between L2 and SmoothL1, as in fast-neural-style, from where I copied the StyleLoss code (even if I modified the mean calculation): https://gist.github.com/htoyryla/fdcab165b397b1b5c4986f4877454d3a
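
For clarity, the difference between the two statistics can be sketched like this (a hypothetical helper, not the code in the gist): the Gram matrix shown earlier throws away the layout, whereas averaging over the channel dimension keeps an H x W map, which is why the mean target is spatial:

    -- Hypothetical sketch: a spatial style statistic obtained by averaging
    -- an N x H x W feature tensor over its channel dimension.
    local function mean_statistic(features)
      return features:mean(1):squeeze()   -- -> H x W map, layout preserved
    end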

ProGamerGov commented 6 years ago

@htoyryla Here are my results from using your neural-mean2.lua:

Full sized comparison image link: https://i.imgur.com/2y5CnrH.jpg

Full album link with style & content images: https://imgur.com/a/dUdzD

It almost seems to act like it was increasing the content weight. At 1e4, it copied a good portion of a large object in the style image.


Something else I was wondering about was whether it is possible to insert a histogram matching color transfer step between iterations, in order to help prevent unwanted spots from forming. I've found that it benefits the multi-scale resolution technique when it's used in between steps, but I think it could be more beneficial if it were used for every iteration.

Currently I have found that following a pattern like this can significantly help reduce unwanted grey spots, and help prevent the "grey haze" regions (which eventually become the grey spots) from becoming as prominent:

th neural_style.lua -output_image out1.png -style_image style_image.png

python linear-color-transfer.py --target_image out1.png --source_image style_image.png --output_image out1_hist.png

th neural_style.lua -init_image out1_hist.png -output_image out2.png -style_image style_image.png

The linear-color-transfer.py script comes from here.

See this album for an example of the grey spot/haze issue, and how my solution prevents the grey haze and grey spots from becoming as prominent: https://imgur.com/a/OTAlv
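
Just to illustrate what an in-the-loop version could look like, here is a hedged Torch sketch of a much cruder stand-in for linear-color-transfer.py: it only matches each channel's mean and standard deviation to the style image (no covariance/PCA transfer), but being pure tensor code it could in principle be applied to the current image between iterations (possibly at the cost of disturbing the optimizer):

    require 'torch'

    -- Crude per-channel moment matching (not the full linear color transfer):
    -- shift/scale each channel of `img` to the mean/std of `style`.
    local function match_channel_moments(img, style)   -- both 3 x H x W, values 0..1
      local out = img:clone()
      for c = 1, 3 do
        local ic, sc = img[c], style[c]
        out[c]:add(-ic:mean()):div(ic:std() + 1e-8):mul(sc:std()):add(sc:mean())
      end
      return out:clamp(0, 1)
    end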

htoyryla commented 6 years ago

It almost seems to act like it was increasing the content weight. At 1e4, it copied a good portion of a large object in the style image.

Yes, that is because it uses the average of the feature maps obtained from the style image as a target. I found that usually mean_weight should be clearly less than style weight to produce good results.

Here's an interesting image that uses the Tübingen image as content and adds a feeling of nature and woods from another image.

[image: 22851711_10156015046668729_7759568598390706963_n]

htoyryla commented 6 years ago

Currently I have found that following a pattern like this can significantly help reduce unwanted grey spots, and help prevent the "grey haze" regions (which eventually become the grey spots) from becoming as prominent.

I, too, was thinking of the grey areas when I was looking at the gradient on display. I got a good image with l-bfgs but adam produced grey areas; looking at the gradient it was obvious that not much was happening in those areas at all. I started wondering whether one could manipulate the gradient a bit to make something happen. As l-bfgs worked fine, the problem was not in the images nor in the model but in the optimization.

I think it should be possible to manipulate the image between iterations, although it might disturb the inner workings of the optimizer. We discussed and tested something related to feval and the optimizer once; I don't remember the results.

ProGamerGov commented 6 years ago

@htoyryla

I got a good image with l-bfgs but adam produced grey areas

As l-bfgs worked fine, the problem was not in the images nor in the model but in the optimization.

While l-bfgs is extremely good at preventing these grey spots, it's not 100% perfect and I have seen cases where it created grey spots like adam (though they weren't nearly as bad as adam's grey spots). That said, adam seems to be a lot worse for creating grey spots, and in many cases it creates uglier ones than l-bfgs. At the very least, you can sometimes notice almost-grey areas that l-bfgs creates, which can grow into large grey spots with adam.

looking at the gradient it was obvious that not much was happening in those areas at all.

Yeah, I've been trying to figure out how to deal with the issue for a while now, and the best solution I came up with (constant histogram matching) seems like it's more of a band-aid than a solution in its current form.

htoyryla commented 6 years ago

While l-bfgs is extremely good at preventing these grey spots, it's not 100% perfect and I have seen cases where it created grey spots like adam

I didn't mean to claim that l-bfgs is free from grey areas. My point was rather the following: I have seen someone explain the grey areas as a result of weak activations, i.e. not much response in the feature maps in those areas. If that is true and the model does not respond to an image, then there is not much one can do. As if the model does not understand what is in the image in that area. The only thing to do is to try different layers to see if they work better.

But if l-bfgs succeeds with specific images while adam fails, then there must be enough activations in the feature maps, and the result depends rather on how well the optimization succeeds.

Would be interesting to look deeper into what is happening in the feature maps and gradients when a really clear case of grey areas happens. Maybe later though.

ProGamerGov commented 6 years ago

I've had a suspicion that clipping might be responsible for the grey spots/haze, but I haven't really had the chance to test the theory.

It also occurs to me that many, if not all, of the style images that I've had grey spot/haze problems with are artwork that was created digitally. I think this might have something to do with a specific part of the content, like areas of constant color and smooth texture, that are similar to clipping in photography.

htoyryla commented 6 years ago

My first assumption was indeed, that the grey areas result from something in the images which leads to weak activations in the feature maps. So maybe this can happen if the images have areas that are rather featureless. I wonder what happens if one adds a bit of noise, either to the image(s) or to the feature maps.
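
A hypothetical Torch sketch of that idea (not code from neural-style or the neural_mean gist): add zero-mean noise to the image, with a magnitude that decays linearly over the iterations so it fades out as the result settles:

    require 'torch'

    -- Add decaying noise to the current image tensor (in place).
    local function add_decaying_noise(img, iteration, max_iter, start_sigma)
      local sigma = start_sigma * (1 - iteration / max_iter)   -- linear decay to 0
      if sigma > 0 then
        img:add(torch.randn(img:size()):typeAs(img):mul(sigma))
      end
      return img
    end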

Clipping in the neural-style process itself would cause stronger effects (dark or bright areas, wrong colors) because clipping happens when the values become extreme. Grey means the values stay very close to zero, even the three color channels do not deviate from each other much to produce colored pixels or spots.

Can you give a style image which has this problem, and why not also the content image?

ProGamerGov commented 6 years ago

@htoyryla This style image seems to cause grey spots/regions on multiple content images:

Style Image: https://i.imgur.com/JZxUhfA.jpg

Content Image: https://i.imgur.com/HKF5yGD.jpg

The Output: https://i.imgur.com/dOYNv2e.png

This is probably the best example for experimenting with repeatable grey spots. Adam tends to make some really bad grey spots with this style image.

The style scale for the output was either 1, 0.5 or 0.75. I am not sure if that will make a difference for experimentation (maybe different style image sizes affect clipping?).


In the past, I would generally stop and move onto a new set of style and content image combinations when I encountered grey spots, so I will have to do some searching if we need more examples.

ProGamerGov commented 6 years ago

My first assumption was indeed, that the grey areas result from something in the images which leads to weak activations in the feature maps. So maybe this can happen if the images have areas that are rather featureless. I wonder what happens if one adds a bit of noise, either to the image(s) or to the feature maps.

I've been thinking that a certain amount of "texture" or detail is needed for the best possible results. Maybe we might be able to add this in some way or another without changing the style too much?

Clipping in the neural-style process itself would cause stronger effects (dark or bright areas, wrong colors) because clipping happens when the values become extreme. Grey means the values stay very close to zero, even the three color channels do not deviate from each other much to produce colored pixels or spots.

I've noticed that sometimes various geometric designs can be seen in the grey spots/regions if you zoom in far enough. These weird artifacts remind me of my experimentation with DeepDream hallucinations. But they could easily be just stuff from the style or content image which interacted with the grey spots/regions.

The histogram matching script that I use actually checks for and changes colors that might cause the Neural-Style clipping that Gatys et al. had to deal with: https://github.com/ProGamerGov/Neural-Tools/blob/master/linear-color-transfer.py#L64-L65, so I don't think really bright or dark areas are the cause of grey spots.

htoyryla commented 6 years ago

It seems to me we are talking about a different phenomenon. I don't see anything strange in your results. It just happens that the neural-style process "perceives" those areas differently from what you expect, and uses elements from the style image in places which are not correct in your eyes. I have seen cases where neural-style did not perceive an arm correctly, so that the upper part of the arm looked like a part of the background in the output image. You probably can affect this by processing the content image to make the shape of the arm clearer?

I think we humans are good at mentally correcting what we see, similar to how we can read a text even without paying attention to the misspellings. Similarly, when we see a shape, we know it is an arm, even if part of it might actually not be clear from the background. The neural models we have today are not very good at this.

I made some tests using these images and it seems to me that finding the correct balance between content and style is difficult and probably impossible. Using a lower style weight reproduces the content faithfully, but when the effect of the style is increased, style features start to dominate and parts of the content will be misrepresented (like missing a part of an arm) and replaced by a style feature (such as a colored blob). In the search for a compromise between content and style, the weaker content features will give way to stronger style features.

In the tests I used neural_mean.lua but without the mean style layers. The main difference then was that the style image is deformed into the same shape as the content image.

I also modified my neural_mean.lua to optionally add noise in the content and loss modules, so that the magnitude of the noise is gradually decreased in successive iterations. I think it can be a useful feature, as it might, used moderately, help the optimization to converge.

Like here: in the first image (style weight 1e3) the left arm is clearly visible (although fainter than the rest of the body), while in the second image (style weight 2e3) the arm already looks odd, as the style features are becoming more prominent.

[image: greyspot-n0b-nomean-sw1e3]

[image: greyspot-n0b-nomean-sw2e3]

ProGamerGov commented 6 years ago

I have run into the same size limitation error as before, but this time with a higher image size of 3328:

Capturing style target 1
Running optimization with ADAM

cudnnFindConvolutionBackwardDataAlgorithm failed:       2        convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA1,64,3328,2473 -filtA64,64,3,3 1,64,3328,2473 -padA1,1 -convStrideA1,1 CUDNN_DATA_FLOAT
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 2 module of nn.Sequential:
In 1 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionBackwardDataAlgorithm failed, sizes:  convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA1,64,3328,2473 -filtA64,64,3,3 1,64,3328,2473 -padA1,1 -convStrideA1,1 CUDNN_DATA_FLOAT
stack traceback:
        [C]: in function 'error'
        /home/ubuntu/torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'backwardDataAlgorithm'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:209: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:201>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:58: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:50>
        [C]: in function 'pcall'
        /home/ubuntu/torch/install/share/lua/5.1/cutorch/init.lua:32: in function 'withDevice'
        /home/ubuntu/torch/install/share/lua/5.1/nn/GPU.lua:112: in function </home/ubuntu/torch/install/share/lua/5.1/nn/GPU.lua:108>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:55: in function 'updateGradInput'
        neural_style.lua:284: in function 'opfunc'
        /home/ubuntu/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
        neural_style.lua:307: in function 'main'
        neural_style.lua:601: in main chunk
        [C]: in function 'dofile'
        ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:55: in function 'updateGradInput'
        neural_style.lua:284: in function 'opfunc'
        /home/ubuntu/torch/install/share/lua/5.1/optim/adam.lua:37: in function 'adam'
        neural_style.lua:307: in function 'main'
        neural_style.lua:601: in main chunk
        [C]: in function 'dofile'
        ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

In SpatialConvolution.lua, on line 201: https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua#L201

In SpatialConvolution.lua, on line 209: https://github.com/soumith/cudnn.torch/blob/master/SpatialConvolution.lua#L209

In find.lua, on line 483: https://github.com/soumith/cudnn.torch/blob/master/find.lua#L483


A similar error from VaKonS's tiled Neural-Style:

Scale: 1-1, simple scaling. 
Running optimization with L-BFGS    
Processing image part #1 of 20 ([1, 1] of 3552x4096, scaling factor 1.000). 
Resetting network.  
Capturing content targets   
Capturing style target 1    

cudnnFindConvolutionForwardAlgorithm failed:    2    convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA1,3,3343,6144 -filtA64,3,3,3 1,64,3343,6144 -padA1,1 -convStrideA1,1 CUDNN_DATA_FLOAT  
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
In 1 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/cudnn/find.lua:483: cudnnFindConvolutionForwardAlgorithm failed, sizes:  convDesc=[mode : CUDNN_CROSS_CORRELATION datatype : CUDNN_DATA_FLOAT] hash=-dimA1,3,3343,6144 -filtA64,3,3,3 1,64,3343,6144 -padA1,1 -convStrideA1,1 CUDNN_DATA_FLOAT
stack traceback:
    [C]: in function 'error'
    /home/ubuntu/torch/install/share/lua/5.1/cudnn/find.lua:483: in function 'forwardAlgorithm'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:179: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:175>
    [C]: in function 'xpcall'
    /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:41>
    [C]: in function 'pcall'
    /home/ubuntu/torch/install/share/lua/5.1/cutorch/init.lua:32: in function 'withDevice'
    /home/ubuntu/torch/install/share/lua/5.1/nn/GPU.lua:92: in function </home/ubuntu/torch/install/share/lua/5.1/nn/GPU.lua:88>
    [C]: in function 'xpcall'
    /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    neural_style_v.lua:303: in function 'process_img'
    neural_style_v.lua:541: in function 'main'
    neural_style_v.lua:841: in main chunk
    [C]: in function 'dofile'
    ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    neural_style_v.lua:303: in function 'process_img'
    neural_style_v.lua:541: in function 'main'
    neural_style_v.lua:841: in main chunk
    [C]: in function 'dofile'
    ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50


Edit:

Using cudnn.verbose = true, it seems that it may be an out-of-memory issue after all:

https://gist.github.com/ProGamerGov/9e5b367a90cd4be9cbd1ed023dafbb81

I thought I could go a lot higher in terms of image size, but I did that on an install with a different version of Torch and Cuda/cuDNN. Either Torch7 or Cuda/cuDNN has become less memory-efficient, and that is probably why I can't go any higher in terms of image size: https://github.com/jcjohnson/neural-style/issues/429

ProGamerGov commented 6 years ago

@htoyryla On the subject of grey spots/haze, I collected a bunch of examples caused by the adam optimizer. These did not appear until after I switched from using lbfgs to adam. Histogram matching helps make the grey spots/haze less prominent, but I'd like to do better. Do you have any ideas on how I can make them even less prominent?

I have tried changing the various parameters like learning rate, but that didn't seem to solve the problem. This issue also seems to occur with digital art style images, as adam is fine with paintings and other non digital artwork.

ProGamerGov commented 6 years ago

In the optim library, adam.lua contains all the usable parameters for the Adam optimizer:

ARGS:
- 'opfunc' : a function that takes a single input (X), the point
             of a evaluation, and returns f(X) and df/dX
- 'x'      : the initial point
- 'config` : a table with configuration parameters for the optimizer
- 'config.learningRate'      : learning rate
- `config.learningRateDecay` : learning rate decay
- 'config.beta1'             : first moment coefficient
- 'config.beta2'             : second moment coefficient
- 'config.epsilon'           : for numerical stability
- 'config.weightDecay'       : weight decay
- 'state'                    : a table describing the state of the optimizer; after each
                              call the state is modified

Normally we just manipulate the learning rate parameter to solve issues with Adam, but in order to correct these grey spot/haze issues, I think that manipulating one of the other parameters could help. The research paper for the Adam optimizer can be found here: https://arxiv.org/abs/1412.6980, and it may yield important clues as to whether or not we can solve the issue by manipulating Adam's parameters.

It appears that we can have learning rate decay when using Adam with the learningRateDecay parameter. We can also use the weightDecay parameter, though I am unsure of how it would affect the output image.

Interestingly, the grey spots/haze may not be caused by a lack of texture; instead, according to my experiments, they seem to be more related to the color of the style image and how Neural-Style drifts from that style color. Also keep in mind that the goal of my experiments here is to allow for the use of the adam optimizer, to extend the image size beyond the point where the lbfgs optimizer can no longer be used due to memory constraints.

From the first page of the research paper, on how Adam works:

We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation.

I am no expert in regards to stochastic optimization, but it seems that the beta1 and beta2 parameter values are of particular significance for how well Adam works. Maybe changing them might fix the issue, seeing as they are fixed values.

I also have to test the epsilon parameter again, as previously it had no effect due to a Cuda/cuDNN/Torch issue with really large image sizes, which was fixed in a subsequent version of the library that had the problem.
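
For reference, a simplified sketch of a single Adam step (not the optim/adam.lua source verbatim) shows where beta1, beta2 and epsilon enter; note that a larger epsilon damps the per-pixel update wherever the second-moment estimate is very small:

    -- Simplified Adam step: x is the image, g the gradient, state persists
    -- the running moment estimates between calls.
    local function adam_step(x, g, state, lr, beta1, beta2, epsilon)
      state.t = (state.t or 0) + 1
      state.m = state.m or x:clone():zero()            -- first moment (mean of g)
      state.v = state.v or x:clone():zero()            -- second moment (mean of g^2)
      state.m:mul(beta1):add(1 - beta1, g)
      state.v:mul(beta2):addcmul(1 - beta2, g, g)
      local mhat = state.m:clone():div(1 - beta1 ^ state.t)   -- bias correction
      local vhat = state.v:clone():div(1 - beta2 ^ state.t)
      -- update: x <- x - lr * mhat / (sqrt(vhat) + epsilon)
      x:addcdiv(-lr, mhat, vhat:sqrt():add(epsilon))
      return x
    end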

htoyryla commented 6 years ago

I am aware of the parameters for adam, and while not at all an expert on this either, I think you may have a point. Perhaps l-bfgs performs better just because it adapts better to the optimization task at hand, whereas adam would require fine-tuning of the parameters.

ProGamerGov commented 6 years ago

@htoyryla Adam seems to act sort of like a higher content weight with some style/content combos, and with others it seems to create less details, or less "refined" details. The "less refined" details can be solved by running the output through your Multiscale Resolution steps again from start to finish. So I don't think these neural "Grayout/Greyout" effects are something that Adam always creates, as I have plenty of outputs where Adam did not create any gray spots/haze.

Experimenting with Adam's epsilon parameter yields very interesting results. It seems that a higher epsilon value decreases the prominence of, and eventually eliminates, gray spots/haze entirely. Since the epsilon parameter is for ensuring Adam's "numerical stability", and knowing my results with texture versus color, I think that certain color combinations, or the way color is arranged (constant color versus color gradients), cause numerical instability in Adam. This numerical instability is then reflected as gray spots/haze, but it does not affect the loss values you see in the terminal in a noticeable way.

The results of my Adam Epsilon experiment:

There don't seem to be any negative effects created by Neural-Style with a different epsilon value for Adam, other than some stylization changes.

htoyryla commented 6 years ago

So I don't think these neural "Grayout/Greyout" effects are something that Adam always creates, as I have plenty of outputs where Adam did not create any gray spots/haze.

Adam does not really create anything, it looks for a minimum in a HxWx3 dimensioned space. As there are several peaks and valleys, the optimizer algorithm and parameters have an effect on which minimum is found.

ProGamerGov commented 6 years ago

Here's a better comparison of different epsilon values for the Adam optimizer:

Direct link to the full image: https://i.imgur.com/tQDF2Qd.jpg

The full album: https://imgur.com/a/PU004

ProGamerGov commented 6 years ago

Adam does not really create anything, it looks for a minimum in a HxWx3 dimensioned space.

I should have clarified that I meant "Neural-Style with Adam, creates" instead of "Adam creates".

As there are several peaks and valleys, the optimizer algorithm and parameters have an effect on which minimum is found.

So would there be multiple different but usable parameter configurations that would work with Neural-Style? Or is there one valley that would be better than the rest for Neural-Style?

htoyryla commented 6 years ago

So would there be multiple different but usable parameter configurations that would work with Neural-Style? Or is there one valley that would be better than the rest for Neural-Style?

The truth is that we cannot know. Presumably, the best valley would be the lowest, but how to find it? The loss is a function of HxWxC variables; how can we know about its valleys and peaks? The optimizer is like a nearly blind animal crawling across this landscape in a million-dimensional space, moving as defined by the algorithm and the parameters, never seeing the whole landscape from afar. The starting point is also significant; this is why different runs with -init random can produce different results.

ProGamerGov commented 6 years ago

I discovered that the beta1 parameter cannot be equal to 1 or else it results in nan for all the loss values.

The beta2 parameter seems to have significantly less of an effect on the output than the beta1 parameter. It also appears that one can manipulate the beta1 parameter to affect the stylization features, and then tune the epsilon parameter to remove the grey spots/haze that form as a result of changes to the beta1 parameter.

Experimenting with the beta1 and beta2 values:

While experimenting with beta1, beta2, and epsilon values, I found what appears to be a second valley:

More experimentation:

Full Album: https://imgur.com/a/8GFMw

ProGamerGov commented 6 years ago

After some more experimentation, I think that using this combination of settings for Adam creates better results, or at least results that are equal in quality to the default Adam parameter values:

    optim_state = {
      learningRate = params.learning_rate,
      beta1 = 0.99,
      epsilon = 1e-1,
    }

But unlike the default values, these values don't create those annoying gray spots.
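
For context, here is a toy, self-contained illustration of how a config table like this is consumed: optim.adam is called repeatedly with the same table (here minimizing a trivial quadratic rather than the style/content loss), and the table also accumulates the optimizer's internal state between calls.

    require 'torch'
    local optim = require 'optim'

    local optim_state = { learningRate = 0.1, beta1 = 0.99, epsilon = 1e-1 }
    local x = torch.randn(10)

    -- f(x) = 0.5 * ||x||^2, whose gradient is simply x.
    local function feval(x)
      return 0.5 * x:dot(x), x:clone()
    end

    for t = 1, 100 do
      local _, losses = optim.adam(feval, x, optim_state)
      if t % 25 == 0 then print(t, losses[1]) end
    end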

ProGamerGov commented 6 years ago

@htoyryla

Have you given any thought to ways that we could optimize neural_style.lua, even slightly more than it is currently? I'd like to try to counteract the newer Cuda/cuDNN/Torch7 memory usage increases.

I was also thinking that using something like pHash might let us measure the total change between the input images and the output image. Do you think this is possible? This might provide some interesting insight into the style transfer process.

ProGamerGov commented 6 years ago

I have identified a new issue in Neural-Style that I haven't seen described anywhere else:

This issue appears to be related to the FCN-32s PASCAL model, which can be found here: https://github.com/shelhamer/fcn.berkeleyvision.org

None of the style images contained anything that would resemble these gray rectangles, yet this artifact keeps showing up. Unlike the gray haze/spots issue from earlier, the gray rectangle issue occurs while using L-BFGS. So I think that something in the model might be causing this issue.

htoyryla commented 6 years ago

"So I think that something in the model might be causing this issue."

Yes... it is good to remember that neural-style evaluates both content and style not directly, but through the activations that each image raises in the model. So the results will depend on the model. Anything the model does not perceive well is likely to be absent or poorly presented in the result, and vice versa.

Early on, when I experimented with a model trained with simple geometric shapes only, I came to realize that it failed to do a good job on the content image. I was thinking of maybe trying two separate models, one for content and another for style.