yl-precious opened this issue 4 years ago
Hello, what are used as the training content and style images? Is the coco_2014_train dataset used as the content images? How is your performance?
In the original paper, content and style images are selected randomly from the COCO dataset, but I didn't run my code on COCO; I used another dataset for my own task. The original paper doesn't offer any quantitative measure except a user study, so I don't know how to report my performance. But you can have a try.
Thanks for your reply. There are two questions that confuse me.
First, in the reimplementation of the Laplacian loss function, I found it seems different from the original paper, which says it is 'computed over its six-connected neighbors'. I think it should use "coef", not "coef_out", to calculate the Laplacian loss, because "coef" has a shape of [B, 12, 8, H, W] while "coef_out" has a shape of [B, 96, H, W]. To cover the six-connected neighbors, the differences should be computed along the third, fourth, and fifth dimensions. For example, like this:
```python
def calc_laplacian_regularizer_loss(self, weights, l1=0.0, l2=0.0):
    # weights: affine coefficients of shape [B, 12, 8, H, W];
    # l1 / l2 switch on the L1 / L2 variants of the regularizer.
    if not l1 and not l2:
        return 0.0
    # Finite differences over the six-connected neighbors: along the
    # grid depth (dim 2), height (dim 3), and width (dim 4).
    diff1 = weights[:, :, 1:, :, :] - weights[:, :, :-1, :, :]
    diff2 = weights[:, :, :, 1:, :] - weights[:, :, :, :-1, :]
    diff3 = weights[:, :, :, :, 1:] - weights[:, :, :, :, :-1]
    if l1:
        result1 = torch.abs(diff1).sum()
        result1 += torch.abs(diff2).sum()
        result1 += torch.abs(diff3).sum()
    if l2:
        result2 = torch.pow(diff1, 2).sum()
        result2 += torch.pow(diff2, 2).sum()
        result2 += torch.pow(diff3, 2).sum()
    if l1 and not l2:
        return result1
    elif not l1 and l2:
        return result2
    else:
        return result1 + result2
```
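For illustration, a hypothetical call could look like the sketch below, assuming `coef` has the shape [B, 12, 8, H, W] mentioned above; `loss_c`, `loss_s`, and `lambda_lap` are made-up names for the content loss, style loss, and regularizer weight, not identifiers from the repository:

```python
# Sketch only: combine the proposed 3-D regularizer (applied to the
# bilateral-grid coefficients "coef") with the other losses, called from
# inside the class where the method above is defined.
loss_l = self.calc_laplacian_regularizer_loss(coef, l2=1.0)  # L2 variant
total_loss = loss_c + loss_s + lambda_lap * loss_l
```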
Second, the output of the network is likely not in [0, 1] or [-1, 1]. To get the final results, do you normalize the output using its min and max? When calculating the style and content losses, will directly feeding the output to VGG hurt performance, given that the content and style images are both in [0, 1]?
Waiting for your response.
Thanks for your questions. For the first question, I know that in the paper the Laplacian loss is computed across three dimensions, but I cannot understand why we should regularize the affine transform parameters across RGB. I think spatial regularization is also OK, and the experimental results seem fine with just 2-D regularization. Anyway, you are right about this question; I will update the code to 3-D regularization soon. For the second question, the paper doesn't say whether to normalize the output or not. I find that the HDRNet code doesn't use normalization, and it doesn't seem to harm performance in my experiments.
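For readers following along, a purely spatial (2-D) regularizer of the kind described here could look like the sketch below; it operates on the flattened coefficients `coef_out` of shape [B, 96, H, W] and is only an illustration of the idea, not the repository's actual implementation.

```python
import torch

def spatial_laplacian_loss(coef_out):
    """Minimal sketch of a purely spatial (2-D) smoothness term on the
    flattened coefficients coef_out of shape [B, 96, H, W]."""
    dh = coef_out[:, :, 1:, :] - coef_out[:, :, :-1, :]   # differences along H
    dw = coef_out[:, :, :, 1:] - coef_out[:, :, :, :-1]   # differences along W
    return dh.pow(2).sum() + dw.pow(2).sum()
```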
Okay. I trained the network for the style transfer task, adopting coco_2014_train as the content images and using the images (landscape dataset: 4139, flickr website: 5502) as the style images. The only differences from your code are:
1) In the splatting block, I used concatenation instead of addition.
2) In the fusion of the global and local feature maps, I didn't use a convolution; I directly added the two features, followed by a ReLU activation.
3) In the Laplacian loss function, I used the aforementioned code, which is an approximate version of the six-connected neighbors (a vectorized sketch of it is given after this list), whereas your updated code, which adds the extra dimension, is:
```python
class LaplacianRegularizer(nn.Module):
    def __init__(self):
        super(LaplacianRegularizer, self).__init__()
        self.mse_loss = torch.nn.MSELoss(reduction='sum')

    def forward(self, f):
        # f: bilateral-grid affine coefficients of shape [B, 12, 8, H, W]
        loss = 0.
        B, C, D, H, W = f.shape
        # For every grid cell, penalize the squared difference to all cells
        # in its (clamped) 3x3x3 neighborhood.
        for k in range(D):
            for i in range(H):
                for j in range(W):
                    front = max(k - 1, 0)
                    back = min(k + 1, D - 1)
                    up = max(i - 1, 0)
                    down = min(i + 1, H - 1)
                    left = max(j - 1, 0)
                    right = min(j + 1, W - 1)
                    term = f[:, :, k, i, j].view(B, C, 1, 1, 1).expand(
                        B, C, back - front + 1, down - up + 1, right - left + 1)
                    loss += self.mse_loss(term, f[:, :, front:back + 1, up:down + 1, left:right + 1])
        return loss
```
4) I tried both the standard and the normalized VGG weights to calculate the loss.
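As an aside, the triple Python loop above is slow in practice; a vectorized finite-difference version of the same idea could look like the sketch below. It is not numerically identical to the loop (which also includes diagonal neighbors and visits each neighboring pair from both sides), so treat it as an illustration rather than a drop-in replacement for either implementation.

```python
import torch
import torch.nn as nn

class FastLaplacianRegularizer(nn.Module):
    """Vectorized sketch of a 3-D smoothness penalty on f: [B, 12, 8, H, W].
    Sums squared differences between axis-aligned neighbors, each pair
    counted once."""
    def forward(self, f):
        dd = f[:, :, 1:, :, :] - f[:, :, :-1, :, :]   # along grid depth
        dh = f[:, :, :, 1:, :] - f[:, :, :, :-1, :]   # along height
        dw = f[:, :, :, :, 1:] - f[:, :, :, :, :-1]   # along width
        return dd.pow(2).sum() + dh.pow(2).sum() + dw.pow(2).sum()
```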
But the results are not pleasing. It seems that the learned style is different from the style image. After this, I also tried training on only one category of images (i.e., landscapes), and also got an unpleasing result. So I am wondering whether this is caused by the training image pairs. The original paper says they randomly select 50K image pairs for training; are they all from the COCO dataset? I emailed the author and they said they used images collected from the unsplash.com website as style images.
So the question is: for the style transfer task, is my selection of training dataset correct?
I will try to train your code to see if the problem is still there.
The dataset I use contains 3000+ content images and 7 style images, and it works. Though I haven't tried it on other datasets, I don't think the choice of dataset matters. Maybe longer training time helps.
Hello, I re-trained your network using the following images:
1) coco_2014 as content images, 9821 landscapes as style images.
2) 50K randomly selected image pairs from coco_2014 as training samples.
3) 5502 landscapes as content images, 4319 images as style images.
4) 9821 landscapes as content images, only 29 images as style images.
I trained the network for about half a day to one day, but I didn't get the desired style in the predictions; the predictions still look like the input content images. The first and second rows are the content images and style images, and the third row shows the predicted images:
The loss curve looks like this:
So, is the weight of the style loss too small, so that it can't achieve style transfer? Or is the weight of the Laplacian loss too large (loss_l varies from 2 to 0.002), which hurts the optimization process?
Hoping to get your advice.
It is very weird, and I cannot figure out why your training didn't work. If the style loss is small enough, the style will definitely change, even if the generated style is not what you wanted. Generally, at 5~6 epochs you can clearly see the style of the content images change. Here is my result at epoch 2, with the content and style images from the VOC2007 dataset; the style changes. You can try these:
1. Check whether your training pipeline is wrong.
2. Try a small set of styles.
3. And, as you say, you can increase the weight of the style loss.
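To make the last suggestion concrete, increasing the style weight just means scaling that term when the losses are combined; a minimal sketch is below, where `w_content`, `w_style`, and `w_lap` are hypothetical names (not variables from this repository) and `loss_c`, `loss_s`, `loss_l` denote the content, style, and Laplacian losses discussed in this thread.

```python
# Hypothetical weighted combination of the three losses; raising w_style
# pushes the optimization toward stronger stylization.
w_content, w_style, w_lap = 1.0, 10.0, 0.1   # example values, not from the repo
total_loss = w_content * loss_c + w_style * loss_s + w_lap * loss_l
total_loss.backward()
```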
Thank you for sharing the program, it helps a lot. BTW, I found several details that differ from the paper in this implementation:
- The VGG extraction layers: in the paper they are conv1_1, conv2_1, conv3_1, conv4_1, but in your code they seem to be ReLU layers.
- It seems conv1 in the splatting block is shared, while in your code you use two conv layers for content and style respectively.
Also, my result after 50 epochs is similar to your result after 2 epochs, which is gray. Can you give me any suggestions on this? Thank you!
Thanks for your question.
1) I emailed the author and they said they adopted conv1_1, conv2_1, conv3_1, conv4_1 for the content loss and conv4_1 for the style loss. I tried this but found the results unpleasing, so I used the features after the ReLU activations to calculate the losses, which improved things a lot (see the sketch after the code below).
2) I implemented the splatting block following the official description in the paper, which can be written as below:
```python
class SplattingBlock(nn.Module):
    def __init__(self, in_channels, out_channels, shortcut_channel):
        super(SplattingBlock, self).__init__()
        # conv1 is shared between the content and style branches
        self.conv1 = ConvLayer(in_channels, out_channels, kernel_size=3, stride=2)
        self.conv2 = ConvLayer(shortcut_channel + out_channels, out_channels, kernel_size=3, stride=1)

    def forward(self, c, s, shortcut):
        c = F.relu(self.conv1(c))
        s = F.relu(self.conv1(s))
        # AdaIN aligns the content features with the style statistics
        c = adaptive_instance_normalization(c, s)
        # fuse with the shortcut features by concatenation
        c = torch.cat([c, shortcut], dim=1)
        c = F.relu(self.conv2(c))
        return c, s
```
This doesn't matter much in my opinion, and the performance is similar to the reimplementation by @mousecpn.
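Regarding point 1) above, extracting features right after the ReLU activations could look like the sketch below. This is only an illustration, assuming a torchvision VGG19 backbone (indices 1, 6, 11, 20 of `features` correspond to relu1_1, relu2_1, relu3_1, relu4_1 there); the actual layer choice and module names in this repository may differ.

```python
import torch.nn as nn
from torchvision import models

class VGGFeatures(nn.Module):
    """Sketch: return relu1_1, relu2_1, relu3_1, relu4_1 activations of VGG19."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        # slices ending right after each ReLU of interest
        self.slice1 = vgg[:2]     # ... relu1_1
        self.slice2 = vgg[2:7]    # ... relu2_1
        self.slice3 = vgg[7:12]   # ... relu3_1
        self.slice4 = vgg[12:21]  # ... relu4_1

    def forward(self, x):
        h1 = self.slice1(x)
        h2 = self.slice2(h1)
        h3 = self.slice3(h2)
        h4 = self.slice4(h3)
        return h1, h2, h3, h4
```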
And the training results show:
1) For the three-dimensional Laplacian loss, the constraint may be a little strong, which causes the generated results to be only weakly stylized in some scenes.
2) After replacing it with the two-dimensional Laplacian loss and adjusting the weight of the style loss at relu1_1, the generated style is clearly improved.
3) I also tried the total variation loss, which can also handle both of these (see the generic sketch below).
Note that the trained model performs well in some circumstances but fails in others. Maybe my training process still needs improvement and more augmentation tricks are needed.
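For readers unfamiliar with it, a generic total variation loss on the output image is sketched below; this is the standard formulation and not necessarily the exact variant used in the experiments above.

```python
import torch

def total_variation_loss(img):
    """Generic TV loss on an image batch of shape [B, C, H, W]:
    sum of absolute differences between neighboring pixels."""
    tv_h = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().sum()
    tv_w = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().sum()
    return tv_h + tv_w
```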