martinarjovsky / WassersteinGAN

BSD 3-Clause "New" or "Revised" License

Inconsistent loss function from the paper? #72

Open realcrane opened 5 years ago

realcrane commented 5 years ago

Hi,

I don't use torch a lot, and I have a question about the implementation of the discriminator loss at line 211, errD = errD_real - errD_fake, where errD_real is the critic's score on real samples, backpropagated with gradient +1 (line 202: errD_real.backward(one)), and errD_fake is the score on fake samples, backpropagated with gradient -1 (errD_fake.backward(mone)). However, in the paper it seems that errD should be maximized, while here it is minimized?

Thanks
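For readers unfamiliar with the gradient argument used here: in PyTorch, y.backward(g) accumulates g * dy/dp into each parameter p, so backward(one) adds the gradient and backward(mone) subtracts it. A minimal sketch with the modern API (the scalar "score" is illustrative, not the repo's critic):

```python
import torch

# A scalar "score" produced by a single weight, mimicking a critic output.
w = torch.tensor(3.0, requires_grad=True)
score = 2.0 * w                      # d(score)/dw = 2

# backward(g) accumulates g * d(score)/dw into w.grad
score.backward(torch.tensor(1.0))    # like errD_real.backward(one)
print(w.grad)                        # tensor(2.)

w.grad.zero_()
score = 2.0 * w
score.backward(torch.tensor(-1.0))   # like errD_fake.backward(mone)
print(w.grad)                        # tensor(-2.)
```

So the two backward calls together accumulate exactly the gradient of errD_real - errD_fake.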

npmhung commented 5 years ago

I have the same question. My guess is that, because the code flips the sign of both the G and D losses, the discriminator simply learns to give high scores to fake images and low scores to real images instead.

Intuitively, you are just swapping the "labels" of the real and fake images.

I don't know if this is correct. Can anyone confirm?

I found another implementation that does exactly what is stated in the paper: link

realcrane commented 5 years ago

@npmhung, thanks for the link. I had a look at it, but I have some questions about that implementation too. For instance, it seems to use Sigmoid as the last layer of both D and G, whereas the original paper suggests otherwise? The way I understand WGAN, weight clipping is what constrains the critic, not an activation such as Sigmoid. Using one might have numerical consequences, but that is unclear to me as well.
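For reference, the clipping being discussed is the per-parameter clamp the paper (and this repo) applies after every critic update; the critic architecture here is illustrative, but the clamp value c = 0.01 is the paper's default:

```python
import torch.nn as nn

# Illustrative critic; the repo's DCGAN critic differs, but clipping works the same way.
critic = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

# After each critic update, clip every parameter into [-c, c] to enforce
# the Lipschitz constraint (weight clipping, as in the WGAN paper).
c = 0.01
for p in critic.parameters():
    p.data.clamp_(-c, c)
```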

npmhung commented 5 years ago

Can you point to the line where they use sigmoid? I can't seem to find it.

khoadoan commented 4 years ago

I don't think sigmoid is necessary (at least for the critic net, since it only outputs a score). If sigmoid is used for D, it will even slow training down, since the gradient saturates as D becomes more and more confident. For G, one can use either sigmoid or tanh to produce the generated samples, but tanh is better for learning in my opinion. Also, the loss functions for both D and G are the reverse of what the paper describes; however, since both losses are reversed, the result still turns out to be correct (as in the label-flipping explanation above).
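The saturation point can be checked directly: the derivative of sigmoid is s * (1 - s), which collapses toward zero as the pre-activation score grows. A small sketch (the score values are arbitrary):

```python
import torch

# Gradient of sigmoid vanishes as the critic's raw score grows large.
scores = torch.tensor([0.0, 5.0, 20.0], requires_grad=True)
torch.sigmoid(scores).sum().backward()
print(scores.grad)  # roughly [0.25, 6.6e-3, 2.1e-9] -- near-zero at large scores
```

This is why a sigmoid on the critic output would starve the generator of gradient exactly when the critic is doing well, while a raw (linear) score does not.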

ALLinLLM commented 3 years ago

After a lot of debugging, I found this code is not wrong, just confusing:

                errD_real = netD(inputv)
                errD_real.backward(one)

                # train with fake
                noise.resize_(opt.batchSize, nz, 1, 1).normal_(0, 1)
                noisev = Variable(noise, volatile = True) # totally freeze netG
                fake = Variable(netG(noisev).data)
                inputv = fake
                errD_fake = netD(inputv)
                errD_fake.backward(mone)
                errD = errD_real - errD_fake

Notice that mone here is defined as -1 * one.

In fact, the loss in the paper is [image: the critic maximizes E[f_w(x)] - E[f_w(g_θ(z))]].

So it is easier to understand if you just call backward() on loss_d and loss_g directly:

# for D
loss_d = errD_real.mean() - errD_fake.mean()
loss_d.backward()
# for G
loss_g = - errD_fake.mean()
loss_g.backward()
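One way to convince yourself that the two formulations of the critic loss match is to compare the accumulated gradients on a toy linear critic (all names here are illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
real, fake = torch.randn(4, 3), torch.randn(4, 3)

def critic_grads(use_explicit_loss):
    torch.manual_seed(1)              # identical initial weights in both runs
    netD = nn.Linear(3, 1)            # toy critic standing in for DCGAN_D
    out_real = netD(real).mean()
    out_fake = netD(fake).mean()
    if use_explicit_loss:
        # single explicit loss, as suggested above
        (out_real - out_fake).backward()
    else:
        # original repo style: per-term backward with +1 / -1 gradients
        out_real.backward(torch.tensor(1.0))
        out_fake.backward(torch.tensor(-1.0))
    return [p.grad.clone() for p in netD.parameters()]

a = critic_grads(True)
b = critic_grads(False)
assert all(torch.allclose(x, y) for x, y in zip(a, b))
```

Both paths leave identical gradients in the critic's parameters, which is why the repo's code is confusing but not wrong.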

zzachw commented 3 years ago

> @ALLinLLM: After a lot of debugging, I found this code is not wrong, just confusing ... it is easier to understand if you just call backward() on loss_d and loss_g directly:
>
>     # for G
>     loss_g = - errD_fake.mean()
>     loss_g.backward()

I think the loss for G should be:

#for G
loss_g = errD_fake.mean()
loss_g.backward()

as the signs are changed in lines 6 and 11 in the paper.

SuperbTUM commented 2 years ago

Does it mean the loss actually has nothing to do with the label? Like -1 for a fake image.

SuperbTUM commented 2 years ago

> @zzachw: ... I think the loss for G should be:
>
>     # for G
>     loss_g = errD_fake.mean()
>     loss_g.backward()
>
> as the signs are changed in lines 6 and 11 in the paper.

Have you tried modifying the code this way? For me, it was incorrect: it caused problems in gradient propagation for the discriminator. I suspect this is due to the activation functions being defined with inplace=True. One possible fix is to set them to False, but I haven't tried it yet.
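Whether inplace=True is actually the culprit in this repo is unverified, but the general failure mode is real: an in-place operation that overwrites a tensor autograd still needs for the backward pass raises a runtime error. A minimal illustration (sigmoid's backward needs its own output, so overwriting it in place breaks the graph):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward reads its *output* y
y.relu_()              # in-place ReLU bumps y's version counter

try:
    y.sum().backward()
except RuntimeError as e:
    # "... has been modified by an inplace operation"
    print("autograd error:", e)
```

Switching the offending op to its out-of-place form (here, torch.relu) makes the backward pass succeed, which is the same idea as setting inplace=False on the activation layers.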