CDitzel opened this issue 3 years ago
I agree with you, so I temporarily set kl_weight to zero. Otherwise the recon_loss cannot be reduced. In this version, the kl_loss contradicts the recon_loss.
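For reference, here is a minimal sketch of the loss combination being discussed (the function name, shapes, and MSE reconstruction term are my own choices for illustration, not taken from the repo). Setting kl_weight=0.0 switches the KL term off exactly as in the workaround above:

```python
import torch
import torch.nn.functional as F
from math import log

def dvae_loss(recon, images, logits, kl_weight=0.0):
    # Reconstruction term: pulls the decoder output toward the input.
    recon_loss = F.mse_loss(recon, images)

    # KL term: pushes the per-position posterior over codebook entries
    # toward a uniform prior. This is the term that fights recon_loss.
    log_qy = F.log_softmax(logits, dim=-1)
    num_tokens = logits.shape[-1]
    log_uniform = torch.full_like(log_qy, -log(num_tokens))
    kl_loss = F.kl_div(log_uniform, log_qy, log_target=True,
                       reduction='batchmean')

    # kl_weight=0.0 disables the KL term, as described in the comment above.
    return recon_loss + kl_weight * kl_loss
```

With random logits the KL term is strictly positive, which is why turning it on immediately competes with driving recon_loss down.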
Maybe someone could email the paper authors to ask whether this loss was used at all?
It must have been used, as they mention an increasing weight parameter in the paper.
Still, I am trying, but I can't seem to figure out his email address. The paper says
Aditya Ramesh <_@adityaramesh.com
so I tried Aditya_Ramesh@adityaramesh.com and Aditya.Ramesh@adityaramesh.com,
but those addresses don't exist...
Has anyone had any more insights/updates on this? I'm running into the exact same issue (on an independent DALL-E repro) and bashing my head against the wall trying to understand the behaviour!
In the train_vae script the kl_loss is set to zero via the weight parameter, and in my own extensive runs of experiments I found that including the KL term does more harm than good. @karpathy also mentioned trouble getting it to work properly.
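Regarding the increasing weight parameter mentioned above: one way to reconcile "the KL term hurts early training" with still using it is to ramp the weight up from zero instead of hard-coding it to zero. A hypothetical linear warm-up (the step count and maximum value here are my own placeholders, not confirmed numbers from the authors):

```python
def kl_weight_schedule(step, warmup_steps=5000, max_weight=6.6):
    # Linearly ramp the KL weight from 0 to max_weight over warmup_steps,
    # then hold it constant. Both values are illustrative placeholders.
    return max_weight * min(1.0, step / warmup_steps)
```

Early on this behaves like the kl_weight=0 workaround, so recon_loss can come down first, while the prior is still enforced later in training.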
did anyone achieve any progress on this matter?
Also, shouldn't this line
https://github.com/lucidrains/DALLE-pytorch/blob/7658e60f3a15c74ce96ac5cb661e7dd7101b50b6/dalle_pytorch/dalle_pytorch.py#L196
rather use the soft_one_hot values instead of the raw logits?
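To make the two options concrete, here is a small sketch (the shapes and the 1e-10 clamp are my own choices) computing the KL-to-uniform term once from the raw logits, as the linked line does, and once from the gumbel-softmax samples:

```python
import torch
import torch.nn.functional as F
from math import log

torch.manual_seed(0)
logits = torch.randn(2, 16, 512)   # (batch, positions, codebook size)
num_tokens = logits.shape[-1]
log_uniform = torch.full_like(logits, -log(num_tokens))

# Option 1 (the linked line): KL of softmax(logits) against the uniform prior.
log_qy = F.log_softmax(logits, dim=-1)
kl_logits = F.kl_div(log_uniform, log_qy, log_target=True,
                     reduction='batchmean')

# Option 2 (the suggestion here): KL of the soft gumbel-softmax samples.
soft_one_hot = F.gumbel_softmax(logits, tau=1.0, dim=-1)
kl_samples = F.kl_div(log_uniform, (soft_one_hot + 1e-10).log(),
                      log_target=True, reduction='batchmean')
```

Note that option 2 includes the gumbel noise, so it is a stochastic (and noisier) quantity rather than the deterministic KL of the posterior itself.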
Also, I find it a little confusing that we anneal the temperature of the gumbel-softmax, steering it toward one-hot sampling, while at the same time we try to encourage the distribution to be close to a uniform prior. Isn't this a contradiction?
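The tension is easy to see numerically: as the temperature is annealed down, the gumbel-softmax samples approach hard one-hot vectors, i.e. they move away from the uniform distribution the KL prior pulls toward. A quick illustration (the temperatures and shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 512)

for tau in (5.0, 1.0, 0.1):
    sample = F.gumbel_softmax(logits, tau=tau, dim=-1)
    # Peakedness of the sample: 1/512 would be perfectly uniform,
    # 1.0 would be a hard one-hot vector.
    print(f"tau={tau}: mean max prob = {sample.max(dim=-1).values.mean():.3f}")
```

That said, the two pressures act on different things: annealing sharpens individual samples, while the KL term (at least in the logits variant) encourages the codebook to be used uniformly on average, so they are not strictly contradictory even though they pull sampled distributions in opposite directions.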