lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch
MIT License

KL loss correction #99

Open CDitzel opened 3 years ago

CDitzel commented 3 years ago

https://github.com/lucidrains/DALLE-pytorch/blob/995bfe1789243cbc838943cdc748daab406aae3e/dalle_pytorch/dalle_pytorch.py#L195

I am fairly certain that this should instead read

logits = rearrange(logits, 'b n h w -> (b h w) n')

since we are summing over the latent dimension, i.e. the probs / encoder outputs, and averaging over the observations, i.e. over every spatial position separately and for every sample in the batch. The docs are a little messy on this, but from what I understand, batchmean requires reshaping so that all examples are condensed into the batch dimension.
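To make the intended reduction concrete, here is a minimal sketch of it (shapes are illustrative and this is not the repo's exact code): sum over the codebook / latent dimension, then mean over every sample-and-position observation.

```python
import math
import torch
import torch.nn.functional as F
from einops import rearrange

# illustrative shapes: batch of 4, codebook of 8192 tokens, 32x32 latent grid
logits = torch.randn(4, 8192, 32, 32)
num_tokens = logits.shape[1]

# proposed reshape: fold batch *and* spatial dims into one observation dimension
logits = rearrange(logits, 'b n h w -> (b h w) n')

log_qy = F.log_softmax(logits, dim=-1)        # log q(y) for each observation
log_uniform = math.log(1. / num_tokens)       # log of the uniform prior over codebook entries

# KL(q || uniform): sum over the codebook dimension, then mean over all b * h * w observations
kl_div = (log_qy.exp() * (log_qy - log_uniform)).sum(dim=-1).mean()
```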

lucidrains commented 3 years ago

@CDitzel ohh got it, even if this were an error, that would just mean you could compensate by adjusting kl_div_loss_weight by some factor, but on what order of magnitude?
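If the only difference between the two layouts is whether that mean runs over b or over b * h * w observations, the compensating factor would be exactly h * w (e.g. 1024 for a 32x32 latent grid). A toy check with the manual formulation above, using hypothetical shapes:

```python
import torch
import torch.nn.functional as F

b, n, h, w = 2, 16, 8, 8                         # hypothetical toy shapes
logits = torch.randn(b, n, h, w)
log_qy = F.log_softmax(logits, dim=1)            # log q(y), with the codebook dim at position 1
log_uniform = torch.log(torch.tensor(1. / n))

per_position = (log_qy.exp() * (log_qy - log_uniform)).sum(dim=1)   # KL per (b, h, w) position

mean_over_batch_only = per_position.sum(dim=(1, 2)).mean()   # divide the total by b only
mean_over_all_obs = per_position.mean()                      # divide the total by b * h * w

print(mean_over_batch_only / mean_over_all_obs)              # = h * w = 64
```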

lucidrains commented 3 years ago

@CDitzel but yes, I have noticed that with this loss present, the network doesn't learn that well :(

lucidrains commented 3 years ago

@CDitzel rumor has it that the author of DALL-E was asked about this loss, but didn't give any straight answers

CDitzel commented 3 years ago

I have a headache, Phil. The math demands that this term be there, but when it is present, the results are actually worse. Hate it...

CDitzel commented 3 years ago

@CDitzel ohh got it, even if this were an error, that would just mean you could compensate by adjusting kl_div_loss_weight by some factor, but on what order of magnitude?

I'm not sure. I keep getting KL losses below zero oO
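As an aside on the below-zero values: F.kl_div only behaves like a true (non-negative) KL when both of its arguments are normalized (log-)distributions; with anything unnormalized in the input slot, the result can dip below zero. A toy illustration, not tied to the repo's code:

```python
import math
import torch
import torch.nn.functional as F

logits = torch.randn(2, 5)
log_q = F.log_softmax(logits, dim=-1)
log_p = torch.full((1, 5), math.log(1. / 5))   # log of a uniform distribution

# both arguments are normalized log-distributions -> a true KL, never negative
print(F.kl_div(log_p, log_q, reduction='batchmean', log_target=True))

# unnormalized first argument (raw scores, or a stray offset) -> no longer a KL, goes below zero here
print(F.kl_div(log_p + 3.0, log_q, reduction='batchmean', log_target=True))
```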

robvanvolt commented 3 years ago

Maybe we could write to the OpenAI team and ask for a straight answer? Maybe they'd disclose the information within a secure two-person email conversation?

afiaka87 commented 3 years ago

@CDitzel rumor has it that the author of DALL-E was asked about this loss, but didn't give any straight answers

Yeah, I've seen the video of it. He gets somewhat dodgy the moment he's asked about it. I can't attest to the rigor of the math, however.

CDitzel commented 3 years ago

The KL term must have been used, as they mention an increasing weight parameter for it in the paper.

Still, I am trying, but I can't seem to figure out his email address. In the paper it says

Aditya Ramesh <_@adityaramesh.com

so I tried Aditya_Ramesh@adityaramesh.com, Aditya.Ramesh@adityaramesh.com

but they don't exist...

CDitzel commented 3 years ago

Even without including the KL term, I am wondering if anyone else has observed the following:

If I train the dVAE circumventing the Gumbel-Softmax, i.e. the output of the last 1x1 conv encoder layer is directly multiplied with the codebook, then reconstruction is almost immediately very good.

Whereas when Gumbel is used in between those two steps, the output becomes very blurry and not at all comparable in quality.
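For anyone wanting to reproduce that comparison, here is a rough sketch of the two paths (names, shapes and the temperature are illustrative; path 2 only approximates what the DiscreteVAE forward pass does):

```python
import torch
import torch.nn.functional as F

num_tokens, codebook_dim = 8192, 512
codebook = torch.nn.Embedding(num_tokens, codebook_dim)
logits = torch.randn(4, num_tokens, 32, 32)    # output of the last 1x1 conv encoder layer

# path 1 (Gumbel bypassed): soft-assign each position with a plain softmax over the
# codebook logits and mix the codebook directly
soft_one_hot = F.softmax(logits, dim=1)
z_soft = torch.einsum('bnhw,nd->bdhw', soft_one_hot, codebook.weight)

# path 2 (with Gumbel): sample relaxed one-hot codes via gumbel_softmax at some temperature
# before mixing the codebook -- the noisy case that comes out blurry in the observation above
gumbel_one_hot = F.gumbel_softmax(logits, tau=0.9, dim=1, hard=False)
z_gumbel = torch.einsum('bnhw,nd->bdhw', gumbel_one_hot, codebook.weight)
```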

CDitzel commented 3 years ago

So, did anyone find out his email address? I composed an email but don't know where to send it to.

afiaka87 commented 3 years ago

So, did anyone find out his email address? I composed an email but don't know where to send it to.

:shrug: Nope. Do you really think he'll talk if OpenAI didn't want him to in the first place? Isn't it sort of their MO to keep things just vague enough?

CDitzel commented 3 years ago

I have no idea what's going on at OpenAI, but so far I don't think they are as open to transparent research as their name would suggest...