Closed by theAdamColton 1 year ago
This is something I also never quite understood about the original paper, nor have I explored it myself, so I can't really answer this question. It could be something as naive as passing a zero tensor to the final decoder as a substitute for the lower-level codes, but only the original authors know.
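A minimal sketch of the zero-substitution idea, assuming PyTorch and the channel sizes described below (the concrete shapes, 64 top channels plus 128 middle channels, are hypothetical and not taken from the repo):

```python
import torch
import torch.nn.functional as F

# Hypothetical top-level quantized code: batch 1, 64 channels, 8x8 spatial grid.
top = torch.randn(1, 64, 8, 8)

# Double upscale the top code (8 -> 32) so it matches the middle level's resolution.
top_up = F.interpolate(top, scale_factor=4, mode="nearest")

# Zero tensor standing in for the upscaled middle-level code.
mid_zeros = torch.zeros(1, 128, 32, 32)

# Concatenate along the channel dimension to produce the 192 channels
# the final decoder expects, then this tensor would be fed to the decoder.
decoder_input = torch.cat([mid_zeros, top_up], dim=1)
print(decoder_input.shape)  # torch.Size([1, 192, 32, 32])
```

Whether zeros are the right stand-in is an open question; the codebook mean or a learned placeholder embedding might behave differently, since the decoder was never trained with an all-zero middle level.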
Thanks for bringing this to my attention. I am currently working on a refactor of this repo (see #5), so I might investigate this once that is done. There are quite a few details in the paper that are unclear, and we may never know for sure how they were handled for the paper.
The original paper had this cool graphic in it, which showed what I believe is a decoded representation of different parts of the network. But I don't understand how, in practice, you could obtain a decoded image using only the top-level FFHQ encoder representation. In the case of the three-level FFHQ model, the final decoder layer is applied to a concatenation of the upscaled middle-level code and the doubly upscaled top-level code, and expects 192 input channels.
Is there a way, using only information from the top-level quantized representation, to get an image out of the network?