You used an EG3D-based triplane feature grid. From the public EG3D code repo, I found that the mapping network takes two inputs: z, and a conditioning variable c (in EG3D, c is a 25-dimensional vector representing the camera parameters).
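For context, EG3D's 25-dimensional c is the flattened 4×4 camera-to-world extrinsics matrix (16 values) concatenated with the flattened 3×3 intrinsics matrix (9 values). A minimal sketch (the specific intrinsics values below are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical camera: identity pose plus normalized intrinsics.
cam2world = np.eye(4, dtype=np.float32)           # 4x4 extrinsics -> 16 values
intrinsics = np.array([[4.26, 0.0, 0.5],
                       [0.0, 4.26, 0.5],
                       [0.0, 0.0, 1.0]], dtype=np.float32)  # 3x3 -> 9 values

# EG3D-style conditioning vector: 16 + 9 = 25 dimensions.
c = np.concatenate([cam2world.reshape(-1), intrinsics.reshape(-1)])
print(c.shape)  # (25,)
```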
I wonder whether, in your model, you feed only the globally averaged 1D feature vector to the mapping network, or whether you use another design (e.g., z is a random vector and the 1D feature vector is the conditioning variable c)?
How did you apply the LPIPS loss to the rendering outputs? Do you render image patches during training?
We use a pre-trained ResNet18 backbone to extract a 512-dimensional vector as z, and we do not use c in our case.
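So the image feature replaces the random latent entirely: the 512-D pooled ResNet18 feature goes through an unconditional StyleGAN2-style mapping MLP. A minimal NumPy sketch under that assumption (random weights stand in for trained ones; the layer count and leaky-ReLU slope follow StyleGAN2 conventions, not confirmed details of this model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 512-D globally average-pooled ResNet18 feature used as z.
z = rng.standard_normal(512).astype(np.float32)

def mapping_network(z, num_layers=8, dim=512):
    """Unconditional StyleGAN2-style mapping MLP: z -> w, no conditioning c."""
    # Pixel-norm on the input latent, as in StyleGAN2.
    w = z / np.linalg.norm(z) * np.sqrt(len(z))
    for _ in range(num_layers):
        W = rng.standard_normal((dim, dim)).astype(np.float32) * np.sqrt(2.0 / dim)
        h = W @ w
        w = np.maximum(0.2 * h, h)  # leaky ReLU, slope 0.2
    return w

w = mapping_network(z)
print(w.shape)  # (512,)
```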
We directly render the whole image by leveraging the human prior, and then apply LPIPS to it. Based on our previous experience with NeRF training, rendering image patches instead should achieve similar performance.
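For reference, the patch-based alternative usually means sampling a contiguous square of pixels each iteration, rendering only those rays, and applying LPIPS to the rendered patch against the matching ground-truth crop. A small sketch of the patch-coordinate sampling (hypothetical helper, not from the repo):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patch_coords(H, W, patch=64):
    """Pick a random contiguous patch of pixel coordinates.
    Rendering only these rays keeps LPIPS applicable, since the
    perceptual loss needs a spatially coherent image region."""
    top = int(rng.integers(0, H - patch + 1))
    left = int(rng.integers(0, W - patch + 1))
    ys, xs = np.meshgrid(np.arange(top, top + patch),
                         np.arange(left, left + patch), indexing="ij")
    return ys, xs

ys, xs = sample_patch_coords(256, 256, patch=64)
print(ys.shape)  # (64, 64)
```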