universome / epigraf

[NeurIPS 2022] Official PyTorch implementation of EpiGRAF
https://universome.github.io/epigraf

Flattened image results on Cats dataset #12

Closed · hsi1032 closed this 1 year ago

hsi1032 commented 1 year ago

Hi, I ran into some issues when training your model on the Cats dataset.

I used the Cats dataset that you uploaded, at 128x128 resolution, with the default configuration from "cats_aligned.yaml". For the 128 resolution, I resized the images with PIL.Image.resize using the Lanczos resampling filter, and did not change any of the camera pose labels in "dataset.json". The only other thing I changed is the tri-plane resolution, from the default 512 down to 256 (by editing "configs/model/epigraf.yaml").
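For reference, the resize step looked roughly like this (a minimal sketch; the directory paths are placeholders rather than the actual dataset layout):

```python
from pathlib import Path
from PIL import Image

src_dir = Path("cats/images_512")   # placeholder input directory
dst_dir = Path("cats/images_128")   # placeholder output directory
dst_dir.mkdir(parents=True, exist_ok=True)

for src in src_dir.glob("*.png"):
    img = Image.open(src).convert("RGB")
    # downsample with the Lanczos resampling filter, as described above
    img = img.resize((128, 128), resample=Image.LANCZOS)
    img.save(dst_dir / src.name)
```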

The results I got are as follows:

https://user-images.githubusercontent.com/28529188/203942709-915c6993-301a-4311-8958-fb2292ae4506.mp4

Can you give me some insight into why the model fails to learn the volume density?

Thanks,

universome commented 1 year ago

Hi @hsi1032, could you please tell me which command you used to launch training? Such things happen when one does not enable camera conditioning in the discriminator.
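Concretely, camera conditioning is controlled by the camera_cond keys in the model config; below is a sketch of the relevant settings, assuming the standard layout of this repo's configs:

```yaml
# sketch of the relevant keys, assuming the standard config layout
model:
  generator:
    camera_cond: true    # optional
  discriminator:
    camera_cond: true    # the discriminator conditioning is the important one here
```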

hsi1032 commented 1 year ago

Hi, thanks for your reply.

I found that my configuration file indicates I did not use the camera pose condition for either G or D (generator: camera_cond: false / discriminator: camera_cond: false).

Then I'm curious whether there is any way to avoid these "flattened results" without camera pose conditioning. As far as I know, previous methods such as pi-GAN work without this conditioning technique, so I wonder why this model produces flattened results. (Or is it perhaps a tendency of the tri-plane generator to synthesize flattened results?)

I attach my generator and discriminator configuration from "experiments/{experiments_name}/experiment_config.yaml" below.

```yaml
model:
  generator:
    fmaps: 1.0
    cmax: 512
    cbase: 32768
    optim:
      betas:
        - 0.0
        - 0.99
    patch: ${training.patch}
    dataset: ${dataset}
    w_dim: 512
    camera_cond: false
    camera_cond_drop_p: 0.0
    camera_cond_noise_std: 0.0
    camera_cond_spoof_p: 0.5
    map_depth: 2
    backbone: stylegan2
    num_ray_steps: 48
    clamp_mode: softplus
    nerf_noise_std_init: 1.0
    nerf_noise_kimg_growth: 5000
    use_noise: true
    tri_plane:
      res: 256
      feat_dim: 32
      fp32: true
      view_hid_dim: 0
      posenc_period_len: 0
      mlp:
        n_layers: 2
        hid_dim: 64
    bg_model:
      type: null
      output_channels: 4
      coord_dim: 4
      num_blocks: 2
      cbase: 32768
      cmax: 128
      num_fp16_blocks: 0
      fmm:
        enabled: false
        rank: 3
        activation: demod
      posenc_period_len: 64.0
      num_steps: 8
      start: 1.0
  discriminator:
    fmaps: 0.5
    cmax: 512
    cbase: 32768
    patch: ${training.patch}
    num_additional_start_blocks: 1
    mbstd_group_size: 4
    camera_cond: false
    camera_cond_drop_p: 0.0
    camera_cond_noise_std: 0.0
    hyper_mod: true
    optim:
      lr: 0.002
      betas:
        - 0.0
        - 0.99
  loss_kwargs:
    pl_weight: 0.0
    blur_init_sigma: 10
    blur_fade_kimg: 200
  name: epigraf
```

universome commented 1 year ago

Hi @hsi1032, I am sorry for replying so late... There are several factors which together contribute to flatness: 1) the tri-plane representation itself (see also Figure 4 in EG3D or Figure 4 in GMPI); 2) a narrow camera distribution; and 3) using infinite depth in volume rendering. The last one leads to flattened results because the generator then has a much easier solution available: drawing everything on the back of the viewing frustum. If you remove one of these factors, things will get biased towards better geometry, depending on how wide your camera distribution is. For example, we didn't need any camera conditioning when training on Megascans, because it had good enough camera coverage.
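To make factor 3 concrete: the depth interval a volume renderer samples determines where the generator can place density. Below is a minimal, illustrative sketch of stratified depth sampling between finite near/far planes; it is not the actual renderer in this repo, and the default bounds are made up:

```python
import torch

def sample_ray_depths(batch_size: int, num_rays: int, num_steps: int,
                      ray_start: float = 0.88, ray_end: float = 1.12) -> torch.Tensor:
    """Stratified depth sampling between finite near/far planes (illustrative only)."""
    # evenly spaced depths along each ray: (num_steps,) -> (B, R, S)
    t = torch.linspace(ray_start, ray_end, num_steps)
    t = t.expand(batch_size, num_rays, num_steps).clone()
    # jitter each sample inside its bin (stratified sampling)
    bin_size = (ray_end - ray_start) / (num_steps - 1)
    t = t + torch.rand_like(t) * bin_size
    return t  # depths at which the radiance field is queried for (rgb, sigma)

# e.g., 48 samples per ray, matching num_ray_steps: 48 in the config above:
# depths = sample_ray_depths(batch_size=4, num_rays=64 * 64, num_steps=48)
```

With a tight [ray_start, ray_end] interval around the object, the easy solution of painting everything on the far plane is no longer available, which biases the generator towards learning real geometry.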

hsi1032 commented 1 year ago

I greatly appreciate your kind answer.

Thanks,