liuqk3 / PUT

Papers: 'Transformer based Pluralistic Image Completion with Reduced Information Loss' (TPAMI 2024) and 'Reduce Information Loss in Transformers for Pluralistic Image Inpainting' (CVPR 2022)
MIT License

A confusing question in transformer training #15

Open · forever-rz opened this issue 1 year ago

forever-rz commented 1 year ago

Thanks for your contribution, but there is a problem when I train it on FFHQ. Once the mask ratio is larger, it seems that only part of the completed result gets repaired; the unrepaired part stays black and no new content seems to be generated. Is this normal? e.g. 1 (epoch 12): the first and second results from the left retain some strange black regions.

[images: masked input, completed image, reconstruction image]

forever-rz commented 1 year ago

This bad visual effect becomes increasingly apparent later in training. e.g. 2 (epoch 19):

[images: masked input, completed image, reconstruction image, input image]

liuqk3 commented 1 year ago

Hi @forever-rz, thanks for your interest. It seems that the second codebook is being used for quantization while training the transformer. For FFHQ, it should not take that long to get reasonable inpainting results.
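If it helps with debugging, below is a minimal sketch (my own illustration, not the repository's tooling; the tensor shape and variable names are assumptions) for checking whether the token indices fed to the P-VQVAE decoder fall into the mask-reserved half of the codebook. Assuming the quantizer uses n_e: 1024 with masked_embed_start: 512 (as in the config shown later in this thread), any index >= 512 would come from the second codebook:

```python
import torch

def codebook_usage(token_indices: torch.Tensor, masked_embed_start: int = 512) -> int:
    """Count how many sampled tokens come from the second (mask) codebook.

    `token_indices` is assumed to be the LongTensor of sampled indices
    (e.g. shape (B, 32*32)) that gets passed to the P-VQVAE decoder.
    """
    n_mask = int((token_indices >= masked_embed_start).sum())
    total = token_indices.numel()
    print(f"image-codebook tokens: {total - n_mask}/{total}")
    print(f"mask-codebook tokens : {n_mask}/{total}  (ideally 0 when sampling)")
    return n_mask

# Stand-in example with random indices in place of real sampler output:
codebook_usage(torch.randint(0, 1024, (1, 32 * 32)))
```

If many sampled indices land at or above masked_embed_start, that would be consistent with the black regions, since those embeddings were trained to represent masked (unknown) content.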

forever-rz commented 1 year ago

@liuqk3 Thanks for your help! Can you tell me how I should use just one codebook? I only modified three parts of the transformer training configuration (batch_size, sample_iterations, save_epochs); the rest follows the pre-trained model configuration. This is really confusing to me.

forever-rz commented 1 year ago

My configuration is shown below:

dataloader:
  batch_size: 16
  data_root: data
  num_workers: 1
  train_datasets:

model:
  target: image_synthesis.modeling.models.masked_image_inpainting_transformer_in_feature.MaskedImageInpaintingTransformer
  params:
    n_layer: 30
    content_seq_len: 1024
    n_embd: 512
    n_head: 8
    num_token: 512
    embd_pdrop: 0.0
    attn_pdrop: 0.0
    resid_pdrop: 0.0
    attn_content_with_mask: False
    mlp_hidden_times: 4
    block_activate: GELU2
    random_quantize: 0.3
    weight_decay: 0.01
    content_codec_config:
      target: image_synthesis.modeling.codecs.image_codec.patch_vqgan.PatchVQGAN
      params:
        ckpt_path: OUTPUT/pvqvae_ffhq/checkpoint/last.pth
        trainable: False
        token_shape: [32, 32]
        combine_rec_and_gt: True
        quantizer_config:
          target: image_synthesis.modeling.codecs.image_codec.patch_vqgan.VectorQuantizer
          params:
            n_e: 1024
            e_dim: 256
            masked_embed_start: 512
            embed_ema: True
            get_embed_type: retrive
            distance_type: euclidean
        encoder_config:
          target: image_synthesis.modeling.codecs.image_codec.patch_vqgan.PatchEncoder2
          params:
            in_ch: 3
            res_ch: 256
            out_ch: 256
            num_res_block: 8
            res_block_bottleneck: 2
            stride: 8
        decoder_config:
          target: image_synthesis.modeling.codecs.image_codec.patch_vqgan.PatchConvDecoder2
          params:
            in_ch: 256
            out_ch: 3
            res_ch: 256
            num_res_block: 8
            res_block_bottleneck: 2
            stride: 8
            up_layer_with_image: true
            encoder_downsample_layer: conv

solver:
  adjust_lr: none
  base_lr: 0.0
  find_unused_parameters: false
  max_epochs: 250
  optimizers_and_schedulers:
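As a quick sanity check on the config above (a sketch for illustration only; the file path is an assumption, not a path from the repo), one can verify that the transformer's num_token matches the quantizer's masked_embed_start, so that sampled indices can never point into the second (mask) codebook:

```python
import yaml  # pip install pyyaml

# Hypothetical path to the transformer config shown above.
with open("configs/put_ffhq_transformer.yaml") as f:
    cfg = yaml.safe_load(f)

model_params = cfg["model"]["params"]
quant_params = model_params["content_codec_config"]["params"]["quantizer_config"]["params"]

num_token = model_params["num_token"]                     # transformer vocabulary size (512)
masked_embed_start = quant_params["masked_embed_start"]   # first index of the mask codebook (512)

# If num_token were larger than masked_embed_start, the transformer could emit
# indices that land in the mask codebook, which decodes to mask-like content.
assert num_token == masked_embed_start, (
    f"num_token={num_token} vs masked_embed_start={masked_embed_start}: "
    "transformer predictions could reach the mask codebook"
)
print("Config looks consistent: predicted tokens stay inside the image codebook.")
```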

forever-rz commented 1 year ago

The strange thing is that when the mask ratio is small, like in the green circle, there are no such problems and only one codebook seems to be used. So why might two codebooks be used when the mask ratio is large (in the red circle)? What is wrong with my settings? e.g. (epoch 8):

[images: masked input, completed image]

forever-rz commented 1 year ago

@liuqk3 Thanks for the reply, but I've carefully compared the posted model with my training process and really don't notice any difference. So I was wondering whether it could be the parameters: in the P-VQVAE training phase keep_ratio is [0.0, 0.5], while in the transformer training phase it is [0.3, 0.6]. Could this cause the problem?

liuqk3 commented 1 year ago

Hi @forever-rz. Sorry for the delayed reply.

keep_ratio only affects the number of remaining pixels in an image (see the short sketch after this comment); it should not cause such artifacts. After having a look at your configs, I do not find anything wrong. Here are my questions and suggestions:

1) Did you use our provided P-VQVAE, or did you train it yourself?
2) Have you checked the reconstruction capability of the P-VQVAE you used?
3) Can you provide the cross-entropy loss curves of the transformer?
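For context on the keep_ratio point, here is a tiny illustration (mine, not the repository's mask generator; real training masks are irregular regions rather than per-pixel noise) of what the ratio controls, i.e. the fraction of pixels that remain visible:

```python
import numpy as np

def random_keep_mask(h: int, w: int, keep_ratio=(0.3, 0.6), seed=None):
    """Sample a per-pixel keep mask whose kept fraction lies in keep_ratio.

    The ranges mentioned in this thread: [0.0, 0.5] for P-VQVAE training
    and [0.3, 0.6] for transformer training.
    """
    rng = np.random.default_rng(seed)
    keep = rng.uniform(*keep_ratio)          # fraction of pixels to keep
    return rng.random((h, w)) < keep         # True = pixel stays visible

mask = random_keep_mask(256, 256, keep_ratio=(0.3, 0.6), seed=0)
print(f"kept fraction: {mask.mean():.2f}")   # roughly within [0.3, 0.6]
```

Since this only changes how much of the image is hidden during training, a mismatch between the two ranges should not by itself produce the black artifacts.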

forever-rz commented 1 year ago

Hi @liuqk3, I'm sorry that I temporarily put the experiment aside because I couldn't figure out the cause of this problem. Today I carefully checked the previous experiment, and the data relevant to the three questions you raised is as follows.

1) Instead of using the provided P-VQVAE, I trained a new P-VQVAE model myself and added some attention blocks.

2) The reconstruction results of my P-VQVAE model are as follows.

FFHQ: [images: (a) input, (b) mask, (c) reference_input, (d) reconstruction]

Places2: [images: (a) input, (b) mask, (c) reference_input, (d) reconstruction]
ImageNet: [images: (a) input, (b) mask, (c) reference_input, (d) reconstruction]

3) The transformer training loss based on my P-VQVAE is as follows (ImageNet is too big, so I gave it up).

FFHQ: [loss curve image]
Places2: [loss curve image]

forever-rz commented 1 year ago

It seems to me that the reconstruction results are not bad, so I don't understand why the transformer's training results are so wrong. Although I made some changes to the P-VQVAE, the transformer has not been changed at all.

liuqk3 commented 1 year ago

@forever-rz, I do not know how many epochs you have trained on FFHQ and Places2. You can try to visualize the inpainting results of the trained model.
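In case it is useful, a generic visualization sketch along those lines could look like the following (the model interface here is an assumption for illustration, not the repo's actual API):

```python
import torch
from torchvision.utils import save_image

@torch.no_grad()
def dump_samples(model, image, mask, out_path="inpainting_check.png"):
    """Save masked input, completed result, and reconstruction side by side.

    Assumes `image` is (B, 3, H, W) in [0, 1], `mask` is (B, 1, H, W) with
    1 = known pixel, and that `model` returns a dict with 'completed' and
    'reconstruction' tensors of the same shape (interface assumed).
    """
    masked = image * mask
    out = model(masked, mask)
    grid = torch.cat([masked, out["completed"], out["reconstruction"]], dim=0)
    save_image(grid, out_path, nrow=image.size(0))
```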

myhansu commented 1 year ago

Hi, do you have any updates on this issue? I have also encountered the same problem with a custom dataset; the reconstruction results are much better than the completed results.

UESTC-Med424-JYX commented 1 month ago

I also encountered good reconstruction but poor generation results in a similar task. I saw that our loss curves are basically the same. Did you solve this problem in the end?