huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

question about training diffusion-inpainting model #6502

Open D222097 opened 6 months ago

D222097 commented 6 months ago

Hi, everyone! I'm struggling with inpainting/outpainting and I'm quite confused about the input to the model during the training stage. I hope I can get some help 😢

In diffusers/examples/research_projects/multi_subject_dreambooth_inpainting/train_multi_subject_dreambooth_inpainting.py @gzguevara and diffusers/examples/research_projects/dreambooth_inpaint/train_dreambooth_inpaint.py, the input to the 9-channel inpainting model during training is the combination of the GT image (with noise added), the masked image, and the mask.

[images: GT image, masked image, mask]

I am curious why the GT image can be fed into the UNet directly. Even though noise has been added to it, it is still visible to the UNet.

# Encode the GT images into the latent space
latents = vae.encode(batch["pixel_values"].to(dtype=weight_dtype)).latent_dist.sample()
latents = latents * vae.config.scaling_factor

# Encode the masked images (GT with the masked region zeroed out)
masked_latents = vae.encode(batch["masked_images"].reshape(batch["pixel_values"].shape).to(dtype=weight_dtype)).latent_dist.sample()
masked_latents = masked_latents * vae.config.scaling_factor

# Downsample the masks to the latent resolution (1/8 of the pixel resolution)
masks = batch["masks"]
mask = torch.stack([torch.nn.functional.interpolate(mask, size=(args.resolution // 8, args.resolution // 8)) for mask in masks])
mask = mask.reshape(-1, 1, args.resolution // 8, args.resolution // 8)

# Sample Gaussian noise and a random timestep per sample, then apply the forward diffusion process
noise = torch.randn_like(latents)
bsz = latents.shape[0]
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device)
timesteps = timesteps.long()
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# Concatenate along the channel dim: 4 (noisy latents) + 1 (mask) + 4 (masked latents) = 9 channels
latent_model_input = torch.cat([noisy_latents, mask, masked_latents], dim=1)
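
For reference, the lines that follow in the same script show what the model is actually optimized against (lightly paraphrased here, assuming the default "epsilon" prediction type): the UNet sees the noisy GT as input, but the regression target is the noise itself.

import torch.nn.functional as F

# The 9-channel input goes into the UNet together with the timestep and the
# text embeddings...
noise_pred = unet(latent_model_input, timesteps, encoder_hidden_states).sample

# ...but the target is the Gaussian noise that was mixed into the GT latents
# (noisy_latents = sqrt(alpha_prod_t) * latents + sqrt(1 - alpha_prod_t) * noise),
# so the model is rewarded for separating noise from signal, not for copying GT.
loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")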

Using the images above as an example: during training the input is a car image and the expected output is the same car image. But at inference time, users can add objects to the image (e.g. the input is an image unrelated to cars, an arbitrary object or just background, and the expected output is a car image). There is a gap between training and inference.

I think it may benefit from the text-guidance effect, but I still have doubts. On the one hand, the model needs the GT to be optimized, yet in other generative models the GT is usually used as a target rather than as a direct input to the model. On the other hand, the diffusion model predicts Gaussian noise, so there seems to be no other way for the diffusion model to be constrained by the GT.

When it comes to the outpainting task, the gap between training and inference is even bigger: if I train on the combination of the GT (with noise added), the masked image, and the mask, what should I pad the region outside the original (unmasked) image with at inference time? I don't understand how the model avoids learning a simple mapping. I'd be grateful if anyone could give me advice.
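
A minimal sketch of how the extra channels can be assembled at inference time, mirroring the training code above. It assumes the convention of the Stable Diffusion inpainting pipeline, where the masked image is the init image with the to-be-generated region zeroed out; for outpainting the padding values therefore never reach the model, because they are masked to zero before encoding. (make_inpaint_channels is a hypothetical helper, not part of the library.)

import torch
import torch.nn.functional as F

def make_inpaint_channels(image, mask, vae):
    """image: (B, 3, H, W) in [-1, 1]; mask: (B, 1, H, W), 1 = generate, 0 = keep.
    For outpainting, `image` is the original picture pasted onto a larger canvas;
    the padding values are irrelevant because they are zeroed out below."""
    # Same convention as the training code above: zero out the region to generate
    masked_image = image * (mask < 0.5)
    masked_latents = vae.encode(masked_image).latent_dist.sample() * vae.config.scaling_factor
    mask_latent = F.interpolate(mask, size=masked_latents.shape[-2:])
    # The only channel that differs from training: the first 4 channels start
    # from pure noise instead of noisy GT latents, which matches the
    # high-timestep end of training where the GT is almost entirely noise anyway.
    latents = torch.randn_like(masked_latents)
    return torch.cat([latents, mask_latent, masked_latents], dim=1)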

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

yiyixuxu commented 5 months ago

Hi: you can use https://github.com/huggingface/diffusers/discussions for questions!

YiYi

BeALigh commented 4 months ago

Hi, I'm trying to train a LoRA with diffusers/examples/research_projects/multi_subject_dreambooth_inpainting/train_multi_subject_dreambooth_inpainting.py, and I referred to diffusers/examples/research_projects/dreambooth_inpaint/README.md, but I encountered some problems, such as:

  File "/ai/data/diffusers/examples/research_projects/dreambooth_inpaint/train_dreambooth_inpaint_lora.py", line 834, in <module>
    main()
  File "/ai/data/diffusers/examples/research_projects/dreambooth_inpaint/train_dreambooth_inpaint_lora.py", line 716, in main
    for step, batch in enumerate(train_dataloader):
  File "/ai/data/anaconda3/envs/lora/lib/python3.10/site-packages/accelerate/data_loader.py", line 384, in __iter__
    current_batch = next(dataloader_iter)
  File "/ai/data/anaconda3/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/ai/data/anaconda3/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/ai/data/anaconda3/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/ai/data/anaconda3/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/ai/data/diffusers/examples/research_projects/dreambooth_inpaint/train_dreambooth_inpaint_lora.py", line 365, in __getitem__
    example["instance_prompt_ids"] = self.tokenizer(
  File "/ai/data/anaconda3/envs/lora/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2823, in __call__
    raise ValueError("You need to specify either `text` or `text_target`.")
ValueError: You need to specify either `text` or `text_target`.  

And diffusers==0.27.0.dev0. I don't know how to deal with it; I'd be grateful if anyone could give me advice.
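
For anyone hitting the same traceback: the transformers error is raised when the tokenizer is called with text=None, and in this script the text comes from the dataset's instance prompt. A hypothetical defensive check (not the script's actual code), assuming the cause is a missing or empty --instance_prompt argument:

# In DreamBoothDataset.__getitem__ the failing call is roughly:
#     example["instance_prompt_ids"] = self.tokenizer(self.instance_prompt, ...)
# so self.instance_prompt must be a non-None string.
if not self.instance_prompt:  # hypothetical guard, assuming a missing prompt
    raise ValueError(
        "instance_prompt is empty; pass --instance_prompt when launching the script."
    )
example["instance_prompt_ids"] = self.tokenizer(
    self.instance_prompt,
    padding="do_not_pad",
    truncation=True,
    max_length=self.tokenizer.model_max_length,
).input_ids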

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.