huanngzh / Parts2Whole

[Arxiv 2024] From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation
https://huanngzh.github.io/Parts2Whole/
MIT License

Question about the encoder #2

Closed BugsMaker0513 closed 5 months ago

BugsMaker0513 commented 5 months ago

Could you explain the encoder in more detail? Specifically, what does "Instead of simply piecing multiple reference images together, we pass the latent features of different parts through the appearance encoder one by one and provide a textual class label for each part." mean?

Thanks!

huanngzh commented 5 months ago

Yes, we will open source it as soon as possible (see #1).

> Instead of simply piecing multiple reference images together, we pass the latent features of different parts through the appearance encoder one by one and provide a textual class label for each part.

"Simply piecing multiple reference images together" means stitching multiple reference images into one long image and inputting it into the appearance encoder, but this will bring huge memory requirements, and pretrained unet's understanding of long images is not common. Our approach is to pass each reference image through the encoder one by one, and when different reference images pass through the encoder, different text labels like "hair", "face", etc. will be injected to give corresponding semantic effects.

Hope this resolves your doubts.

BugsMaker0513 commented 5 months ago

> Yes, we will open source it as soon as possible (see #1).
>
> Instead of simply piecing multiple reference images together, we pass the latent features of different parts through the appearance encoder one by one and provide a textual class label for each part.
>
> "Simply piecing multiple reference images together" means stitching multiple reference images into one long image and feeding it to the appearance encoder. But this brings huge memory requirements, and a pretrained UNet does not typically handle such long images well. Our approach instead passes each reference image through the encoder one by one; as each reference image passes through, a different text label such as "hair" or "face" is injected to give the corresponding semantic effect.
>
> Hope this resolves your doubts.

Is this the training-stage procedure? If so, how exactly does inference work?

BugsMaker0513 commented 5 months ago

> Yes, we will open source it as soon as possible (see #1).
>
> Instead of simply piecing multiple reference images together, we pass the latent features of different parts through the appearance encoder one by one and provide a textual class label for each part.
>
> "Simply piecing multiple reference images together" means stitching multiple reference images into one long image and feeding it to the appearance encoder. But this brings huge memory requirements, and a pretrained UNet does not typically handle such long images well. Our approach instead passes each reference image through the encoder one by one; as each reference image passes through, a different text label such as "hair" or "face" is injected to give the corresponding semantic effect.
>
> Hope this resolves your doubts.

My main confusion is about this for loop: https://github.com/huanngzh/Parts2Whole/blob/main/parts2whole/pipelines/pipeline_refs2image.py#L1117. Keys and values are fed into self.reference_unet multiple times, but nothing seems to be retrieved from self.reference_unet afterwards. Does self.reference_unet automatically store some information in the bank of reference_control_writer?

huanngzh commented 5 months ago

Right! The features of each reference image are appended to the bank.
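
A minimal sketch of this write/read bank pattern (the names below are hypothetical, not the repository's actual classes): the reference UNet's pass only stores features, returning nothing useful to the caller, and the denoising UNet later attends over the banked features:

```python
# Hypothetical sketch of the reference "bank" mechanism: a write pass appends
# hidden states to a bank; a read pass attends over current states + the bank.
import torch

class BankedAttention(torch.nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.bank = []

    def write(self, hidden_states):
        # "Write" mode (reference UNet pass): store features; nothing is returned.
        self.bank.append(hidden_states)

    def read(self, hidden_states):
        # "Read" mode (denoising pass): keys/values include all banked features.
        context = torch.cat([hidden_states] + self.bank, dim=1)
        out, _ = self.attn(hidden_states, context, context)
        return out

layer = BankedAttention()
for ref in [torch.randn(1, 4, 8), torch.randn(1, 4, 8)]:  # two reference images
    layer.write(ref)                      # reference passes only fill the bank
out = layer.read(torch.randn(1, 4, 8))    # denoising pass reads the bank
print(out.shape)  # torch.Size([1, 4, 8])
```

This matches why the loop in the pipeline appears to "get nothing back" from self.reference_unet: the useful output is the side effect of filling the bank.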

BugsMaker0513 commented 5 months ago

With this approach, how well is the facial identity of the generated images preserved?

huanngzh commented 5 months ago

In our tests, facial identity is maintained well by Parts2Whole. However, to be honest, if the face is the only condition input, LoRA is currently a relatively stable method with better generalization for maintaining face ID.

BugsMaker0513 commented 5 months ago

> In our tests, facial identity is maintained well by Parts2Whole. However, to be honest, if the face is the only condition input, LoRA is currently a relatively stable method with better generalization for maintaining face ID.

Thank you very much for the reply. You mentioned that LoRA achieves fairly good ID preservation; I have little experience with that area. Could you recommend one or two LoRA-based approaches?

huanngzh commented 5 months ago

Maybe you need it: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md

BugsMaker0513 commented 5 months ago

Is parts2whole/models/custom_attention_processor.py actually used?

huanngzh commented 5 months ago

custom_attention_processor.py defines the processors used in decoupled cross-attention, which take both image and text as input. You can see the following code in our inference code if you set use_decoupled_cross_attn to True:

```python
set_unet_2d_condition_attn_processor(
    unet,
    set_cross_attn_proc_func=lambda n, hs, cd, ori: DecoupledCrossAttnProcessor2_0(
        hidden_size=hs, cross_attention_dim=cd, max_image_length=6
    ),
)
```

BugsMaker0513 commented 5 months ago

> custom_attention_processor.py defines the processors used in decoupled cross-attention, which take both image and text as input. You can see the following code in our inference code if you set use_decoupled_cross_attn to True:
>
> ```python
> set_unet_2d_condition_attn_processor(
>     unet,
>     set_cross_attn_proc_func=lambda n, hs, cd, ori: DecoupledCrossAttnProcessor2_0(
>         hidden_size=hs, cross_attention_dim=cd, max_image_length=6
>     ),
> )
> ```

Thanks for the reply! Really looking forward to the training code release~

xujinbao282 commented 3 months ago

@huanngzh

> In our tests, facial identity is maintained well by Parts2Whole. However, to be honest, if the face is the only condition input, LoRA is currently a relatively stable method with better generalization for maintaining face ID.

Does this model support being combined with LoRA?

huanngzh commented 3 months ago

It can be used with LoRA, but due to the limited training data, generalization is lacking; we expect results with LoRA to be relatively poor. We will do further research on generalization and extensibility. Stay tuned!