Basic understanding of the IP adapter during image generation

Hey everyone, I'm trying to understand the IP adapter better. Maybe someone can help me :)

Paper:

https://arxiv.org/pdf/2308.06721.pdf

Would it be right to say:

1) An IP adapter model (e.g. ip-adapter_sdxl.bin) consists of a projection network (linear layer and normalization layer) and adapted modules (with decoupled cross-attention)?

2) The modules marked in red in the image represent the function of the IP adapter model (e.g. ip-adapter_sdxl.bin) in the image generation process?

Maybe you can tell, I have no background in machine learning. I work with ComfyUI and read the paper out of interest. But linear algebra is not unknown to me if it gets mathematical :)

[Image: fig]
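To make question 1 concrete, here is a rough sketch of how I currently picture the two parts, based only on my reading of the paper. It is not the repository's code: all function and variable names are my own (hypothetical), the weights are random, there is only a single attention head, and the normalization has no learnable scale or shift. The first function turns one image embedding into a few "image tokens" (linear layer, then layer normalization). The second adds a separate attention over those image tokens to the usual text cross-attention, using the same queries.

```python
import math
import random

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-5):
    # Simplification: no learnable scale or shift.
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def project_image_embedding(image_embed, w_proj, num_tokens):
    # Linear layer: one image embedding -> num_tokens * d values,
    # split into num_tokens tokens, each normalized separately.
    flat = matmul([image_embed], w_proj)[0]
    d = len(flat) // num_tokens
    tokens = [flat[i * d:(i + 1) * d] for i in range(num_tokens)]
    return [layer_norm(t) for t in tokens]

def decoupled_cross_attention(z, text, image_tokens, w, scale=1.0):
    # Same queries for both branches; separate key/value weights for the image.
    q = matmul(z, w["q"])
    text_out = attention(q, matmul(text, w["k"]), matmul(text, w["v"]))
    image_out = attention(q, matmul(image_tokens, w["k_ip"]), matmul(image_tokens, w["v_ip"]))
    return [[t + scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, chosen only for illustration.
rng = random.Random(0)
d, d_img, num_tokens = 8, 6, 4
z = rand_matrix(5, d, rng)             # 5 latent (query) positions
text = rand_matrix(3, d, rng)          # 3 text tokens
image_embed = rand_matrix(1, d_img, rng)[0]
w_proj = rand_matrix(d_img, num_tokens * d, rng)
w = {name: rand_matrix(d, d, rng) for name in ("q", "k", "v", "k_ip", "v_ip")}

image_tokens = project_image_embedding(image_embed, w_proj, num_tokens)
out = decoupled_cross_attention(z, text, image_tokens, w, scale=1.0)
text_only = decoupled_cross_attention(z, text, image_tokens, w, scale=0.0)
```

If this picture is right, then setting `scale` to 0 gives back plain text cross-attention, and only `w_proj`, `w["k_ip"]` and `w["v_ip"]` would belong to the adapter, while `w["q"]`, `w["k"]` and `w["v"]` stay with the base model. Is that the correct way to think about it?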

tencent-ailab / IP-Adapter, issue #377