@kovalexal In the paper, the decoupled cross-attention sends the text and image embeddings through separate linear layers, performs cross-attention on each, and then adds the results. In the code implementation, however, the embeddings appear to be concatenated and passed to the UNet directly, as here:
https://github.com/tencent-ailab/IP-Adapter/blob/5a18b1f3660acaf8bee8250692d6fb3548a19b14/tutorial_train.py#L118
Could you please explain this implementation detail?
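For reference, here is a minimal sketch of what I understand the paper to describe: separate K/V linear layers for text and image tokens, two independent cross-attentions against the same query, and a sum of the two outputs. All names and dimensions here are my own illustration, not the repo's actual code.

```python
import torch
import torch.nn as nn


class DecoupledCrossAttention(nn.Module):
    """Conceptual sketch of decoupled cross-attention (not the repo's code):
    text and image tokens get their own K/V projections, and the two
    attention outputs are summed."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Separate linear layers for text vs. image tokens, per the paper
        self.to_k_text = nn.Linear(dim, dim, bias=False)
        self.to_v_text = nn.Linear(dim, dim, bias=False)
        self.to_k_image = nn.Linear(dim, dim, bias=False)
        self.to_v_image = nn.Linear(dim, dim, bias=False)

    def _attend(self, q, k, v):
        # Standard scaled dot-product attention over multiple heads
        b, n, _ = q.shape
        m = k.shape[1]
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, m, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, m, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v
        return out.transpose(1, 2).reshape(b, n, -1)

    def forward(self, x, text_tokens, image_tokens):
        q = self.to_q(x)
        out_text = self._attend(
            q, self.to_k_text(text_tokens), self.to_v_text(text_tokens)
        )
        out_image = self._attend(
            q, self.to_k_image(image_tokens), self.to_v_image(image_tokens)
        )
        # Decoupled: the two cross-attention results are added together
        return out_text + out_image
```

Whereas in `tutorial_train.py` the image tokens seem to be concatenated with the text tokens into a single `encoder_hidden_states` tensor before entering the UNet, which looks like a single joint cross-attention rather than two separate ones.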