Open · lebeli opened this issue 2 years ago

Hello,

In your paper, under Appendix I, Table 3, you list different hyperparameter combinations. For the ViT-B/16 CLIP model, you vary its weight between 1.0 and 0.0. Does a weight of 0.0 mean that adaptive layer selection is turned off? If so, wouldn't different values for auto_layer_k be useless when using [1.0, 0.0] for clip_model_weights? My thinking is: if the global loss is weighted with 0, the w codes remain untouched, which means you cannot rank the importance of the corresponding layers and thus cannot select any layers at all.

In the same vein, does it make sense to use values other than 1.0 or 0.0 for clip_model_weights? I would say no, because that would effectively just be another way to influence the learning rate. Or am I missing something?

Thank you!
The entire pipeline could have been implemented using any of the available CLIP models, or a mix thereof. Setting the weight of the ViT-B/16 CLIP model to 0.0 just means that it did not contribute to any of the loss / layer selection calculations. The other CLIP model (ViT-B/32) would still be used, and you could simply rank the layers according to its output (rather than a weighted sum of the outputs of several CLIP models).
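As a minimal sketch of that weighted combination (illustrative only, not the repository's code; it assumes the images have already been resized and normalized for CLIP's input, and the model names follow the paper):

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"

# Example weighting; a 0.0 entry zeroes that model's contribution
# to the loss and to any layer ranking derived from it.
clip_model_weights = {"ViT-B/32": 1.0, "ViT-B/16": 0.0}
clip_models = {name: clip.load(name, device=device)[0].eval()
               for name in clip_model_weights}

def weighted_global_clip_loss(images, text):
    """Weighted sum of per-model global CLIP losses (1 - cosine similarity
    between the image embedding and the text embedding)."""
    tokens = clip.tokenize([text]).to(device)
    total = 0.0
    for name, model in clip_models.items():
        img = model.encode_image(images)  # images: [N, 3, 224, 224], CLIP-normalized
        txt = model.encode_text(tokens)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        total = total + clip_model_weights[name] * (1 - img @ txt.T).mean()
    return total
```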
Adaptive layer selection is effectively off in the instances where the number of selected layers equals the number of W-code inputs for the model (e.g. 18 for the 1024x1024 FFHQ model).
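A rough sketch of how such a ranking can work (again illustrative: `generator.synthesis` is a hypothetical API, the resizing of images for CLIP is elided, and `weighted_global_clip_loss` is the sketch from above):

```python
import torch

def select_layers(generator, w_codes, text, k, iters=10, lr=0.01):
    """Optimize a copy of the w codes against the global CLIP loss for a few
    steps, then rank layers by how much their entries moved. If k equals the
    number of W-code inputs (e.g. 18), every layer is selected and adaptive
    selection is effectively a no-op."""
    w = w_codes.detach().clone().requires_grad_(True)  # [N, n_layers, 512]
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        images = generator.synthesis(w)  # the generator itself stays frozen
        weighted_global_clip_loss(images, text).backward()
        opt.step()
    movement = (w.detach() - w_codes).abs().mean(dim=(0, 2))  # score per layer
    return movement.topk(k).indices  # unfreeze only these layers for training
```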
It does make sense to use values other than 1.0 and 0.0. Each CLIP model leads to different visual effects. Using models with smaller patch sizes (16, 14) leads to better identity preservation. Using larger patch sizes (32) typically leads to better representation of styles etc. You can use values between 1.0 and 0.0 to interpolate between these preferences and decide how much importance you want to place on each, as in the snippet below.
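For instance (illustrative values, not a setting from the paper):

```python
# Lean toward ViT-B/16's identity preservation while keeping some of
# ViT-B/32's sensitivity to style.
clip_model_weights = {"ViT-B/16": 0.7, "ViT-B/32": 0.3}
```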
If you're only using one CLIP model, then you are correct that you may as well just use 1.0 or 0.0 and play with the scale of the loss instead.
I see, there was a misunderstanding on my part. So for both the global and the directional loss, you use the same two CLIP models (ViT-B/32 and ViT-B/16)? And for both losses you sum the individual CLIP losses from ViT-B/32 and ViT-B/16?
Edit: One last question. Does the big CLIP model focus more on global features and the smaller one more on local features? Or what is the difference?
Every place where we use CLIP, we use the same weighted combination of the two models, yes. In practice, for many of our results (as you saw in the supp table), we set the weight of one of the models to 0, which effectively means we used just one model.
The ViT-B/32 model has a larger patch size, and it focuses less on local content and more on global things like style. The ViT-B/16 model helps somewhat when you want to improve smaller-scale attributes like shape. There's also a ViT-L/14 model, but it almost always makes the results worse :) You can add it to help improve identity preservation, but you'll probably want to give it a low weight.
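For completeness, here is the same weighted-combination idea written out for the directional loss, with a low-weight ViT-L/14 added as discussed (a sketch under the same assumptions as above, not the repository's implementation):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT-L/14 gets a deliberately low weight, per the advice above.
clip_model_weights = {"ViT-B/32": 1.0, "ViT-B/16": 0.0, "ViT-L/14": 0.1}
clip_models = {name: clip.load(name, device=device)[0].eval()
               for name in clip_model_weights}

def weighted_directional_clip_loss(src_images, gen_images, src_text, tgt_text):
    """Align the image-space edit direction (generated minus source image
    embeddings) with the text-space edit direction (target minus source
    text embeddings), summed over models with their weights."""
    src_tok = clip.tokenize([src_text]).to(device)
    tgt_tok = clip.tokenize([tgt_text]).to(device)
    total = 0.0
    for name, model in clip_models.items():
        d_img = model.encode_image(gen_images) - model.encode_image(src_images)
        d_txt = model.encode_text(tgt_tok) - model.encode_text(src_tok)
        d_img = d_img / d_img.norm(dim=-1, keepdim=True)
        d_txt = d_txt / d_txt.norm(dim=-1, keepdim=True)
        total = total + clip_model_weights[name] * (1 - (d_img * d_txt).sum(-1)).mean()
    return total
```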