ubc-vision / Prompting-Hard-Hardly-Prompting


Key Difference in Optimization Between Implementation and Paper #3

Open. Zhazhan opened this issue 3 days ago

Zhazhan commented 3 days ago

While reviewing the implementation in this repository, I noticed a critical discrepancy between the optimization method described in the paper and the code regarding the role of the LBFGS optimizer. Specifically, in the on_after_backward function located in ldm/models/diffusion/ddpm.py, the gradient descent on textual_inv_embedding is carried out based on the gradients of projected_textual_inv_embedding. It is important to highlight that the on_after_backward function is invoked prior to optimizer.step().
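For reference, my reading of that hook boils down to something like the following sketch (placeholder names and learning rate; this is not the exact code in ddpm.py):

```python
import torch

def on_after_backward_sketch(textual_inv_embedding, projected_textual_inv_embedding, lr=0.1):
    # Called after loss.backward() has filled projected_textual_inv_embedding.grad
    # and before optimizer.step(): take a plain descent step on the soft embedding
    # using the gradient that was computed for the projected (hard) embedding.
    with torch.no_grad():
        textual_inv_embedding -= lr * projected_textual_inv_embedding.grad
```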

Additionally, the optimizer's learnable parameters are set to projected_textual_inv_embedding. However, in the fast_projection method within ldm/modules/encoders/localcliptransformer.py, the value of projected_textual_inv_embedding is reset to the nearest neighbor embedding of textual_inv_embedding from the vocabulary at every forward call. This means that, throughout the optimization process, the optimization of projected_textual_inv_embedding does not seem to have any practical effect on textual_inv_embedding.
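In other words, I understand the projection step to be roughly the following (illustrative only; the distance metric and tensor shapes are my assumptions):

```python
import torch

def fast_projection_sketch(textual_inv_embedding, vocab_embeddings):
    # textual_inv_embedding: [num_tokens, dim]; vocab_embeddings: [vocab_size, dim]
    # For every soft token, return the closest vocabulary embedding
    # (Euclidean distance here purely as an assumption).
    dists = torch.cdist(textual_inv_embedding, vocab_embeddings)  # [num_tokens, vocab_size]
    nearest_ids = dists.argmin(dim=1)                             # [num_tokens]
    return vocab_embeddings[nearest_ids]                          # hard, in-vocabulary embedding
```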

Considering these observations, could we infer that simple gradient descent combined with the Straight-Through Estimator (STE) technique is sufficient to achieve desired outcomes in hard prompt optimization?
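To make the question concrete, by "simple gradient descent + STE" I mean a hypothetical update along these lines (reusing the projection sketch above; lr is a placeholder):

```python
import torch

def sgd_ste_step(textual_inv_embedding, vocab_embeddings, loss_fn, lr=0.1):
    # textual_inv_embedding must be a leaf tensor with requires_grad=True.
    # Straight-through estimator: forward with the hard (projected) embedding,
    # but route the gradient back into the soft embedding unchanged.
    projected = fast_projection_sketch(textual_inv_embedding, vocab_embeddings)
    ste_input = textual_inv_embedding + (projected - textual_inv_embedding).detach()
    loss = loss_fn(ste_input)
    loss.backward()
    with torch.no_grad():
        textual_inv_embedding -= lr * textual_inv_embedding.grad
        textual_inv_embedding.grad = None
    return loss.detach()
```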

Looking forward to your insights on this matter.

s-mahajan commented 2 days ago

Hello, thanks for the question! The code is in line with what is shown in the paper. The on_after_backward() hook gives us the gradient of projected_textual_inv_embedding, which we use to update textual_inv_embedding. The optimizer.step() that is invoked afterwards updates projected_textual_inv_embedding with projected_textual_inv_embedding.grad, and we use that same gradient to update textual_inv_embedding. Therefore, the gradients of projected_textual_inv_embedding impact the textual_inv_embedding updates at every step, and it is not a simple descent with STE.

Hope this helps with the clarification. Best,

Zhazhan commented 2 days ago

Thank you for your detailed response and for your commendable work! I may still have some misunderstandings about the entire training process, particularly regarding the updates to textual_inv_embedding, and I'd appreciate further clarification. Here's my current understanding of the training procedure (also sketched in code right after the list):

  1. Forward: The value of projected_textual_inv_embedding is set to the nearest neighbor embedding of textual_inv_embedding from the vocabulary, after which projected_textual_inv_embedding is fed into the subsequent text encoder's processing pipeline (ignoring the start-of-text and end-of-text tokens).

  2. Backward: projected_textual_inv_embedding receives its gradient.

  3. on_after_backward: Based on the gradient of projected_textual_inv_embedding, gradient descent is performed on textual_inv_embedding. Therefore, within this function, textual_inv_embedding has already been updated via gradient descent, rather than by LBFGS.

  4. optimizer.step(): The LBFGS optimizer updates projected_textual_inv_embedding. However, since textual_inv_embedding is not on the list of parameters optimized by LBFGS, this step does not update textual_inv_embedding.

  5. optimizer.zero_grad(): Clears the gradients of all variables.
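Put together, here is the per-step loop as I currently read it, as a rough sketch (placeholder names, learning rate, and loss; not the repository code):

```python
import torch

def training_step_sketch(textual_inv_embedding, projected_textual_inv_embedding,
                         vocab_embeddings, loss_fn, optimizer, lr=0.1):
    # 1. Forward: reset the hard embedding to the nearest-vocabulary projection of the
    #    soft one, then run the usual forward pass on it.
    with torch.no_grad():
        projected_textual_inv_embedding.copy_(
            fast_projection_sketch(textual_inv_embedding, vocab_embeddings))
    loss = loss_fn(projected_textual_inv_embedding)
    # 2. Backward: projected_textual_inv_embedding receives its gradient.
    loss.backward()
    # 3. on_after_backward: gradient descent on the soft embedding with that gradient.
    with torch.no_grad():
        textual_inv_embedding -= lr * projected_textual_inv_embedding.grad
    # 4. Optimizer step on projected_textual_inv_embedding (the repo uses LBFGS; here a
    #    closure-free optimizer such as SGD stands in for it, since the resulting value
    #    appears to be overwritten in the next forward anyway).
    optimizer.step()
    # 5. Clear all gradients.
    optimizer.zero_grad()
    return loss.detach()
```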

Taking into account that (1) step 3 employs gradient descent to update textual_inv_embedding, (2) step 4 does not update textual_inv_embedding, and (3) after step 5 concludes, step 1 of the next training cycle resets the value of projected_textual_inv_embedding so that the updates made by LBFGS to projected_textual_inv_embedding go unused, it ultimately seems that textual_inv_embedding is being updated by plain gradient descent.

Have I overlooked any part of the code that could have led to a misunderstanding of the training process? If there are errors in my understanding of steps 1-5, I would greatly appreciate it if you could correct me.

s-mahajan commented 1 day ago

Hi, regarding the training procedure: for point 3 above, LBFGS is used in on_after_backward to compute the gradient step. In the code this is set up when you configure the optimizer, so yes, LBFGS is being used there. The updates made using LBFGS are therefore used first to update textual_inv_embedding (using the LBFGS gradient of projected_textual_inv_embedding). Following this, projected_textual_inv_embedding is updated using the projection of textual_inv_embedding. Then we again compute the (LBFGS) gradient with respect to projected_textual_inv_embedding and update textual_inv_embedding.
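For concreteness, a very rough schematic of this alternation (not our exact Lightning hooks; the step size and loss are placeholders, and it reuses the fast_projection_sketch from your sketch above):

```python
import torch

def alternation_sketch(textual_inv_embedding, projected_textual_inv_embedding,
                       vocab_embeddings, loss_fn, lbfgs, lr=0.1, num_steps=10):
    for _ in range(num_steps):
        def closure():
            # (re)compute the loss and the gradient on the LBFGS parameter
            lbfgs.zero_grad()
            loss = loss_fn(projected_textual_inv_embedding)
            loss.backward()
            return loss

        # LBFGS step driven by the gradient of projected_textual_inv_embedding
        lbfgs.step(closure)
        with torch.no_grad():
            # the same gradient is used to update the soft embedding ...
            textual_inv_embedding -= lr * projected_textual_inv_embedding.grad
            # ... and the hard embedding is then re-set to the projection of the soft one
            projected_textual_inv_embedding.copy_(
                fast_projection_sketch(textual_inv_embedding, vocab_embeddings))
```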

Hope this helps,

Zhazhan commented 1 day ago

Got it, many thanks
