rinongal / textual_inversion


FrozenCLIPEmbedder::text_encoder_forward is implemented differently from CLIPTextTransformer::forward #125

Closed: askerlee closed this issue 1 year ago

askerlee commented 1 year ago

I understand that FrozenCLIPEmbedder::text_encoder_forward() is meant to replace CLIPTextTransformer::forward() (more specifically, FrozenCLIPEmbedder.transformer.text_model.forward) by introducing an extra embedding_manager argument while keeping the rest of the logic unchanged. However, text_encoder_forward() seems to be implemented quite differently from CLIPTextTransformer::forward().
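(For readers less familiar with the pattern: below is a minimal sketch of how a forward method can be rebound on an instance so that callers can pass an extra embedding_manager argument. The class and names are stand-ins I made up for illustration, not the repo's actual code.)

import types

class TextModel:
    # Stand-in for CLIPTextTransformer (illustrative only).
    def forward(self, input_ids):
        return f"embeddings for {input_ids}"

def patched_forward(self, input_ids, embedding_manager=None):
    # Same logic as the original forward, with the extra argument threaded
    # through to wherever the token embeddings are looked up.
    out = f"embeddings for {input_ids}"
    if embedding_manager is not None:
        out += " (placeholder-token embeddings injected)"
    return out

model = TextModel()
# Rebind forward on this instance; existing callers are unaffected, but the
# textual-inversion code path can now pass embedding_manager.
model.forward = types.MethodType(patched_forward, model)
print(model.forward([49406, 320, 49407], embedding_manager=object()))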

Specifically, in Line 281,

last_hidden_state = self.encoder(
    ....
)

last_hidden_state = self.final_layer_norm(last_hidden_state)

But in Hugging Face Transformers we see https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/modeling_clip.py#L730:

encoder_outputs = self.encoder(
    ....
)

last_hidden_state = encoder_outputs[0]
last_hidden_state = self.final_layer_norm(last_hidden_state)

These two don't seem to be equivalent. Is this a potential bug? Thanks.

askerlee commented 1 year ago

Sorry, I realized that CLIPEncoder::forward() is also overridden: it returns hidden_states only, instead of the tuple (hidden_states, encoder_states, all_attentions) that Hugging Face Transformers returns. Therefore, the equivalence holds.
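To make the resolution concrete, here is a minimal, self-contained sketch (stand-in modules I wrote for illustration, not the repo's actual classes) showing that once the overridden encoder returns the hidden-states tensor directly, the two snippets above compute the same thing:

import torch
import torch.nn as nn

class HFStyleEncoder(nn.Module):
    # Returns a tuple, as Hugging Face's CLIPEncoder does (simplified here).
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)

    def forward(self, x):
        hidden = self.layer(x)
        return (hidden,)  # first element is last_hidden_state

class PatchedEncoder(nn.Module):
    # Overridden forward returns the hidden-states tensor directly.
    def __init__(self, base):
        super().__init__()
        self.layer = base.layer  # share weights so the comparison is fair

    def forward(self, x):
        return self.layer(x)

base = HFStyleEncoder()
patched = PatchedEncoder(base)
x = torch.randn(2, 4)
# encoder_outputs[0] in the HF version equals the tensor returned directly by
# the overridden version, so applying final_layer_norm to either gives the same result.
assert torch.equal(base(x)[0], patched(x))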