huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

FLAVA returns unpooled embeddings by mistake #26064

Closed morrisalp closed 10 months ago

morrisalp commented 1 year ago

System Info

transformers v4.33.0

Who can help?

@ArthurZucker @younesbelkada @amyeroberts

Information

Tasks

Reproduction

FLAVA models' get_text_features, get_image_features, and related functions return unpooled embeddings of shape (batch_size, n_tokens, hidden_size) rather than pooled (batch_size, hidden_size) as expected and stated in documentation. Note the bug in the source code HERE:

pooled_output = text_outputs[0]  # last_hidden_state

Actually, text_outputs has ordered keys last_hidden_state and pooler_output in that order, the former being unpooled and the latter pooled.
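To illustrate the indexing issue, here is a minimal stdlib-only mock (not the real `transformers` output class; the field names and ordering mirror `BaseModelOutputWithPooling`, where integer indexing follows declaration order):

```python
from typing import List, NamedTuple

# Mock of the ordered output container: index 0 is last_hidden_state,
# index 1 is pooler_output (mirrors transformers' BaseModelOutputWithPooling).
class TextOutputs(NamedTuple):
    last_hidden_state: List[List[List[float]]]  # (batch_size, n_tokens, hidden_size)
    pooler_output: List[List[float]]            # (batch_size, hidden_size)

outputs = TextOutputs(
    last_hidden_state=[[[0.1, 0.2], [0.3, 0.4]]],  # 1 example, 2 tokens, hidden=2
    pooler_output=[[0.5, 0.6]],                    # 1 example, hidden=2
)

# The line flagged above indexes position 0, which is the *unpooled* tensor:
pooled_output = outputs[0]
assert pooled_output is outputs.last_hidden_state

# The pooled embeddings live at index 1 / the pooler_output attribute:
assert outputs[1] is outputs.pooler_output
```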

Expected behavior

Should return pooled embeddings. Presumably this also affects other model methods that use this, such as the contrastive loss calculation.

ArthurZucker commented 12 months ago

Hey! I don't really think this is a bug. As the comment mentions (pooled_output = text_outputs[0]  # last_hidden_state), it may just be that the name of the temporary variable pooled_output is misleading, since it actually holds the last hidden states. The documentation for pooler_output states the following:

pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

When you want the text / image embeddings, you want all the token embeddings, not just the pooled one.

morrisalp commented 12 months ago

The documentation states e.g. for get_text_features:

Returns:
            text_features (`torch.FloatTensor` of shape `(batch_size, output_dim`):
(etc)

This is wrong, since the function actually returns a tensor of shape (batch_size, n_tokens, hidden_size).

When you want the text / image embeddings, you want all the token embeddings, not just the pooled one.

Not for my specific use case... Additionally, the corresponding CLIP API functions such as get_text_features do return pooled embeddings.
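For use cases like this one, a workaround is to pool the unpooled output manually. A minimal sketch with plain lists standing in for tensors (the CLS-token pooling shown here is an assumption about the desired pooling; FLAVA's actual pooler additionally passes this vector through a linear layer and a tanh, per the docs quoted above):

```python
# Unpooled output of shape (batch_size, n_tokens, hidden_size);
# plain Python lists stand in for torch tensors.
unpooled = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # example 1: 3 tokens, hidden=2
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],  # example 2: 3 tokens, hidden=2
]

# CLS-token pooling: keep the first token's vector per example,
# reducing the shape to (batch_size, hidden_size).
cls_pooled = [seq[0] for seq in unpooled]
assert cls_pooled == [[0.1, 0.2], [0.7, 0.8]]
```

With real tensors the equivalent slice would be `unpooled[:, 0, :]`, or one can read the `pooler_output` attribute of the underlying text/image model outputs directly instead of indexing position 0.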

LysandreJik commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
