huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

extracting embeddings from a trained SetFit model. #245

Closed moonisali closed 11 months ago

moonisali commented 1 year ago

Hey, first of all, thank you for this great package!

My task relates to semantic similarity: I want to measure the 'closeness' of a query sentence to a list of candidate sentences, similar to what is shown here. Is there a way to extract embeddings from a trained SetFit model and then, instead of using the classification head, compute the similarity of a given query sentence to those embeddings?

Awaiting your answer. Thanks again!

tomaarsen commented 1 year ago

Hello!

Yes, this is possible. As you can see in this snippet, a SetFitModel instance contains a model_body as well as a model_head: https://github.com/huggingface/setfit/blob/efef17e91f56fae611c221657bcd35d5123ac9fd/src/setfit/modeling.py#L188-L202

This body is always a SentenceTransformer, exactly like the model in the link that you sent. This means that you can do the following:

from sentence_transformers import util
from setfit import SetFitModel

# Load model from the Hub
model = SetFitModel.from_pretrained(...)

# Optionally train the model
# trainer = SetFitTrainer(
#     model,
#     ...,
# )
# trainer.train()

# Copied and modified from https://www.sbert.net/docs/usage/semantic_textual_similarity.html
# Two lists of sentences
sentences1 = ['The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

# Compute embedding for both lists
embeddings1 = model.model_body.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.model_body.encode(sentences2, convert_to_tensor=True)

# Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings1, embeddings2)
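
For the original use case (one query against a list of candidates), the remaining step is to rank the candidates by their similarity score. The following is a minimal, dependency-free sketch of that ranking step, not part of SetFit or Sentence Transformers. The helper names `cosine` and `rank_candidates` are hypothetical, and the toy 3-dimensional vectors stand in for real embeddings such as the rows returned by `model.model_body.encode(...)`.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(
    query_vec: Sequence[float],
    candidate_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real sentence embeddings (assumption for illustration).
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

ranking = rank_candidates(query, candidates, top_k=2)
for index, score in ranking:
    print(f"candidate {index}: {score:.4f}")
```

With real embeddings, the same idea applies to each row of the `cosine_scores` matrix: sort a row in descending order to get the closest candidates for that query.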

tomaarsen commented 11 months ago

#439 will introduce SetFitModel.encode(...) for getting the embeddings from a SetFit model (or rather, from its finetuned Sentence Transformer body). It should be included in the upcoming release this week!

tomaarsen commented 11 months ago

Closed via #439